Slashdot Mirror


New Method of Spam Filtering

Alephcat writes "A simple and easily implemented scheme for combating e-mail spam has been devised by two researchers in the United States. P. Oscar Boykin and Vwani Roychowdhury of the University of California, Los Angeles use their method to exploit the structure of social networks to quickly determine whether a given message comes from a friend or a spammer. The method works for only about half of all e-mails received - but in all of those cases, it sorts the mail into the right category. The article was published on Nature magazine's website earlier today."

34 of 326 comments

  1. Everytime you filter spam... by Anonymous Coward · · Score: 5, Funny

    You take food away from a spammer and his children. Don't block spam, or else you hate children. You don't hate children... do you?

  2. Vwani Roychowdhury by Anonymous Coward · · Score: 5, Funny

    He was probably sick of people like me mistaking his name for a made up spam "from" line.

  3. Interesting by jchawk · · Score: 5, Interesting

    It would be interesting if Google could find a way for this idea to work with Orkut.com, since users of this service are typically connected to many other people who are not spammers. :-)

  4. Easily spoofed? by Sam+Ruby · · Score: 5, Insightful

    What's to stop the From:, To:, and Cc: fields from being spoofed (like a lot of viruses do)?

    --
    - Sam Ruby
    1. Re:Easily spoofed? by cavebear42 · · Score: 4, Informative

      As I understand it, they would have to spoof someone you know. A virus could easily do that (once it has your address book), but spam not so much.

    2. Re:Easily spoofed? by FauxPasIII · · Score: 5, Informative

      There are two 'sender' fields that one is concerned with: The envelope-sender and the From: header. The latter can be spoofed as much as you like. The former cannot be spoofed in most cases, at least the host/domain part (the username can be spoofed if the server uses unauthenticated SMTP, which almost all do).

      A typical message would look like this:

      From spammer@baddomain.com
      From: Your friend <yourfriend@gooddomain.org>
      Subject: Re: your mail

      Buy our crap ! Click below to be removed. Blah blah.


      The first From field is the 'envelope sender' and comes entirely from the servers that have touched the mail. The rest of the fields are just a freeform part of the message, which by convention most (all?) MUAs treat in a special way to add convenient features like having the 'real name' next to your mail address in the visible From: field.
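
      As a rough illustration of the two sender fields, here is a minimal Python sketch that reads the sample message above with the standard library's email package. The helper name sender_fields is made up for this example. The leading "From " line is how mbox-style storage records the envelope sender (in SMTP itself it is the MAIL FROM argument). A mismatch between the two addresses is only a hint, since mailing lists differ legitimately, and other replies in this thread dispute how trustworthy the envelope sender is.

```python
from email import message_from_string
from email.utils import parseaddr

# The sample message from the comment, in mbox style: the first line is the
# envelope ("From ") line, the rest are ordinary headers and the body.
RAW_MESSAGE = "\n".join([
    "From spammer@baddomain.com",
    "From: Your friend <yourfriend@gooddomain.org>",
    "Subject: Re: your mail",
    "",
    "Buy our crap ! Click below to be removed. Blah blah.",
    "",
])


def sender_fields(raw):
    """Return (envelope_address, display_name, header_address).

    Hypothetical helper for illustration only.
    """
    msg = message_from_string(raw)
    # get_unixfrom() returns the whole "From ..." line, or None if absent.
    parts = (msg.get_unixfrom() or "").split()
    envelope = parts[1] if len(parts) > 1 else None
    # parseaddr splits 'Real Name <addr>' into its two halves.
    display_name, header_addr = parseaddr(msg.get("From", ""))
    return envelope, display_name, header_addr


envelope, display_name, header_addr = sender_fields(RAW_MESSAGE)
mismatch = envelope != header_addr
print(envelope, display_name, header_addr, mismatch)
```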

      --
      25% Funny, 25% Insightful, 25% Informative, 25% Troll
    3. Re:Easily spoofed? by mlefevre · · Score: 5, Informative

      The envelope-sender can be just as easily spoofed as the From: header. If you're sending email out through your ISP or corporate email relay, that may well check that the host (or the whole address) is correct.

      If you do as most spammers do and connect directly to the receiving server, then you can feed it whatever you like in the envelope sender, and it has no way of checking whether it's genuine or not. This is what stuff like SPF can help with, but as things are currently implemented just about everywhere, the envelope-sender addresses on spam and viruses are generally forged.

    4. Re:Easily spoofed? by Vainglorious+Coward · · Score: 4, Informative
      Isn't it typical for the receiver to reverse-lookup the sender's IP, or at least forward-lookup whatever you hand it in the HELO to make sure you're legit ?

      Some systems do this, but any sensible system will not reject solely on this basis because it breaks delivery of some legitimate messages. In particular, nowhere does it say that mail "from" a particular domain has to emanate from a particular host (there's no analogue to MX for *sending* hosts). That's what SPF and similar techniques are trying to impose - registered "senders" for a particular domain.

      --
      My next sig will be ready soon, but subscribers can beat the rush
  5. Volume by enderanjin · · Score: 4, Interesting

    If the filters are effective against only half of the emails, what is preventing spammers from doubling their load in order to control the same amount of spam getting to your inbox as they do now?

    --
    Anything in parenthesis may (not) be ignored.
  6. huh? by wankledot · · Score: 4, Interesting
    It only works for half... but it works great on that half!!! How is that a good filter at all?

    Of course one huge downside to this "friend of friends" approach is all the virus spam I get that's sent using someone's address book (thanks Outlook!) Guess what... all those addresses are probably whitelisted because it came from someone I "know."

    --
    My sig is blank, I typed this by hand.
    1. Re:huh? by CeleronXL · · Score: 5, Interesting

      Well you can run mail through a system like that first, pulling out the mail that is definitely not spam and shuffling it away to the Inbox. Then run it through a different kind of spam system, such as a system like SpamBayes, and you cut it down even more.

      On its own it doesn't sound like it works well, but you can couple it with already-existing systems to boost accuracy.
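
      The chaining described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: network_filter and content_filter are hypothetical stand-ins (a lookup in two hard-coded sets and a crude keyword test), not the researchers' method and not SpamBayes. The point is the control flow: each stage may abstain by returning None, and the first stage with an opinion decides.

```python
from typing import Callable, Dict, Iterable, Optional

Message = Dict[str, str]
# A filter returns True (spam), False (not spam) or None (no opinion).
Filter = Callable[[Message], Optional[bool]]

# Hypothetical data standing in for the output of a social-network analysis.
KNOWN_FRIENDS = {"alice@example.org"}
KNOWN_SPAMMERS = {"bulk@example.net"}


def network_filter(msg: Message) -> Optional[bool]:
    """First pass: decides only when the sender is already classified."""
    if msg["sender"] in KNOWN_FRIENDS:
        return False
    if msg["sender"] in KNOWN_SPAMMERS:
        return True
    return None  # abstain, let the next stage look at it


def content_filter(msg: Message) -> Optional[bool]:
    """Second pass: a toy keyword test standing in for a Bayesian filter."""
    return "click below" in msg["body"].lower()


def classify(msg: Message, filters: Iterable[Filter]) -> Optional[bool]:
    """Run the filters in order and return the first decisive verdict."""
    for current in filters:
        verdict = current(msg)
        if verdict is not None:
            return verdict
    return None


PIPELINE = [network_filter, content_filter]
print(classify({"sender": "alice@example.org", "body": "Click below!"}, PIPELINE))
```

      Because the first stage short-circuits, mail it can decide never reaches the slower content filter.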

    2. Re:huh? by nick_davison · · Score: 4, Funny

      Hey, don't knock a filter that can correctly sort mail into two piles fifty percent of the time. CoinToss 1.0 has been a real innovation!

  7. Cleaning up the gene pool by Anonymous Coward · · Score: 5, Funny

    Spammers suck, right? And their children have obviously inherited the spamming gene. So, by starving the children to death, we're preventing the spam gene from spreading. It may sound wrong, but we're actually helping society.

  8. Bugger Off! by ackthpt · · Score: 5, Interesting
    You take food away from a spammer and his children. Don't block spam, or else you hate children. You don't hate children... do you?

    You know darn well that this will only increase employment in the Spam Technology sector and is a good thing.

    Seriously, spammers are often a step ahead, and lately a lot of spam I'm getting is masked to look like Amazon orders or closed eBay auctions. I haven't ordered anything from Amazon (USA) in ages, but I still have to peek to see if someone has cracked my account and ordered something. Just expect the harder they are pressed, the harder spammers will press back by sinking to new lows.

    --

    A feeling of having made the same mistake before: Deja Foobar
  9. Good idea by Schezar · · Score: 5, Interesting

    After reading this, I realized that a good 90% of the email I receive is either from someone I've had previous contact with, or else someone 1 or at most 2 degrees of separation from one of those people. I never get mail worth reading from total strangers. Anything important is always linked back to me in some way.

    It should be interesting to see how this method plays out. (Now, I don't know why I even bothered with that last sentence. Everyone says that about every new spam-filtery thing. ((Don't know why I bothered with that last sentence either. Work is slow today I suppose.)) )

    --
    GeekNights!
    Late Night Radio for Geeks!
  10. A two tier system? by erick99 · · Score: 4, Interesting
    I suppose you could use this as a first pass and let those go directly to the "recycle bin" or whatever deletes mail (if you really can be confident that they are all spam). Then, the balance of your email could go through whatever antispam system you use. Right now I get over 100 spam emails a day. These go into a folder and are sorted by sender so that I can quickly scan through for any "friendly" emails. It would be nice to cut down the amount that has to be manually scanned by a half. Either way, this sounds like it's going in the right direction - towards a system that is close to 100% effective (if that is truly possible).

    Happy Trails!

    Erick

    --
    http://www.busyweather.com/
  11. Spam filtering by eclectro · · Score: 5, Funny


    If it doesn't use bullets, I don't want to hear about it.

    --
    Take the cheese to sickbay, the doctor should see it as soon as possible - B'Elanna Torres, "Learning Curve"
  12. I don't always like my friends' friends by Clemence · · Score: 5, Funny

    Can't stop the friend-of-a-friend idiot who hits "reply to all."

    It might not be "spam" but I filter it now. I'll stick with my procmail filters.

  13. Re:Sounds interesting... by rjelks · · Score: 4, Insightful

    I would agree with that in terms of personal email accounts, but for a business, new contacts are pretty important. Most companies would hope a lot of real email was from new sources.

    -

  14. Heading the wrong way by Muddie · · Score: 5, Interesting

    This sounds like the whole "Friends and Family" network from AT&T a few years ago, and now Verizon's "In" network thing, but with email and exclusive instead of "Free calls to friends on 'the list'".

    Pretty soon, you will have to send an MD5 hash of your DNA from a static IP address that is reversible and supply 5 references all in a PGP encrypted letter, along with a copy of your passport and birth certificate.

    When it's more work to block spam than stop it, you have to ask what is going wrong. Maybe if we somehow figured out wonderful technologies to *stop* spammers instead of blocking them, we'd be getting towards the ultimate goal. This is much like throwing money at a problem to bandage it, not fix it. The solution, however, also has to be easier for end users, who are doing nothing wrong. Why is every solution harder for end users, but just a 'bump in the road' for spammers? Am I missing something?

  15. (OT sig response) by jridley · · Score: 4, Funny

    Member of the Stop Fucking Saying 'M$' army

    Right, from now on, it's "micros~1" for me.

  16. Spammers already defeat this (partially) by xleeko · · Score: 5, Interesting
    Spammers already sort addresses by site in order to take advantage of this effect. They forge the from address as someone else from your site on the theory that you know them and would whitelist them.

    In fact, this has provided me with a kind of "honeypot", since I now check for the addresses of several people who are long gone from my site. If I see their address it's gotta be spam!
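
    A check like this takes only a few lines. The sketch below is a guess at the idea, not Dave's actual setup: the address in RETIRED_ADDRESSES and the function name hits_honeypot are invented for illustration.

```python
from email.utils import parseaddr

# Hypothetical list of addresses belonging to people who have left the site.
RETIRED_ADDRESSES = frozenset({"former.colleague@example.com"})


def hits_honeypot(from_header, retired=RETIRED_ADDRESSES):
    """Return True if the From: header claims to come from a retired address."""
    _, address = parseaddr(from_header)
    return address.lower() in retired


print(hits_honeypot("Former Colleague <Former.Colleague@example.com>"))
```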

    - Dave

  17. So it's just a very good rule, how is that bad? by Smack · · Score: 5, Informative

    According to the article, it can make a decision on 53% of the total e-mail, and divide it up into Spam or non-Spam with complete accuracy. The key is that it makes no judgement on the rest of the e-mail.

    So you could throw this as a rule into SpamAssassin with a 100 weight on Spam results and a -100 weight on non-Spam results. That could only help your filtering. With zero false-positives.
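
    To make the arithmetic concrete, here is a small Python sketch of that weighting. It does not use SpamAssassin's real rule syntax or API; Verdict, combined_score and is_spam are hypothetical names, and the threshold of 5.0 is assumed to mirror SpamAssassin's usual default required score.

```python
from enum import Enum


class Verdict(Enum):
    """Three-way result of the network test (hypothetical type)."""
    SPAM = "spam"
    NOT_SPAM = "not spam"
    UNDECIDED = "undecided"


NETWORK_WEIGHT = 100.0  # the +/-100 weight suggested in the comment
SPAM_THRESHOLD = 5.0    # assumed threshold, modelled on SpamAssassin's default


def combined_score(verdict, content_score):
    """Shift the content-based score by the network verdict, if there is one."""
    if verdict is Verdict.SPAM:
        return content_score + NETWORK_WEIGHT
    if verdict is Verdict.NOT_SPAM:
        return content_score - NETWORK_WEIGHT
    return content_score  # undecided mail is scored on content alone


def is_spam(verdict, content_score, threshold=SPAM_THRESHOLD):
    return combined_score(verdict, content_score) >= threshold


# A friend's message that trips many content rules still gets through.
print(is_spam(Verdict.NOT_SPAM, 40.0))
```

    With weights this large the network verdict overrides any realistic content score, while mail the network test cannot classify is judged exactly as before.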

    1. Re:So it's just a very good rule, how is that bad? by GooberToo · · Score: 4, Interesting

      Or simply not process the 53% with other spam detection software, which saves on CPU! In other words, make this the first anti-spam process, whereby half of your email gets to skip SpamAssassin (or whatever). The other 50%, you process as usual.

  18. This method will ruin a cool part of the net by The+Wing+Lover · · Score: 5, Insightful

    Used to be that one of the cool things about the net was that you would get email from total strangers... "Hi, I'm from {some far away place}. I saw your {Usenet post|web page|profile on some bulletin board site} and really liked your ideas about {something}. I've also been experimenting with {something} and I have some ideas about {whatever}..."

    Now, if we only have emails from our (already existing) friends or friends of friends, then how will we ever meet anybody new?

    --

    - In Capitalist America, law violates YOU!

  19. Link to the Research Paper by Nepre · · Score: 4, Informative

    The actual paper that describes this technique can be found here

  20. How it works - clustering coefficients by blorg · · Score: 5, Informative
    You can read an abstract, and download the full (i.e. original) article here in a variety of formats.

    From what I can make out, this system graphs correspondent pairs into correspondence maps, and notes that while normal people all email each other and thus have dispersed graphs (high clustering coefficient), spammers have a distinct pattern, e.g. one person emailing a few million others (low clustering coefficient). There are figures in the article that make this point well.
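
    For readers unfamiliar with the measure, the following Python sketch computes a clustering coefficient for two toy graphs. It is not the authors' algorithm (the paper should be consulted for how components are formed and which thresholds are used); it only shows why a group of friends scores high and a one-to-many sender scores low. All names and addresses are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(pairs):
    """Build an undirected graph from (sender, recipient) pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def local_clustering(graph, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to check
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph):
    """Mean of the local clustering coefficients over all nodes."""
    if not graph:
        return 0.0
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


# Three friends who all write to each other form a triangle.
friends = build_graph([("alice", "bob"), ("bob", "carol"), ("carol", "alice")])
# One sender mailing four strangers forms a star with no links between leaves.
spam = build_graph([("spammer", "victim%d" % i) for i in range(4)])

print(average_clustering(friends), average_clustering(spam))
```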

    The system would be ideal for implementation at a fairly high level, (e.g. the ISP level) where systems can aggregate email headers across many different users in order to come up with meaningful graphs. The advantage it claims of no false positives means that it would be feasible at this level.

    I'm impressed; it looks like a very clever idea. My only question concerns how this would deal with mailing lists, which must appear to it like spam?

    1. Re:How it works - clustering coefficients by orthogonal · · Score: 4, Insightful

      The system would be ideal for implementation at a fairly high level, (e.g. the ISP level) where systems can aggregate email headers across many different users in order to come up with meaningful graphs. The advantage it claims of no false positives means that it would be feasible at this level.

      Yeah, but I'd consider a high-level analysis of my email headers (either sent or received) to be a violation of my privacy. Whether or not I'm mailing to kinky@alterate.life.styles.com, fringe.politcal.groups.require@free.speech.too.org , unpopular.opinions@free.thinkers.net, or falun.gong@is.banned.by.my.dictator.org, it should be nobody's business but my own.

      Someone will undoubtedly argue that since headers are sent in the clear anyway, it shouldn't matter, but keeping a database of who mails what to whom only makes abuse -- by freelance busybodies or government spies and censors -- that much the easier.

      This is a case, I think, where the threat inherent in the cure is worse than the disease.

  21. Erm, not by Vainglorious+Coward · · Score: 5, Informative
    The [envelope-sender] cannot be spoofed in most cases

    Simply : untrue. It's as easy to fake the envelope sender as it is the From: header. I think you're getting confused with "Received" headers, where each mail system inserts its own bit of tracking information. The envelope-sender is completely under the control of the sender, and (usually) propagates un-modified as an email is handed between systems (indeed, one of the criticisms of SPF is that by modifying the envelope sender you break forwarding).

    --
    My next sig will be ready soon, but subscribers can beat the rush
  22. Sorry: that link is the full pdf, here's abstract by blorg · · Score: 4, Informative
    Sorry, that is a link to the entire PDF of the article. This is the abstract, which you may as well have here if I'm posting again (on the linked page, you also have other formats available, as well as mirrors):

    We provide an automated graph theoretic method for identifying individual users' trusted networks of friends in cyberspace. We routinely use our social networks to judge the trustworthiness of outsiders, i.e., to decide where to buy our next car, or to find a good mechanic for it. In this work, we show that an email user may similarly use his email network, constructed solely from sender and recipient information available in the email headers, to distinguish between unsolicited commercial emails, commonly called "spam", and emails associated with his circles of friends. We exploit the properties of social networks to construct an automated anti-spam tool which processes an individual user's personal email network to simultaneously identify the user's core trusted networks of friends, as well as subnetworks generated by spams. In our empirical studies of individual mail boxes, our algorithm classified approximately 53% of all emails as spam or non-spam, with 100% accuracy. Some of the emails are left unclassified by this network analysis tool. However, one can exploit two of the following useful features. First, it requires no user intervention or supervised training; second, it results in no false negatives i.e., spam being misclassified as non-spam, or vice versa. We demonstrate that these two features suggest that our algorithm may be used as a platform for a comprehensive solution to the spam problem when used in concert with more sophisticated, but more cumbersome, content-based filters.

  23. Mailing lists / newsletters by blorg · · Score: 4, Insightful
    A mailing list would have multiple folks in the To: line, which would be easy to spot automatically.

    Not necessarily; indeed, most professional ones avoid this, while many spams do contain multiple people in the To: field (though many also don't). One way or the other, I don't think this is relevant if we are trying to compare the graph of a mailing list to that of a spammer. To take an example, user slashdot-headlines@newsletters.osdn.com sends thousands of emails to people *who don't know each other*. User enlargeyourdong@hotmail.com has exactly the same pattern. How do you tell these apart?

  24. Some of us rely on e-mail from strangers by beagle72 · · Score: 5, Insightful

    The proposed anti-spam clustering technique is of course a variation on whitelisting. While clever, it fails to address a problem I have not often seen addressed. Many people defend themselves from spam by obscuring their e-mail addresses in public places, and perhaps by using whitelists to prefer known senders. This may be effective for many people.

    However, some of us can't avoid having a publicly available e-mail address. For example, writers such as myself rely on feedback from readers who are, in nearly all cases, strangers (and sometimes strange, but that's another story...) Avoiding false positives from strangers is very important to me. I want their messages. But, since my e-mail address is published frequently (hence no reason to hide it here), I obviously receive a ton of spam.

    For the past few months I have experimented with a plug-in called BayesIt! for the Windows email reader The Bat!. As the name implies, it's a Bayesian filter. The nice thing about BayesIt is that I could point it to my already-stuffed spam folder and train it on thousands of messages in one go. So far it has worked out rather well. No false positives, and only about 10-20 false negatives per day (out of approx. 400 spams).

    Still, in the long run I support proposals that shift the economics of e-mail in ways that have minimal impact on human beings while making spam unprofitable. Changing the economic model of spam is the only sure solution; relying solely on technology will simply keep us locked in an ongoing arms race.

    -Aaron

  25. Most newsletters are one-way by blorg · · Score: 4, Insightful
    Easy - those thousands of people who don't know each other also send email *back* to the mailing list. Only a few dummies send email back to the spammers.

    Most mailing lists and newsletters are one way - I'm not talking about discussion lists or listservs, but rather about the bot that sends me Slashdot headlines, Jakob Nielsen's Alertbox, Fred Langa's newsletter, and even commercial speech that I am signed up to and want to hear such as Komplett's weekly offers, or Ryanair's cheap flights, etc.

  26. HOW SPAMMERS CAN BEAT THIS FILTER by goombah99 · · Score: 4, Interesting

    There are three ways one can beat the filter.

    The first is trivial and certain to succeed but has a drawback for spammers: only send e-mail to single recipients. The drawback is that this puts a much higher load on their servers, since every message is sent individually.

    The second method is to always include dummy addresses in the mailing list that the recipients probably have in their address books. For example, add the following names to the To: field: notifications@paypal.com and list-notication@ebay.com.
    Any recipient of the spam message who has also received e-mail from eBay or PayPal will trust the message.

    One can do even better by planning ahead when harvesting e-mail addresses. For example, if you harvest a set of addresses from a particular bulletin board, you can make note of message cliques at the time of harvesting and send messages in the same groupings. For good measure, you also include the addresses of the bulletin board admins.

    Third, all the spammer really has to know is one correspondent you have gotten messages from. So either buy mailing lists from legitimate companies people actually do business with, or create your own loss-leader messages. For example, send out a political action alert or anything that has some value or use to most people, maybe a lottery drawing for a prize or a discount subscription to Time magazine, so they will accept the message. The sender does not have to be the same as your spammer address. Now you know someone in the address book of the victim, and you can spam the crap out of them while including the trojan address in the To: field.

    --
    Some drink at the fountain of knowledge. Others just gargle.