New Method of Spam Filtering


Posted by CmdrTaco on Thursday February 19, 2004 @06:32AM from the something-to-read dept.

Alephcat writes "A simple and easily implemented scheme for combating e-mail spam has been devised by two researchers in the United States. P. Oscar Boykin and Vwani Roychowdhury of the University of California, Los Angeles use their method to exploit the structure of social networks to quickly determine whether a given message comes from a friend or a spammer. The method works for only about half of all e-mails received - but in all of those cases, it sorts the mail into the right category. The article was published on Nature magazine's website earlier today."

Re:Easily spoofed? by cavebear42 · 2004-02-19 06:39 · Score: 4, Informative

As I understand it, they would have to spoof someone you know. A virus could easily do that (once it has your address book), but it is much harder for spam.
So it's just a very good rule, how is that bad? by Smack · 2004-02-19 06:46 · Score: 5, Informative

According to the article, it can make a decision on 53% of the total e-mail, and divide it up into Spam or non-Spam with complete accuracy. The key is that it makes no judgement on the rest of the e-mail.

So you could throw this as a rule into SpamAssassin with a 100 weight on Spam results and a -100 weight on non-Spam results. That could only help your filtering. With zero false-positives.
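The weighting idea above can be sketched as follows. This is a minimal illustration in Python, not actual SpamAssassin configuration: the function names, the three-valued verdict, and the threshold of 5.0 are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical weights, taken from the +100 / -100 suggestion above.
NETWORK_SPAM_WEIGHT = 100.0
NETWORK_HAM_WEIGHT = -100.0


def combined_score(content_score: float, network_verdict: Optional[bool]) -> float:
    """Add the network rule's weight to a content-based score.

    network_verdict is True for spam, False for non-spam, and None when
    the network rule makes no judgement (roughly half of all mail).
    """
    if network_verdict is None:
        # No decision: the other rules decide on their own.
        return content_score
    weight = NETWORK_SPAM_WEIGHT if network_verdict else NETWORK_HAM_WEIGHT
    return content_score + weight


def is_spam(content_score: float, network_verdict: Optional[bool],
            threshold: float = 5.0) -> bool:
    """Classify using an assumed score threshold."""
    return combined_score(content_score, network_verdict) >= threshold
```

Because the rule stays silent when it has no verdict, it can only override the content score on mail it has actually classified.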
Re:Easily spoofed? by FauxPasIII · 2004-02-19 06:47 · Score: 5, Informative

There are two 'sender' fields that one is concerned with: The envelope-sender and the From: header. The latter can be spoofed as much as you like. The former cannot be spoofed in most cases, at least the host/domain part (the username can be spoofed if the server uses unauthenticated SMTP, which almost all do).

A typical message would look like this:
From spammer@baddomain.com
From: Your friend <yourfriend@gooddomain.org>
Subject: Re: your mail

Buy our crap ! Click below to be removed. Blah blah.

The first From field is the 'envelope sender' and comes entirely from the servers that have touched the mail. The rest of the fields are just a freeform part of the message, which by convention most (all?) MUA's treat in a special way to add convenient features like having the 'real name' next to your mail address in the visible From: field.

--
25% Funny, 25% Insightful, 25% Informative, 25% Troll
Link to the Research Paper by Nepre · 2004-02-19 06:49 · Score: 4, Informative

The actual paper that describes this technique can be found here
How it works - clustering coefficients by blorg · 2004-02-19 06:57 · Score: 5, Informative

You can read an abstract, and download the full (i.e. original) article, here in a variety of formats.
From what I can make out, this system graphs correspondent pairs into correspondence maps. It notes that normal people all email each other and thus have dispersed graphs (high clustering coefficient), while spammers have a distinct pattern, e.g. one person emailing a few million others (low clustering coefficient). There are figures in the article that make this point well.
The system would be ideal for implementation at a fairly high level (e.g. the ISP level), where systems can aggregate email headers across many different users in order to come up with meaningful graphs. The advantage it claims, no false positives, means that it would be feasible at this level.
I'm impressed; it looks like a very clever idea. My only question concerns how this would deal with mailing lists, which must appear to it like spam?
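To make the clustering-coefficient contrast concrete, here is a small Python sketch. It uses the standard local clustering coefficient (the fraction of a node's neighbour pairs that are themselves linked); the paper's exact statistic and thresholds may differ, and the addresses and function names are made up for the example.

```python
from itertools import combinations


def build_graph(pairs):
    """Build an undirected graph (address -> set of neighbours) from
    (sender, recipient) pairs taken from mail headers."""
    graph = {}
    for a, b in pairs:
        if a == b:
            continue  # ignore mail sent to oneself
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def clustering_coefficient(graph, node):
    """Fraction of the node's neighbour pairs that are linked to each other."""
    neighbours = graph.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to check
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


# Friends who all mail each other form a triangle.
friends = build_graph([("alice", "bob"), ("alice", "carol"), ("bob", "carol")])
# A spammer mails many strangers who never mail each other (a star).
spam = build_graph([("spammer", "victim%d" % i) for i in range(5)])
```

In this toy data, `clustering_coefficient(friends, "alice")` is 1.0 and `clustering_coefficient(spam, "spammer")` is 0.0, which is the pattern described above.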
Re:Easily spoofed? by mlefevre · 2004-02-19 07:05 · Score: 5, Informative

The envelope-sender can be just as easily spoofed as the From: header. If you're sending email out through your ISP or corporate email relay, that may well check that the host (or the whole address) is correct.

If you do as most spammers do and connect directly to the receiving server, then you can feed it whatever you like in the envelope sender, and it has no way of checking whether it's genuine or not. This is what stuff like SPF can help with, but as things are currently implemented just about everywhere, the envelope-sender addresses on spam and viruses are generally forged.
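A short Python sketch of why this works: the envelope sender is simply whatever the connecting client puts in its `MAIL FROM` command, independent of the `From:` header inside the message. Nothing is sent over the network here; the helper name and all addresses are invented for illustration.

```python
from email.message import EmailMessage


def build_smtp_commands(envelope_sender, recipient):
    """Return the opening SMTP commands a client would send.

    The envelope sender is supplied by the client; nothing in this
    exchange lets the receiving server verify it.
    """
    return [
        "HELO client.example",
        "MAIL FROM:<%s>" % envelope_sender,
        "RCPT TO:<%s>" % recipient,
        "DATA",
    ]


# The message headers are chosen just as freely as the envelope.
msg = EmailMessage()
msg["From"] = "Your friend <yourfriend@gooddomain.org>"
msg["To"] = "you@example.net"
msg["Subject"] = "Re: your mail"
msg.set_content("Blah blah.")

commands = build_smtp_commands("anything@forged.example", "you@example.net")
```

The envelope address and the header address need not match each other, and neither needs to be real.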
Erm, not by Vainglorious+Coward · 2004-02-19 07:06 · Score: 5, Informative

The [envelope-sender] cannot be spoofed in most cases
Simply: untrue. It's as easy to fake the envelope sender as it is the From: header. I think you're getting confused with "Received" headers, where each mail system inserts its own bit of tracking information. The envelope-sender is completely under the control of the sender, and (usually) propagates unmodified as an email is handed between systems (indeed, one of the criticisms of SPF is that by modifying the envelope sender you break forwarding).

--
My next sig will be ready soon, but subscribers can beat the rush
Sorry: that link is the full pdf, here's abstract by blorg · 2004-02-19 07:06 · Score: 4, Informative

Sorry, that is a link to the entire PDF of the article. This is the abstract, which you may as well have here if I'm posting again (on the linked page, you also have other formats available, as well as mirrors):
We provide an automated graph theoretic method for identifying individual users' trusted networks of friends in cyberspace. We routinely use our social networks to judge the trustworthiness of outsiders, i.e., to decide where to buy our next car, or to find a good mechanic for it. In this work, we show that an email user may similarly use his email network, constructed solely from sender and recipient information available in the email headers, to distinguish between unsolicited commercial emails, commonly called "spam", and emails associated with his circles of friends. We exploit the properties of social networks to construct an automated anti-spam tool which processes an individual user's personal email network to simultaneously identify the user's core trusted networks of friends, as well as subnetworks generated by spams. In our empirical studies of individual mail boxes, our algorithm classified approximately 53% of all emails as spam or non-spam, with 100% accuracy. Some of the emails are left unclassified by this network analysis tool. However, one can exploit two of the following useful features. First, it requires no user intervention or supervised training; second, it results in no false negatives i.e., spam being misclassified as non-spam, or vice versa. We demonstrate that these two features suggest that our algorithm may be used as a platform for a comprehensive solution to the spam problem when used in concert with more sophisticated, but more cumbersome, content-based filters.
Re:Easily spoofed? by Vainglorious+Coward · 2004-02-19 07:48 · Score: 4, Informative

Isn't it typical for the receiver to reverse-lookup the sender's IP, or at least forward-lookup whatever you hand it in the HELO to make sure you're legit ?
Some systems do this, but any sensible system will not reject solely on this basis because it breaks delivery of some legitimate messages. In particular, nowhere does it say that mail "from" a particular domain has to emanate from a particular host (there's no analogue to MX for *sending* hosts). That's what SPF and similar techniques are trying to impose - registered "senders" for a particular domain.
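The "registered senders" idea can be sketched in Python as below. This is a toy model only: real SPF publishes its policy in DNS TXT records and has a richer syntax, whereas this example uses a hard-coded, hypothetical registry and made-up addresses.

```python
import ipaddress

# Hypothetical registry of hosts allowed to send for a domain.
# Real SPF looks this up in DNS instead of a local table.
REGISTERED_SENDERS = {
    "gooddomain.org": ["192.0.2.0/24"],
}


def sender_permitted(envelope_sender, client_ip):
    """Check the connecting IP against the domain's registered senders.

    Returns True or False when the domain has an entry, and None when
    it publishes nothing, in which case no conclusion can be drawn.
    """
    domain = envelope_sender.rsplit("@", 1)[-1].lower()
    networks = REGISTERED_SENDERS.get(domain)
    if networks is None:
        return None
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(net) for net in networks)
```

Returning None for unregistered domains reflects the point above: without a published policy, a mismatch between domain and sending host is not grounds for rejection.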

--
My next sig will be ready soon, but subscribers can beat the rush