New Method of Spam Filtering
Alephcat writes "A simple and easily implemented scheme for combating e-mail spam has been devised by two researchers in the United States. P. Oscar Boykin and Vwani Roychowdhury of the University of California, Los Angeles use their method to exploit the structure of social networks to quickly determine whether a given message comes from a friend or a spammer. The method works for only about half of all e-mails received - but in all of those cases, it sorts the mail into the right category. The article was published on Nature magazine's website earlier today."
As I understand it, they would have to spoof someone you know. A virus could easily do that (once it has your address book), but it's much harder for spam.
According to the article, it can make a decision on 53% of the total e-mail, and divide it up into Spam or non-Spam with complete accuracy. The key is that it makes no judgement on the rest of the e-mail.
So you could throw this as a rule into SpamAssassin with a 100 weight on Spam results and a -100 weight on non-Spam results. That could only help your filtering, with zero false positives.
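A rough sketch of that weighting idea, in Python rather than real SpamAssassin rule syntax. The names `combined_score` and `is_spam` and the verdict strings are made up for illustration, and the 5.0 threshold is just an assumed cut-off; the point is only that an abstaining graph check leaves the content score untouched.

```python
from typing import Optional

# Assumed cut-off for illustration; tune it to your own filter.
SPAM_THRESHOLD = 5.0


def combined_score(content_score: float, graph_verdict: Optional[str]) -> float:
    """Fold the graph verdict into an existing content-based score.

    graph_verdict is "spam", "ham", or None when the graph method
    makes no judgement (roughly half of all mail, per the article).
    """
    if graph_verdict == "spam":
        return content_score + 100.0
    if graph_verdict == "ham":
        return content_score - 100.0
    # No verdict: fall back entirely on the content filter.
    return content_score


def is_spam(content_score: float, graph_verdict: Optional[str]) -> bool:
    return combined_score(content_score, graph_verdict) >= SPAM_THRESHOLD
```

With weights that large, the graph verdict overrides the content filter whenever it exists, so the zero-false-positive claim only holds if the graph method really is 100% accurate on the mail it classifies.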
There are two 'sender' fields that one is concerned with: The envelope-sender and the From: header. The latter can be spoofed as much as you like. The former cannot be spoofed in most cases, at least the host/domain part (the username can be spoofed if the server uses unauthenticated SMTP, which almost all do).
A typical message would look like this:
From spammer@baddomain.com
From: Your friend <yourfriend@gooddomain.org>
Subject: Re: your mail

Buy our crap ! Click below to be removed. Blah blah.
The first From field is the 'envelope sender' and comes entirely from the servers that have touched the mail. The rest of the fields are just a freeform part of the message, which by convention most (all?) MUAs treat in a special way to add convenient features like having the 'real name' next to your mail address in the visible From: field.
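To make the distinction concrete, here is a small Python sketch that pulls the two senders out of a message laid out like the one above. The helper name `split_senders` is hypothetical, and it assumes the envelope sender has been recorded as an mbox-style "From " line at the top of the message. It only shows that the two values are stored separately; it says nothing about whether either can be trusted.

```python
from email import message_from_string
from email.utils import parseaddr

RAW = (
    "From spammer@baddomain.com\n"
    "From: Your friend <yourfriend@gooddomain.org>\n"
    "Subject: Re: your mail\n"
    "\n"
    "Buy our crap ! Click below to be removed.\n"
)


def split_senders(raw):
    """Return (envelope_sender, display_name, header_address)."""
    first_line, _, rest = raw.partition("\n")
    envelope = None
    if first_line.startswith("From "):
        # mbox-style separator line: "From <address> [date]"
        parts = first_line[len("From "):].split()
        envelope = parts[0] if parts else None
    else:
        # No separator line, so the whole text is the message.
        rest = raw
    msg = message_from_string(rest)
    # The From: header is ordinary message content set by the sender.
    display_name, header_addr = parseaddr(msg.get("From", ""))
    return envelope, display_name, header_addr
```

For the sample above this gives `spammer@baddomain.com` as the envelope sender, while the MUA would show "Your friend" from the From: header.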
The actual paper that describes this technique can be found here
From what I can make out, this system graphs correspondent pairs into correspondence maps, and notes that while normal people all email each other and thus have dispersed graphs (high clustering coefficient), spammers have a distinct pattern, e.g. one person emailing a few million others (low clustering coefficient). There are figures in the article that make this point well.
The system would be ideal for implementation at a fairly high level (e.g. the ISP level), where systems can aggregate email headers across many different users in order to come up with meaningful graphs. The advantage it claims of no false positives means that it would be feasible at this level.
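For anyone who wants to play with the idea, here is a minimal Python sketch of the clustering coefficient on a correspondence graph. This is not the paper's algorithm, just the standard local clustering coefficient averaged over nodes; the function names and the toy address lists are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(pairs):
    """Undirected graph from (sender, recipient) pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def local_clustering(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph):
    if not graph:
        return 0.0
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


# Three friends who all mail each other: every neighbour pair is linked.
friends = build_graph([("ann", "bob"), ("bob", "cat"), ("ann", "cat")])

# One sender mailing strangers who never mail each other: a star.
spammer = build_graph([("spammer", "victim%d" % i) for i in range(5)])
```

The friends triangle averages 1.0 and the spammer star averages 0.0, which is the contrast described above. Real mail graphs fall somewhere in between, which is presumably why the method leaves part of the mail unclassified.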
I'm impressed; it looks like a very clever idea. My only question concerns how this would deal with mailing lists, which must appear to it like spam?
The envelope-sender can be just as easily spoofed as the From: header. If you're sending email out through your ISP or corporate email relay, that may well check that the host (or the whole address) is correct.
If you do as most spammers do and connect directly to the receiving server, then you can feed it whatever you like in the envelope sender, and it has no way of checking whether it's genuine or not. This is what stuff like SPF can help with, but as things are currently implemented just about everywhere, the envelope-sender addresses on spam and viruses are generally forged.
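As a toy illustration of the kind of check SPF is aiming at: the receiving server looks at the domain in the envelope sender and asks whether the connecting IP is one that domain has registered for sending. Real SPF publishes this in DNS; the hard-coded `AUTHORISED` table, the `check_sender` name and the example addresses below are all made up, so treat this as a sketch of the idea rather than an SPF implementation.

```python
import ipaddress

# Hypothetical stand-in for what a domain would publish in DNS.
AUTHORISED = {
    "gooddomain.org": [ipaddress.ip_network("192.0.2.0/24")],
}


def check_sender(envelope_sender, client_ip):
    """Return "pass", "fail", or "none" (domain publishes no policy)."""
    domain = envelope_sender.rpartition("@")[2].lower()
    networks = AUTHORISED.get(domain)
    if networks is None:
        return "none"
    ip = ipaddress.ip_address(client_ip)
    return "pass" if any(ip in net for net in networks) else "fail"
```

A spammer connecting directly from 203.0.113.5 and claiming to be yourfriend@gooddomain.org would get "fail" here, while any domain that has not published a policy gets "none" and stays exactly as forgeable as before.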
Simply: untrue. It's as easy to fake the envelope sender as it is the From: header. I think you're getting confused with "Received" headers, where each mail system inserts its own bit of tracking information. The envelope-sender is completely under the control of the sender, and (usually) propagates unmodified as an email is handed between systems (indeed, one of the criticisms of SPF is that by modifying the envelope sender you break forwarding).
My next sig will be ready soon, but subscribers can beat the rush
We provide an automated graph theoretic method for identifying individual users' trusted networks of friends in cyberspace. We routinely use our social networks to judge the trustworthiness of outsiders, i.e., to decide where to buy our next car, or to find a good mechanic for it. In this work, we show that an email user may similarly use his email network, constructed solely from sender and recipient information available in the email headers, to distinguish between unsolicited commercial emails, commonly called "spam", and emails associated with his circles of friends. We exploit the properties of social networks to construct an automated anti-spam tool which processes an individual user's personal email network to simultaneously identify the user's core trusted networks of friends, as well as subnetworks generated by spams. In our empirical studies of individual mail boxes, our algorithm classified approximately 53% of all emails as spam or non-spam, with 100% accuracy. Some of the emails are left unclassified by this network analysis tool. However, one can exploit two of the following useful features. First, it requires no user intervention or supervised training; second, it results in no false negatives i.e., spam being misclassified as non-spam, or vice versa. We demonstrate that these two features suggest that our algorithm may be used as a platform for a comprehensive solution to the spam problem when used in concert with more sophisticated, but more cumbersome, content-based filters.
Some systems do this, but any sensible system will not reject solely on this basis because it breaks delivery of some legitimate messages. In particular, nowhere does it say that mail "from" a particular domain has to emanate from a particular host (there's no analogue to MX for *sending* hosts). That's what SPF and similar techniques are trying to impose - registered "senders" for a particular domain.