New Method of Spam Filtering
Alephcat writes "A simple and easily implemented scheme for combating e-mail spam has been devised by two researchers in the United States. P. Oscar Boykin and Vwani Roychowdhury of the University of California, Los Angeles use their method to exploit the structure of social networks to quickly determine whether a given message comes from a friend or a spammer. The method works for only about half of all e-mails received - but in all of those cases, it sorts the mail into the right category. The article was published on Nature magazine's website earlier today."
As I understand it, they would have to spoof someone you know. A virus could easily do that (once it has your address book), but not so much spam.
The fact that competent mail admins know how to prevent such stupidity from happening.
Ever wonder why worms use their own SMTP engine? Because those of us that are competent have one mail relay that only accepts messages from the internal domain. We prevent the worm's SMTP engine from working by having MX wildcard records to a logging box only for internal DNS - this ensures that any message sent from an internal box that gets out goes through the relay, which authenticates the user.
Viruses are a different kind of spam. They actually come from someone you know (or might know.) Regular spam has those headers forged (and getting those right would raise the cost of each message, which is good.)
According to the article, it can make a decision on 53% of the total e-mail, and divide it up into Spam or non-Spam with complete accuracy. The key is that it makes no judgement on the rest of the e-mail.
So you could throw this as a rule into SpamAssassin with a 100 weight on Spam results and a -100 weight on non-Spam results. That could only help your filtering. With zero false-positives.
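A rough sketch of that idea in Python, not an actual SpamAssassin rule: the graph test returns one of three verdicts, and only the two confident ones move the score. The +100/-100 weights come from the comment above; the `Verdict` enum, the function names, and the 5.0 threshold are assumptions made for illustration.

```python
from enum import Enum


class Verdict(Enum):
    """Three-way outcome of the (hypothetical) graph test."""
    SPAM = "spam"
    FRIEND = "friend"
    UNKNOWN = "unknown"


# +100 / -100 as suggested above; UNKNOWN contributes nothing,
# so the other rules decide the roughly half of mail the graph test skips.
GRAPH_WEIGHTS = {
    Verdict.SPAM: 100.0,
    Verdict.FRIEND: -100.0,
    Verdict.UNKNOWN: 0.0,
}

# Assumed cut-off for calling a message spam.
SPAM_THRESHOLD = 5.0


def total_score(other_rules_score, graph_verdict):
    """Add the graph test's weight to the score from all other rules."""
    return other_rules_score + GRAPH_WEIGHTS[graph_verdict]


def is_spam(other_rules_score, graph_verdict):
    """True when the combined score reaches the assumed threshold."""
    return total_score(other_rules_score, graph_verdict) >= SPAM_THRESHOLD
```

With weights that large, a confident graph verdict overrides everything else, and an UNKNOWN verdict leaves the existing filtering exactly as it was.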
There are two 'sender' fields that one is concerned with: The envelope-sender and the From: header. The latter can be spoofed as much as you like. The former cannot be spoofed in most cases, at least the host/domain part (the username can be spoofed if the server uses unauthenticated SMTP, which almost all do).
A typical message would look like this:
From spammer@baddomain.com
From: Your friend <yourfriend@gooddomain.org>
Subject: Re: your mail
Buy our crap ! Click below to be removed. Blah blah.
The first From field is the 'envelope sender' and comes entirely from the servers that have touched the mail. The rest of the fields are just a freeform part of the message, which by convention most (all?) MUAs treat in a special way to add convenient features like having the 'real name' next to your mail address in the visible From: field.
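To make the two fields concrete, here is a small sketch that pulls them apart using Python's standard `email` package. It assumes the message is stored mbox-style, with the envelope sender on a leading "From " line; the `split_senders` helper is made up for this example, and a blank line has been added between headers and body so the parser can find the boundary.

```python
from email import message_from_string
from email.utils import parseaddr

# The sample message from above, with a blank line separating headers and body.
RAW = """From spammer@baddomain.com
From: Your friend <yourfriend@gooddomain.org>
Subject: Re: your mail

Buy our crap ! Click below to be removed. Blah blah.
"""


def split_senders(raw):
    """Return (envelope_sender, display_name, header_address).

    Assumes an mbox-style leading 'From ' line holds the envelope sender;
    envelope_sender is None when that line is absent.
    """
    first_line, _, rest = raw.partition("\n")
    if first_line.startswith("From "):
        envelope = first_line[len("From "):].split()[0]
    else:
        envelope, rest = None, raw
    msg = message_from_string(rest)
    # The From: header is free-form text; parseaddr splits name and address.
    display_name, header_addr = parseaddr(msg.get("From", ""))
    return envelope, display_name, header_addr
```

For the sample above this yields the spammer's address as the envelope sender and "Your friend" plus the gooddomain.org address from the From: header, which is the part the MUA shows you.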
25% Funny, 25% Insightful, 25% Informative, 25% Troll
The actual paper that describes this technique can be found here
From what I can make out, this system graphs correspondent pairs into correspondence maps. It notes that normal people all email each other and thus have dispersed graphs (high clustering coefficient), while spammers have a distinct pattern, e.g. one person emailing a few million others (low clustering coefficient). There are figures in the article that make this point well.
The system would be ideal for implementation at a fairly high level (e.g. the ISP level), where systems can aggregate email headers across many different users in order to come up with meaningful graphs. The advantage it claims of no false positives means that it would be feasible at this level.
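For anyone who hasn't met the term, a minimal sketch of the clustering coefficient on two toy graphs. This is the textbook definition (the fraction of a node's neighbour pairs that are themselves linked), not necessarily the exact variant the paper uses; counting nodes with fewer than two neighbours as 0 is a choice made here, and all names are invented.

```python
from itertools import combinations


def make_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def local_clustering(adj, node):
    """Fraction of the node's neighbour pairs that are linked to each other."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumption: too few neighbours counts as zero
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Friends who mostly know each other.
friends = make_graph([
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),
    ("cat", "dan"), ("ann", "dan"),
])

# One sender mailing strangers who never mail each other (a star).
spam = make_graph([("spammer", f"victim{i}") for i in range(6)])
```

The friend graph averages 5/6 while the star is exactly 0, which is the gap the comment above is describing.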
I'm impressed; it looks like a very clever idea. My only question concerns how this would deal with mailing lists, which must appear to it like spam?
No.
This system can vouch for half of your email, that it's either friend or spam. This means it correctly categorizes half of the email, and leaves the other half unknown.
A random number generator could assign all of your email to friend or spam, randomly. But it wouldn't do it all correctly.
Duh.
Namely, when someone joins a spam and non-spam component of the network.
PS: This method was tested on email boxes from the "Real World", but of course, we could use more email boxes to test with. Please send me a tarball of all your email and I will tune the algorithm! :)
jabber: johnynek@jabber.org
The envelope-sender can be just as easily spoofed as the From: header. If you're sending email out through your ISP or corporate email relay, that may well check that the host (or the whole address) is correct.
If you do as most spammers do and connect directly to the receiving server, then you can feed it whatever you like in the envelope sender, and it has no way of checking whether it's genuine or not. This is what stuff like SPF can help with, but as things are currently implemented just about everywhere, the envelope-sender addresses on spam and viruses are generally forged.
Simply: untrue. It's as easy to fake the envelope sender as it is the From: header. I think you're getting confused with "Received" headers, where each mail system inserts its own bit of tracking information. The envelope-sender is completely under the control of the sender, and (usually) propagates un-modified as an email is handed between systems (indeed, one of the criticisms of SPF is that by modifying the envelope sender you break forwarding).
My next sig will be ready soon, but subscribers can beat the rush
We provide an automated graph theoretic method for identifying individual users' trusted networks of friends in cyberspace. We routinely use our social networks to judge the trustworthiness of outsiders, i.e., to decide where to buy our next car, or to find a good mechanic for it. In this work, we show that an email user may similarly use his email network, constructed solely from sender and recipient information available in the email headers, to distinguish between unsolicited commercial emails, commonly called "spam", and emails associated with his circles of friends. We exploit the properties of social networks to construct an automated anti-spam tool which processes an individual user's personal email network to simultaneously identify the user's core trusted networks of friends, as well as subnetworks generated by spams. In our empirical studies of individual mail boxes, our algorithm classified approximately 53% of all emails as spam or non-spam, with 100% accuracy. Some of the emails are left unclassified by this network analysis tool. However, one can exploit two of the following useful features. First, it requires no user intervention or supervised training; second, it results in no false negatives i.e., spam being misclassified as non-spam, or vice versa. We demonstrate that these two features suggest that our algorithm may be used as a platform for a comprehensive solution to the spam problem when used in concert with more sophisticated, but more cumbersome, content-based filters.
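A hedged sketch of the first step the abstract describes, building a personal email network from header information and splitting it into subnetworks. The abstract does not spell out the edge rule, so linking each sender to the visible To/Cc recipients and dropping the mailbox owner is an assumption, as are the function names and the example.org addresses.

```python
from collections import deque
from email import message_from_string
from email.utils import getaddresses


def header_addresses(msg, *fields):
    """Lower-cased addresses found in the named header fields."""
    values = []
    for field in fields:
        values.extend(msg.get_all(field, []))
    return {addr.lower() for _, addr in getaddresses(values) if addr}


def build_network(raw_messages, owner):
    """Undirected graph linking each sender to the other visible recipients.

    Assumption: the owner's own address is left out, so a message addressed
    only to the owner adds no edge.
    """
    owner = owner.lower()
    adj = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = header_addresses(msg, "From") - {owner}
        recipients = header_addresses(msg, "To", "Cc") - {owner}
        for s in senders:
            for r in recipients - {s}:
                adj.setdefault(s, set()).add(r)
                adj.setdefault(r, set()).add(s)
    return adj


def components(adj):
    """Connected components of the graph, found by breadth-first search."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), set()
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nxt in adj[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        result.append(comp)
    return result


# Hypothetical three-message mailbox belonging to me@example.org.
MAILBOX = [
    "From: ann@example.org\nTo: me@example.org\nCc: bob@example.org\n\nLunch?\n",
    "From: bob@example.org\nTo: me@example.org, ann@example.org\n\nSure.\n",
    "From: deals@example.net\nTo: me@example.org, x1@example.com, x2@example.com\n\nBuy now.\n",
]
```

On this toy mailbox the result is two components, one holding ann and bob and one holding the bulk sender with its other recipients. Each component could then be judged as a whole, which would fit the abstract's note that some mail stays unclassified.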
The parent is overrated.
Forging an envelope sender is trivial: "telnet mailhost 25" and break out your best SMTP rap.
This won't work because the incoming and outgoing mail servers of just about any large organization have nothing to do with each other.
In fact one of the rules I use blocks messages that claim to come from the MXes of certain large service providers because such messages are 100% spam from spammers who already thought of your idea.
We tried to implement this very method. It had very good results in drastically reducing the spam levels we were getting. Unfortunately, it also excluded small businesses and .orgs that didn't have their mail servers entered correctly, if at all, in the DNS. Although the "unclean" but legit mail servers were only about 2-3% of the total incoming mail, that was still enough "false positives" that we had to open up the fort again. :(
Until everyone jumps on the bandwagon of MX registration, this method won't work. Required SMTP auth would be nice; at least it would be a bit more traceable. As long as 1/10th of 1% of recipients reply to spam msgs, those damn spammers will think it's profitable. Spammers die!
I might know what I'm talkin' about, but then again, this is Slashdot...
It'd still be bayesian, except that word frequencies and graph connectivity of sender would _both_ be considered for additional spam probability. I don't have a filter to check, but don't most Bayesian classifiers also include other metrics besides top 20 word frequency, like length or presence of attachments, etc.?
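One way that could look, as a sketch only: treat the graph evidence as one more per-feature spam probability and fold it in with the usual naive-Bayes combination used by many word-based filters. How a graph verdict would be turned into a probability is not covered in the article summary, so the numbers in the usage note are purely hypothetical.

```python
from math import prod


def clamp(p, low=0.01, high=0.99):
    """Keep a probability away from 0 and 1 so no single feature is absolute."""
    return max(low, min(high, p))


def combine(probabilities):
    """Naive-Bayes style combination of independent per-feature spam probabilities.

    Each input is an estimate of P(spam | feature); the result is the combined
    estimate, assuming the features are independent and the prior is 50/50.
    """
    ps = [clamp(p) for p in probabilities]
    spam = prod(ps)
    ham = prod(1.0 - p for p in ps)
    return spam / (spam + ham)
```

For example, word probabilities of 0.9 and 0.8 combine to about 0.97, while adding a hypothetical graph-derived 0.05 ("sender sits in a friend cluster") to a single 0.9 word pulls the result down to about 0.32.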
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
SpamAssassin does this. They use a genetic algorithm to calculate the best weights to give all of the tests they have, where 'best' = least false positives and most accurate positives (on their 'standard' spam/ham corpus).
There is already something out there that's pretty similar to what you're suggesting. It's called Sender Policy Framework.
Basically, as part of your DNS entry, you have a record containing a list of all of the addresses that are allowed to send email on your domain's behalf. I think there was a story on Slashdot a few weeks ago about it, as AOL has started using it.
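A heavily simplified sketch of that check. Real SPF has many more mechanisms (a, mx, include, qualifiers and so on) and the record comes from a DNS TXT lookup; this version takes the record as a plain string and only understands unqualified `ip4:` terms and a trailing `-all` / `~all`. The function name and the sample record (documentation address ranges) are made up for illustration.

```python
import ipaddress


def spf_allows(record, client_ip):
    """Return True/False if the simplified record decides, else None.

    Only unqualified 'ip4:' mechanisms and '-all' / '~all' are handled;
    everything else is ignored in this sketch.
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return None  # not an SPF record, so no opinion
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        lowered = term.lower()
        if lowered.startswith("ip4:"):
            # strict=False tolerates host bits, e.g. 192.0.2.7/24
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return True
        elif lowered in ("-all", "~all"):
            return False
    return None


# Hypothetical record for a domain that sends from one subnet and one host.
SAMPLE_RECORD = "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.7 -all"
```

With the sample record, a connection from 192.0.2.55 is allowed and one from 203.0.113.9 is not; a string that is not an SPF record gives no verdict at all.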
"People that quote themselves in their signatures bother me" - athakur999
Some systems do this, but any sensible system will not reject solely on this basis because it breaks delivery of some legitimate messages. In particular, nowhere does it say that mail "from" a particular domain has to emanate from a particular host (there's no analogue to MX for *sending* hosts). That's what SPF and similar techniques are trying to impose - registered "senders" for a particular domain.
I send you and your sister a spam. While both of you are getting the spam, to both of you I am an unknown and therefore the system would flag me. ONLY if I send the spam to you while pretending to be your sister would the system break. I would need to know both your email and the email of someone you know. This would not be impossible to harvest with viruses stealing address books, but it is not what is currently happening. Currently the email address lists used by spammers are very simple flat text files. Of course nothing complex would be needed: simply a similar text file, but now with two emails per line. The first is the recipient, the second the person to forge as the sender. Simple, but more work.
So it looks like a pretty clever idea, especially for workplace email, where most mail is from people you know and very little arrives from outside. And even when outside mail does arrive, it is usually from a known domain, namely a client or supplier.
Will it work? Who knows. Gotta be worth a try. Unless you want to wait for Bill Gates to fix it. We all know how well the security problems in windows were fixed eh?
There is not going to be a magic bullet that fixes spam. We will just have to use a lot of ordinary lead ones. Don't worry Bush says they are safe.
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
FOAF is an open XML/RDF standard for describing these social networks; it seems like that would be a good way to implement this. Plus, since it uses SHA1 sums of email addresses, it would be possible to check addresses without giving them up to spammers.
A lot of sites like Tribe.net and my own project SongBuddy are working on integrating FOAF into the site, so that you won't have to worry about the mechanics of it unless you want to. Seems like an easy way to build these kinds of whitelists.
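A small sketch of how such a hashed whitelist check might work. FOAF's `mbox_sha1sum` property is the SHA1 of the mailbox URI (the address with a `mailto:` prefix); trimming and lower-casing the address before hashing is an assumption made here so that matching is case-insensitive, and the helper names and addresses are invented.

```python
import hashlib


def mbox_sha1sum(address):
    """SHA1 hex digest of the mailto: URI for an address.

    Assumption: the address is trimmed and lower-cased before hashing.
    """
    uri = "mailto:" + address.strip().lower()
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


def is_known_friend(sender, whitelist):
    """Check a sender against a set of published hashes."""
    return mbox_sha1sum(sender) in whitelist


# Hypothetical hashes collected from friends' FOAF files; the plain
# addresses never need to be published.
published = {
    mbox_sha1sum("alice@example.org"),
    mbox_sha1sum("bob@example.org"),
}
```

Anyone reading the published set sees only 40-character digests, but a filter holding an incoming sender address can still test it for membership.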
Why can't I moderate something "Wrong" or at least "Grossly Misinformed"?
No, what they meant was that 50% of all email messages are sorted into "friends" or "spam" correctly... the other 50% aren't sorted into either, but rather considered "undetermined".
Another possible problem could be confirmation emails when you sign up for a mailing list or message board or something. These would be even more difficult to tell from spam than newsletters. Also, you have no way of knowing the email address a confirmation will come from in order to add it to a whitelist.
"It is not how things are in the world that is mystical, but that it exists." -Ludwig Wittgenstein