New Method of Spam Filtering
Alephcat writes "A simple and easily implemented scheme for combating e-mail spam has been devised by two researchers in the United States. P. Oscar Boykin and Vwani Roychowdhury of the University of California, Los Angeles use their method to exploit the structure of social networks to quickly determine whether a given message comes from a friend or a spammer. The method works for only about half of all e-mails received - but in all of those cases, it sorts the mail into the right category. The article was published on Nature magazine's website earlier today."
It would be interesting if Google could find a way for this idea to work with Orkut.com, since users of this service are typically connected to many other people who are not spammers. :-)
If the filters are effective against only half of the emails, what is preventing spammers from doubling their load in order to control the same amount of spam getting to your inbox as they do now?
Anything in parenthesis may (not) be ignored.
Of course one huge downside to this "friend of friends" approach is all the virus spam I get that's sent using someone's address book (thanks Outlook!) Guess what... all those addresses are probably whitelisted because it came from someone I "know."
My sig is blank, I typed this by hand.
Isn't this somewhat similar to Thunderbird's feature of not marking mail from those in your mailing list as spam?
Doolittle :
Bomb no.20 : To explode of course.
Won't this just inspire more spammers to pursue virus, trojan and spyware-oriented methods of spamming? Granted, this is significantly more difficult than just harvesting email addresses off of Usenet and web pages, but it seems like we're only one step ahead at any given time with our methods of spam prevention.
You know darn well that this will only increase employment in the Spam Technology sector and is a good thing.
Seriously, spammers are often a step ahead, and lately a lot of the spam I'm getting is masked to look like Amazon orders or closed eBay auctions. I haven't ordered anything from Amazon (USA) in ages, but I still have to peek to see if someone has cracked my account and ordered something. Just expect that the harder they are pressed, the harder spammers will press back by sinking to new lows.
A feeling of having made the same mistake before: Deja Foobar
After reading this, I realized that a good 90% of the email I receive is either from someone I've had previous contact with, or else someone 1 or at most 2 degrees of separation from one of those people. I never get mail worth reading from total strangers. Anything important is always linked back to me in some way.
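To make the "degrees of separation" idea concrete, here is a minimal sketch that counts hops between me and a sender in an address-book graph using a breadth-first search. The graph, the addresses, and the function name are all made up for illustration; this is not the researchers' algorithm, just the counting step described above.

```python
from collections import deque

def degrees_of_separation(contacts, me, sender):
    """Return the number of hops from `me` to `sender`, or None if unreachable."""
    if me == sender:
        return 0
    seen = {me}
    queue = deque([(me, 0)])
    while queue:
        person, dist = queue.popleft()
        # People with no recorded contacts simply have no outgoing edges.
        for neighbour in contacts.get(person, ()):
            if neighbour == sender:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Hypothetical address books: who has corresponded with whom.
contacts = {
    "me@example.org": {"alice@example.org", "bob@example.org"},
    "alice@example.org": {"me@example.org", "carol@example.org"},
    "carol@example.org": {"alice@example.org", "dave@example.org"},
}
```

With this toy data, carol is two hops away and a total stranger comes back as None, which is the case where you would fall back on some other filter.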
It should be interesting to see how this method plays out. (Now, I don't know why I even bothered with that last sentence. Everyone says that about every new spam-filtery thing. ((Don't know why I bothered with that last sentence either. Work is slow today I suppose.)) )
GeekNights!
Late Night Radio for Geeks!
What about spoofed messages from people on my list?
Worms, from infected email systems?
The researchers didn't address this.
Money cannot buy happiness, but can buy something soo darn close, that you can't really tell the difference
Happy Trails!
Erick
http://www.busyweather.com/
This seems to be a good start, but it still requires software on the user side. And that software must work with their mail client...
I guess this is where the focus has shifted. While some spam can be blanketed and deleted, it's really up to the RECIPIENT to judge whether it's spam or not.
But then again, do we trust the user? Do we trust Joe and Jane (our loving SixPack couple) to make the right decision? Sure, it might be prudent in a company of 5-50, but what about 500-5000? Deploy and manage copies of this program to see if it's going right or not?
I'm a sysadmin and I prefer the server-based solution. Blacklists, SpamAssassin, et al. Easier to fix one machine than 5000 desktops.
Comments?
When modding "Informative", please make sure it both has a source and IS actually informative.
This sounds like the whole "Friends and Family" network from AT&T a few years ago, and now Verizon's "In" network thing, but with email and exclusive instead of "Free calls to friends on 'the list'".
Pretty soon, you will have to send an MD5 hash of your DNA from a static IP address that is reversible and supply 5 references all in a PGP encrypted letter, along with a copy of your passport and birth certificate.
When it's more work to block spam than stop it, you have to ask what is going wrong. Maybe if we somehow figured out wonderful technologies to *stop* spammers instead of blocking them, we'd be getting towards the ultimate goal. This is much like throwing money at a problem to bandage it, not fix it. The solution, however, also has to be easier for end users, who are doing nothing wrong. Why is every solution harder for end users, but just a 'bump in the road' for spammers? Am I missing something?
I would like to share in all humility my own method of spam filtering:
;-)
I use a super-extra-secret e-mail that I give only to my friends.
Have you Meta Meta Moderated lately?
This certainly needs to be combined with a revamped SMTP system (or complete replacement) that enforces DNS-style From: lookups.
So no, this certainly isn't a solution all by itself. It's the best one I've seen so far that doesn't involve more laws, though.
Most of the other ideas surrounding DNS lookups are to enforce accurate From: lines. But then the ideas break down, with the best suggestions being new laws to punish the sender of the spam. With the proposal here today, it can be done with technology instead of waiting for legislation.
It doesn't hurt to be nice.
Though I'm no fan of Microsoft or Bill Gates, the solution proposed by them - one where a complicated math calculation is required for every mail they send - is on the right track because at least, in theory, it becomes expensive to send mail and therefore spammers are at a disadvantage. If this is to be a really workable solution, only time will tell - and given the MS tradition of hype ... who knows.
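For anyone wondering what "a complicated math calculation for every mail" could look like, here is a rough hashcash-style sketch: the sender has to find a nonce whose SHA-256 hash starts with a number of zero bits, which is slow to find and cheap to check. This is only an illustration of the general idea and not the actual scheme proposed by Microsoft; the stamp format, the function names, and the 12-bit difficulty are assumptions.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint_stamp(recipient, difficulty_bits=12):
    """Sender side: brute-force a nonce (about 2**difficulty_bits tries on average)."""
    for nonce in count():
        stamp = f"{recipient}:{nonce}"
        digest = hashlib.sha256(stamp.encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return stamp

def verify_stamp(stamp, recipient, difficulty_bits=12):
    """Receiver side: a single hash, so checking is cheap."""
    if not stamp.startswith(recipient + ":"):
        return False
    digest = hashlib.sha256(stamp.encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits
```

The stamp is tied to the recipient address, so a spammer cannot mint one stamp and reuse it for a million different inboxes.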
Schemes that make it expensive for the handlers (networks, ISPs) or the recipients, are not the way to go. After reading the article, it seems that this is just another one of those.
I've been swashdotted -- Elmer Fudd
In fact, this has provided me with a kind of "honeypot", since I now check for the addresses of several people who are long gone from my site. If I see their address, it's gotta be spam!
- Dave
This may be a reasonable solution to the drive-by spamming that occurs on LiveJournal. You can easily create a web-o-trust given the closed, friendly nature of the 'friends' networks.
It only works on 50%, but it claims *no false positives* on that 50%. That means that that 50% can be deleted immediately; no-one has to check in case there is a false positive. By contrast, Bayesian filters *will* produce the occasional false positive, so you have to trawl through your spam folder occasionally to check against this. If I could reduce my spam folder checking from 200 mails a day to 100, I'd be very happy.
Many people need to receive email from people they've never met, like prospective customers.
How did this get into Nature? There are anti-spam tools like SpamAssassin & POPFile that are far more effective against spam than this technique.
The issue is receiving. Yes, you can EASILY block outbound, it's inbound that's an issue.
"We prevent the worm's SMTP engine from working by having MX wildcard records to a logging box only for internal DNS -"
Say what? Why wouldn't you just block outbound port 25 from anyone except YOUR SMTP server's address? If a worm has its own SMTP engine (many do, yes), then what's to stop it from doing its own MX look-ups? It would take about 4 extra lines of code to accomplish this.
Mod +5 Drunk
The Bayesian rule is just a mechanism for combining multiple independent estimates into an overall estimate.
This is clearly an independent estimate, and a good mechanism to improve the overall detection probability.
What we need is a "meta-Bayesian" process that appropriately weights and combines other spam prediction estimates, not just word counts.
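As a rough sketch of what combining independent estimates means in practice: assuming each estimate is a probability that the message is spam, that the estimates are independent, and that the prior is 50/50, the combined value is prod(p) / (prod(p) + prod(1 - p)). The code below computes that in log-odds form to avoid underflow. The function name and the clamping constant are my own choices, and a real "meta-Bayesian" combiner would also have to weight the sources.

```python
import math

def combine_estimates(probabilities):
    """Combine independent spam probabilities, assuming a 50/50 prior."""
    log_odds = 0.0
    for p in probabilities:
        # Clamp so that a single 0.0 or 1.0 cannot produce log(0).
        p = min(max(p, 1e-6), 1 - 1e-6)
        log_odds += math.log(p) - math.log(1 - p)
    # Convert the summed log-odds back to a probability without overflow.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)
```

For example, a word-count filter saying 0.9 and a social-network score saying 0.9 combine to about 0.988, while 0.9 and 0.1 cancel out to 0.5.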
People who disagree with you are not automatically evil, greedy, or stupid.
I never thought that Slashdot would help me find papers relevant to my research!
I think that their idea is good from a technical point of view, but very bad from a privacy point of view. I am of the opinion that gathering social network information is extremely dangerous. A pertinent example: If your friend is branded a "terrorist," then "they" can exploit the information that you have voluntarily provided to then put you on a "terrorist" watch list.
Another example: Say that someone who knows someone that you know actually buys something from a spam. If the spammer can access the social network information, suddenly your little niche of the network is going to be aggressively spammed. After all, like minds congregate.
There is no doubt in my mind that the black hatters will infiltrate the social network communities and use that information to spy on potential viewers. See this bugzilla thread where the folks from Atriks Professional Email Deployment Service follow SpamAssassin's development and adapt their "ratware" tool accordingly.
The biggest problem with collecting social networks is that once the data has been gathered, it is very hard to control. Those of you using Orkut should think long and hard about it.
In conclusion, I think that this is technically a good idea but it opens a Pandora's box.
I've been thinking about this method for a while - basically, you configure your SMTP server to do this:
This idea is clearly too simple to have not been thought of before - but I have yet to find a good explanation as to why it won't work. Verizon.net uses this exact method - try sending an SMTP message from a host that isn't listed in your domain's MX records, and you get a "550 Sorry, you aren't allowed to mail for this domain" or something comparable. How come this method isn't more widely used? Going through my own SMTP server logs shows that the vast majority of SMTP servers sending legit mail are also listed in the domain's MX records. The only price is that you require the sending and receiving servers to be the same within a domain - hardly an unreasonable requirement.
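Roughly, the check described here could look like the sketch below. It is only an illustration: the function name is made up, and the MX lookup is passed in as a plain callable (here a dictionary of hypothetical records) because a real server would have to query DNS for it.

```python
def accept_sender(mail_from, connecting_host, mx_lookup):
    """Accept only if the connecting host is one of the sender domain's MX hosts."""
    domain = mail_from.rsplit("@", 1)[-1].lower()
    # Normalise names: case-insensitive, ignore a trailing root dot.
    allowed = {host.lower().rstrip(".") for host in mx_lookup(domain)}
    if connecting_host.lower().rstrip(".") in allowed:
        return 250, "OK"
    return 550, "Sorry, you aren't allowed to mail for this domain"

# Hypothetical MX data standing in for a real DNS query.
mx_records = {"example.org": ["mail.example.org."]}

def lookup(domain):
    return mx_records.get(domain, [])
```

Note that a domain with no MX records at all gets rejected in this sketch, which may or may not be what you want.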
to deal with open relays in China...
I would've harvested the emails of as many members of the ruling communist party as possible, and used those relays to spam them with anti-communist propaganda. I believe the consequences would've been swift and ruthless.
Unfortunately I can't read/write Chinese, and this idea wouldn't work in less repressive regimes...
Or simply not process the 53% with other spam detection software, which saves on CPU! In other words, make this the first anti-spam process, whereby, half of your email gets to skip spamassassin (or whatever). The other 50%, you process as usual.
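A minimal sketch of that ordering, assuming the social-network check returns "friend", "spam", or "unknown" and the expensive filter (SpamAssassin or whatever) is just a callable returning True for spam. Both callables and all names here are placeholders, not real APIs.

```python
def filter_mail(messages, network_verdict, content_filter):
    """Run the cheap network check first; only undecided mail hits the content filter.

    Returns (results, content_filter_calls).
    """
    results = {}
    calls = 0
    for msg in messages:
        verdict = network_verdict(msg)  # "friend", "spam" or "unknown"
        if verdict == "unknown":
            calls += 1
            verdict = "spam" if content_filter(msg) else "ham"
        results[msg] = verdict
    return results, calls

# Toy stand-ins: two messages are decided by the network check alone.
known = {"m1": "friend", "m2": "spam"}
results, calls = filter_mail(
    ["m1", "m2", "m3", "m4"],
    lambda m: known.get(m, "unknown"),
    lambda m: m == "m3",
)
```

In this toy run the content filter is called for only two of the four messages.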
Easy - those thousands of people who don't know each other also send email *back* to the mailing list. Only a few dummies send email back to the spammers.
For something based on statistics, the difference would likely be very noticeable.
This is right on the mark. I think that if this system were widely implemented then we would begin to see more email-virus-based spamming, essentially using the infected people to do the spamming to all of the people in their address book. This would in a sense defeat the whitelist method.
In response to the quote about counting the stars, you could use a Monte Carlo method to count a few stars in random portions of the sky to get a fairly accurate count of all the visible stars.
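Something like the sketch below, assuming the sky has been cut into equal-sized patches and you only count the stars in a random handful of them. The patch counts are invented numbers and the function name is made up.

```python
import random

def estimate_total(patch_counts, sample_size, rng=random):
    """Estimate the total by scaling up the mean count of randomly sampled patches."""
    sampled = rng.sample(range(len(patch_counts)), sample_size)
    mean = sum(patch_counts[i] for i in sampled) / sample_size
    return mean * len(patch_counts)

# Invented data: star counts for 8 equal patches of sky.
patch_counts = [12, 9, 15, 11, 8, 14, 10, 13]
estimate = estimate_total(patch_counts, sample_size=4, rng=random.Random(0))
```

The more patches you sample, the closer the estimate gets; sample all of them and you are back to counting every star.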
Oh ya, in case it's not obvious, that means up to a 50% reduction in the small percent of email which are false-positives. That means, if you have a 5% false-positive rate, you *may* see that reduced to as little as 2.5%! Technically, the reduction may actually be higher than that. The reason being, it may be that 100% of the false-positives fall into the 50% that this technique properly identifies. Needless to say, that's very exciting. It also means that it creates the possibility to allow people to lower their spam threshold without fear of creating a higher false-positive hit rate. That, in turn, means more spam identified with fewer false positives. Let's hope reality falls close to my rambling speculations here! ;)
Very interesting indeed!
Isn't this scheme the perfect use for the wide-ranging social network information being collected by Plaxo?
It makes sense - they certainly haven't announced a revenue stream yet, and "keeping your address book up-to-date," even in a wireless and multiplatform world, just doesn't seem like a big enough idea to justify the huge amounts of data collected.
So is that the announcement that's coming from Plaxo, the unveiling of a broad spam solution that uses 'degrees of separation' data from your address book and the address books of your friends to implement a spam filtering solution?
If I may say, it does seem like the killer app for their unique data set.
-------
Believe me, I'm as surprised by my comment as you are.
"as i understand it, they would have to spoof to someone who you know, a virus could easily do that (after it has your address book) but not so much for spam."
And virus-infected machines are being used to send spam, they're also capable of swapping email address details between machines?
Coincidence? You'd better hope the spammers think so.
I think this could be pretty easily beaten, and I'm surprised my spam isn't already showing this characteristic, now that I think about it...
(all spammers, please don't read anymore below here, I don't want to give you ideas).
In my example, I get spam sent to me and several other people at my work. It would be trivial for spammers to modify their algorithms so that instead of sending to x people in my office, they send to (x-1) people in my office, and use that last address as the "From" field. Of course, you could set up your email server to detect this (mail coming from outside claiming to be from inside). Does Exchange Server provide this kind of functionality? If not, it would be all too easy for spammers to break this method.
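The server-side check mentioned here (mail coming from outside but claiming to be from inside) could be sketched as below. I don't know how Exchange does or doesn't handle this; the domain, the network ranges, and the function name are assumptions for illustration only.

```python
import ipaddress

def looks_forged(from_address, connecting_ip, local_domain, internal_networks):
    """True if the From: claims our own domain but the connection came from outside."""
    sender_domain = from_address.rsplit("@", 1)[-1].lower()
    if sender_domain != local_domain.lower():
        return False  # not claiming to be one of us, so this check says nothing
    ip = ipaddress.ip_address(connecting_ip)
    inside = any(ip in ipaddress.ip_network(net) for net in internal_networks)
    return not inside

# Hypothetical office setup.
OFFICE_NETS = ["10.0.0.0/8", "192.168.0.0/16"]
```

Legitimate remote users who send through an outside provider with their work address would trip this too, so you'd want them relaying through the office server.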
It should be illegal to say that freedom of speech should be limited.
There are three ways one can beat the filter.
The first is trivial and certain to succeed but has a drawback for spammers: only send e-mail to single recipients. The drawback is that this puts a much higher load on their servers, since every message is sent individually.
The second method is to always include dummy addresses in the mailing list that the recipients probably have in their address books. For example, add the following names to the To: field: notifications@paypal.com and list-notication@ebay.com.
Any recipient of the spam message who has also received e-mail from eBay or PayPal will trust the message.
One can do even better by planning ahead when harvesting e-mails. For example, if you harvest a set of e-mails from a particular bulletin board, you can make note of message cliques at the time of harvesting, and send messages in the same groupings. For good measure you also include the addresses of the bulletin board admins as well.
Third, all the spammer really has to know is one correspondent you have gotten messages from. So either buy mailing lists from legitimate companies people actually do business with, or create your own loss-leader messages. For example, send out some political action alert or anything that has some value or use to most people, maybe a lottery drawing for a prize, or a discount subscription to Time magazine, so they will accept the message. The sender does not have to be the same as your spammer address. Now you know someone in the address book of the victim. Now you spam the crap out of them while including the trojan address in the To: field.
Some drink at the fountain of knowledge. Others just gargle.
suppose a spammer harvests from a social network site and spoofs their source address to be from harvested addresses... it's pretty likely 2 people on the same social network site will be within each other's threshold if only the to/from/cc headers are used...
maybe more sophisticated techniques to include the source IP subnet or something? Some sender verification would be required.
Before I saw your posting, I was thinking that perhaps one way to deal with it would be for a similar approach to the "social networks" and "web of trust" ones to be applied to the servers and networks themselves: each network could keep a list of mail servers on other networks that they trust to not be open relays or spam hosts, etc, and for mail sent from other servers, they could check the lists that other trusted networks keep. They could then choose to add those servers to their own lists too if they turned out to be OK. Some means would need to be made for new servers to be able to get on somebody's list, of course...
But the point is ultimately, that dealing with the Spam issue by filtering on the content is just stupid, it's a losing battle as they keep finding new stupid ways to get past the filters, and the filters will always have some risk of blocking legitimate emails. What if I send a parody of a spam to a friend as a joke? And if we only use filters at the user's end, the burden of the traffic is still felt by our ISPs and email providers. There HAS to be a way to block it at the source.
Be careful! New moon tonight.
I have my own domain, and run my own mail server for personal email. The ONE thing that I have done to reduce incoming spam drastically (i.e. I only get 5% as much now) is to refuse incoming connections to the mail server from any machine that does not have a valid rDNS value. I may miss email from someone, but they'll have gotten a (somewhat) informative message telling them why their email did not succeed. They can either complain to their ISP and get their rDNS fixed (like I did :-) or call me/send me a letter.
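A minimal sketch of such a check, assuming "valid rDNS" just means the connecting IP has a PTR record at all (the exact rule used above isn't stated). The lookup is injectable so it can be tried without touching the network; by default it uses the standard library's `socket.gethostbyaddr`.

```python
import socket

def has_valid_rdns(ip, reverse_lookup=socket.gethostbyaddr):
    """True if a reverse lookup of the connecting IP returns a hostname."""
    try:
        hostname, _aliases, _addresses = reverse_lookup(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False
    return bool(hostname)

# Fake resolvers standing in for real DNS, for illustration only.
def fake_ok(ip):
    return ("mail.example.org", [], [ip])

def fake_missing(ip):
    raise socket.herror(1, "Unknown host")
```

A stricter variant would also resolve the returned hostname forward again and require it to map back to the same IP.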
Tom.
Essentially, that is a short description of how a "Chain of Trust", or better named a "Web of Trust", works in GPG. You have people who verify that person A knows the private key A_1 that corresponds to public key A_2.
Even if they don't bother encrypting everything, but just digitally sign it. It's also just an anti-spam filter, so I'm even less worried about having the key be encrypted. Now, I can go sign any key, with my key rating how "trustworthy" I deem people. You get a 5 if you are really trustworthy, and a 0 if I deem you absolutely untrustworthy.
From there, you can build layers of trust, trusting the ratings of people you trust, on and on, until you establish a relationship thru the web between you and the sender.
Now the problem is that there is no marginal benefit, and it'd be very hard to get the users individually to do this. So, I'd suggest that the SMTP servers do this themselves. You create a web of trust that is only for SMTP servers. You register your key on the web. You send people some e-mail. Eventually, they'll e-mail the admin of the e-mail servers you communicate with regularly, asking them to review their logs and sign your key. Ask your friends, peers, clients, vendors, and/or upstream providers to sign the keys deeming you trustworthy.
They do this, and you're on the web of trust. You find a mail that doesn't do this, view it as suspicious. You find one that is signed with an SMTP key that is known to have sent spam by someone you trust, you drop it on the floor.
Then you can start to trust SMTP servers. It has all of the advantages of SPF, and has some type of cryptographic security, plus doesn't allow spammers to just set up bogus SPF records and get away with it. They'll have to continuously try to get new keys that are deemed trustworthy.
Assuming you have any friends who have friends outside your clique, it should be relatively easy to get a foothold in the web of trust. Everybody who befriends a spammer will be deemed "untrustworthy" in short order. So you won't trust people they trust. Eventually the system should balance out. Nothing needs to change for individual users. Mail admins could communicate with each other and make the system work. About the only real problem is that it puts extra load on any mail server. Depending on the volume of mail you have, just set up 2 or 3 inbound/outbound sendmail servers that you queue to. Their sole job is to verify and/or add the digital signature/encryption to mail.
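To give a feel for the accept / suspicious / drop decision, here is a toy sketch that walks the ratings graph outward from my own key. It leaves out all the actual cryptography (no signatures are verified), and the threshold of 4, the depth limit of 3, and every name in it are assumptions of mine, not how GPG computes trust.

```python
from collections import deque

def judge_server(ratings, me, server, trust_threshold=4, max_depth=3):
    """ratings[signer][subject] is a 0..5 score. Returns 'accept', 'suspicious' or 'drop'."""
    trusted = {me}
    queue = deque([(me, 0)])
    distrusted = False
    while queue:
        signer, depth = queue.popleft()
        for subject, score in ratings.get(signer, {}).items():
            # Someone in my web of trust has flagged this server as a spam source.
            if subject == server and score == 0:
                distrusted = True
            # Follow only strong ratings, and only a few hops out.
            if score >= trust_threshold and subject not in trusted and depth < max_depth:
                trusted.add(subject)
                queue.append((subject, depth + 1))
    if distrusted:
        return "drop"
    return "accept" if server in trusted else "suspicious"

# Hypothetical ratings between mail servers.
ratings = {
    "me": {"smtp.friend.example": 5},
    "smtp.friend.example": {"smtp.peer.example": 4, "smtp.spam.example": 0},
}
```

Here a server vouched for by a server I trust is accepted, one that a trusted server rated 0 is dropped, and anything nobody has rated stays suspicious.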
Webs of trust are a well understood animal in GPG land. While I'm not terribly conversant with them, they are essentially a distributed rating system by which rankings and trustworthiness can be ascertained about people you've never met. Think of it as a better system, with more flexibility than Karma + Karma Modifiers + Friend/Foe on Slashdot.org
Kirby