Paul Graham: Filters that Fight Back
Mortimer.CA writes "Paul Graham is back with another article about combating spam. It's entitled Filters that Fight Back: 'One intriguing idea is to literally fight back: to make filters disable spammers' servers by automatically following all the links in each incoming email. We may be driven to this in order to achieve accurate filtering anyway. Why wait?' One danger is someone doing a DDoS by sending fake spam."
In response to the comment: "One danger is someone doing a DDoS by sending fake spam"
From the article notes: "[5] The best way to protect against abuse might be to have the central authority whitelist every site by default, and then, by whatever protocol, take certain sites off. Because you can look at the sites before taking them off the whitelist, there is little danger of people abusing this system to attack an innocent site."
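A rough sketch of how a client-side filter might apply that footnote: every site is treated as whitelisted unless a central list has explicitly removed it, and only links to removed sites are ever followed. The function name, the regular expression, and the `removed_from_whitelist` set are illustrative assumptions, not anything specified in the article, and the sketch only selects links without fetching them.

```python
import re
from urllib.parse import urlparse

# Deliberately simple URL pattern; real mail would need HTML-aware parsing.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def links_to_follow(body, removed_from_whitelist):
    """Return the URLs in `body` whose host was taken off the whitelist.

    Sites are whitelisted by default, so an unknown host is never crawled.
    `removed_from_whitelist` stands in for the centrally maintained list.
    """
    targets = []
    for url in URL_RE.findall(body):
        host = (urlparse(url).hostname or "").lower()
        if host in removed_from_whitelist:
            targets.append(url)
    return targets

body = "Buy now http://spam.example/offer or read http://innocent.example/page"
print(links_to_follow(body, {"spam.example"}))
```

Under this design, a forged "spam" that points at an innocent site triggers no traffic, because that site was never taken off the whitelist.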
Why do I h8 apple?
I recently switched from a keyword-based spam filter to a Bayesian filter. However, there exist several Bayesian filter projects, and the choice of which to use is not obvious. Therefore, I decided to do an actual test and write up my findings in a review so others can benefit as well. Read it and find out how to win the War on spam.
My hotmail account gets relentlessly spammed even though I _never_ follow any links from spam or let it load any images. Even before Hotmail introduced the "don't load inline images" feature I always disabled javascript + images before opening any suspected spam.
Basically, can it get worse? They never seem to remove inactive accounts anyway.
I have a domain registered which I've owned for three years, and it's still getting spam for accounts related to the previous owner of said domain. My mailer says "no such account" over and over and over again.
Spammers don't care whether the account exists, is inactive, filtered or whatever. They try to spam it anyway.
Belief is the currency of delusion.
there is no 'fake' spam
Not true; several times I have received spams so carefully put together that they looked like they came from one of my addresses. For example, I used to have an address like me@school.edu; it's been inactive for some time, but once in a while I'll get a message claiming to be from that address, complete with perfectly spoofed headers. Tricky, but entirely possible.
Web Design & Software Development
Such an attack on Nutters.org forced me to stop doing my own hosting on a DSL line, since it got utterly swamped and cost way too much in bandwidth. Amusingly, it has forced me into using a much cheaper and higher bandwidth service -- one where such attacks are no longer my problem. The rules of the game have changed for me, though: I no longer consider it viable to host a website on a low-bandwidth leaf node like a single DSL, even where normal usage would make it seem acceptable, since it makes you a sitting duck for this kind of attack. I still can't imagine why anyone would want to target Nutters.org; being small and unworthy of attack doesn't seem to be a good defense anymore.
proof, n. A demonstration that a conclusion is implied by certain premises and axioms.
- whether you have more than one use (spam filtering) for it,
- how much of a geek you are (do you really want to have to compile it yourself, or does that give you thrills?),
- OS - this determines more than you might expect,
- the stats that are out there (there's little doubt that CRM114 is the best at what it does, but there are plenty of others in the very high 90's)
Besides, the more the merrier - the more algorithms out there and the more spam corpora that exist, the harder it is to get ANY spam through. -Ed
Web Design & Software Development
I'm all for the idea, and as a matter of fact, I suggested it a couple of months ago.
If individual spam victims start repeatedly downloading the spammer's website, this could lead the spammer to change the way he sends spam, from the current big bang technique to a small continuous trickle technique. The spammer would spread a single spam run over several weeks instead of a few hours. He would parallelize the process.
I see two possible counter-attacks to this :
Feel the rage !
Graham did mention users with broadband connections, implying that this would be something that the client would pull down.
In other words, you get a more accurate filter which takes into account more than the message itself -- it also considers the content which the message is trying to put across.
Somebody get that guy an ambulance!
What you're proposing is that you send a message in response to every message you receive. Furthermore, you're suggesting that the message you send in response have an invalid (random) return address.
How is this a good idea?
Okay, say machine scott@b.com is sending to larry@a.com. Assume that all machines are running your "callback" software.
B connects to A. A holds the connection open, as you proposed, and sends a message to scott@b.com, with a forged header so that it looks as though it came from "random1928@c.com".
Okay, B has a pending connection to A. A has an open connection to B, and B tries to deliver the mail to C.
So the user scott@b.com has now gotten spam from random1928@c.com. The operator of c.com isn't happy, because it looks like he's sending spam. The guy at b.com isn't happy, because every message he sends to a.com results in a spam for him.
If the sites involved had catchall aliases (which would accept mail to any address at that domain), the number of connections would increase continually, and nothing would ever actually be confirmed, until a connection or DNS lookup failed somewhere, in which case every pending connection would fail.
SMTP already includes a command for address verification -- it's called VRFY. Most sites shut it off, though, because instead of spamming tons of random addresses, one could just VRFY tons of random addresses. This would make spammers' jobs easier -- they would be able to ensure that each address to which they send mail represents an actual mailbox.
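For reference, a small sketch of how a VRFY reply might be read. The helper name and the three labels are made up for illustration; the reply codes are the standard SMTP ones (250/251 for a recognized address, 252 for "cannot verify, but will attempt delivery").

```python
def interpret_vrfy(code):
    """Map an SMTP VRFY reply code to a rough verdict (labels are illustrative)."""
    if code in (250, 251):
        return "deliverable"   # the server recognizes the address
    if code == 252:
        return "unverified"    # typical reply when VRFY is effectively shut off
    return "rejected"          # e.g. 550 no such mailbox, 502 not implemented

# With a live server, Python's smtplib can issue the command, e.g.:
#   import smtplib
#   code, message = smtplib.SMTP("mail.example.com").verify("larry")
# ("mail.example.com" is a placeholder host, not a real service.)
print(interpret_vrfy(252))
```

A server that answers 252 to everything gives a spammer no information, which is why so many sites configure it that way.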
Getting back to your suggestion, though -- this is a truly bad idea. Try it on paper if you don't believe me. Assume that most or all of the hosts are running the software which you propose. Keep in mind that you may suggest inserting headers so that servers can communicate with each other and keep track of which messages are in response to other messages, but headers can be (and are!) forged.
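For anyone who would rather not do it on paper, here is a toy simulation of the cascade described above. It assumes every host runs the callback scheme and has a catch-all alias, and it uses a hypothetical `fail_after` limit to stand in for the point where a connection or DNS lookup finally fails. The domain names and the function are illustrative only.

```python
import random

def simulate_callbacks(domains, sender, recipient, fail_after, seed=0):
    """Return the connections held open before the first failure.

    Each entry is (connecting_host, receiving_host). The receiving host
    holds the connection and probes the claimed sender's domain, using a
    forged random return address at yet another domain.
    """
    rng = random.Random(seed)
    held_open = []
    connecting, receiving, claimed = sender, recipient, sender
    while len(held_open) < fail_after:
        held_open.append((connecting, receiving))
        # Forge a return address at a domain other than the two involved.
        forged = rng.choice([d for d in domains if d not in (receiving, claimed)])
        # The probe is itself a message, so the probed host must probe in turn.
        connecting, receiving, claimed = receiving, claimed, forged
    return held_open

chain = simulate_callbacks(["a.com", "b.com", "c.com", "d.com"],
                           sender="b.com", recipient="a.com", fail_after=10)
for hop in chain:
    print("%s -> %s (held open)" % hop)
print(len(chain), "connections pending, none confirmed")
```

Nothing in the loop ever confirms a message. The chain only stops when the limit is hit, at which point every pending connection fails together.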
Somebody get that guy an ambulance!
It seems like the need for other anti-spam techniques will decrease as these become more popular. Things like IP banning or automated server hacking just hurt more non-spammers.
I installed a free one called K9 (though I donated $20 to the author), and over my last 573 emails (392 spam) it has only made one mistake, making it over 99.8% accurate after its initial training (141 messages). I've only been using it for a few weeks. It's about a 60k download and is very flexible and well behaved. The downside is that it's closed source and built for win32. I don't know if it works under Wine.
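The arithmetic behind that figure, as a quick check. It assumes the 573 messages are the ones classified after the 141-message training set, which the numbers above do not state outright.

```python
def accuracy(total_messages, mistakes):
    """Fraction of messages classified correctly."""
    return 1 - mistakes / total_messages

total, spam, mistakes = 573, 392, 1
legit = total - spam  # messages that were not spam
print(f"legitimate messages: {legit}")
print(f"accuracy: {accuracy(total, mistakes):.2%}")
```

One error in 573 is an error rate of a little under 0.18%, so "over 99.8%" holds, though with a single mistake the estimate is obviously coarse.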
The one spam that got through was disguised as a typical personal message, except that it was offering a business relationship and contained a personalized image link to determine if I viewed the message.
I tried Mozilla's built-in Bayesian filter for a few months. It had about 90% accuracy, even though I corrected every single mistake it made. Something's not working there, so it probably shouldn't be used to judge the accuracy of Bayesian filters in general.
I've tried PopFile as well. It seems to have good accuracy, but it's like swatting a fly with a sledgehammer. It's like a full-fledged anti-spam server and is best installed on a dedicated server, but it is not well suited for multi-user environments, and it's not easy to correct old mistakes or rebuild the word database. It does have the benefit of being cross-platform though, and it supports multiple buckets, not just spam and not spam.
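To illustrate what "multiple buckets" means in practice, here is a minimal naive Bayes sketch that sorts mail into any number of named buckets. It is a textbook multinomial classifier with add-one smoothing, not PopFile's or K9's actual implementation; the class name, tokenizer, and sample messages are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict

class BucketClassifier:
    """Toy multi-bucket naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$'-]+", text.lower())

    def train(self, bucket, text):
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most probable bucket, or None if nothing was trained."""
        vocab = set().union(*self.word_counts.values())
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # Log prior for the bucket plus smoothed log likelihood per word.
            score = math.log(self.message_counts[bucket] / total_messages)
            bucket_total = sum(counts.values())
            for word in words:
                score += math.log((counts[word] + 1) / (bucket_total + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket

clf = BucketClassifier()
clf.train("spam", "cheap pills buy now")
clf.train("spam", "buy cheap watches now")
clf.train("work", "meeting agenda for monday")
clf.train("work", "project status report")
clf.train("personal", "dinner with mom on sunday")
print(clf.classify("buy cheap pills"))
print(clf.classify("status report for the project"))
```

Correcting an old mistake in a scheme like this means subtracting the message's words from one bucket and adding them to another, which is only easy if the tool keeps the original messages around.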