Paul Graham: Filters that Fight Back


Posted by michael on Sunday August 10, 2003 @05:08AM from the auto-DDOS dept.

Mortimer.CA writes "Paul Graham is back with another article about combating spam. It's entitled Filters that Fight Back: 'One intriguing idea is to literally fight back: to make filters disable spammers' servers by automatically following all the links in each incoming email. We may be driven to this in order to achieve accurate filtering anyway. Why wait?' One danger is someone doing a DDoS by sending fake spam."

11 of 328 comments

response to the lister's comment by ih8apple · 2003-08-10 05:11 · Score: 4, Informative

In response to the comment: "One danger is someone doing a DDoS by sending fake spam"

From the article notes: "[5] The best way to protect against abuse might be to have the central authority whitelist every site by default, and then, by whatever protocol, take certain sites off. Because you can look at the sites before taking them off the whitelist, there is little danger of people abusing this system to attack an innocent site."

--

Why do I h8 apple?
Comparison of Bayesian spam filters by kreide33 · 2003-08-10 05:27 · Score: 5, Informative

I recently switched from a keyword-based spam filter to a Bayesian filter. However, there exist several Bayesian filter projects, and the choice of which to use is not obvious. Therefore, I decided to do an actual test and write up my findings in a review so others can benefit as well. Read it and find out how to win the War on spam.
Do they really care? by eddy · 2003-08-10 05:30 · Score: 3, Informative

My hotmail account gets relentlessly spammed even though I _never_ follow any links from spam or let it load any images. Even before Hotmail introduced the "don't load inline images" feature I always disabled javascript + images before opening any suspected spam.

Basically, can it get worse? They never seem to remove inactive accounts anyway.

I have a domain registered which I've owned for three years, and it's still getting spam for accounts related to the previous owner of said domain. My mailer says "no such account" over and over and over again.

Spammers don't care whether the account exists, is inactive, filtered or whatever. They try to spam it anyway.

--
Belief is the currency of delusion.
1. Re:Do they really care? by Anonymous Coward · 2003-08-10 05:40 · Score: 5, Informative
  
  You can have a domain/subdomain with no A records or MX records and they will keep trying. You can also have nothing but blackhole MXs - hosts that don't exist, but are on routable networks. I've had a domain since 1994, and it was in one of the above states for about 2-3 years.
  
  Last month I put a real MX record in there and pointed it at a box that's running a mail server. Sure enough, the spam flows continuously. It's not just the "make up random shit and put @aol.com" idiots either - the big outfits with permanent networks and domains are mailing it too.
  
  I've taught my mail server to quarantine any host that attempts to mail my long-dead domain, so having it go to a routable address is actually useful again. Every attempt they make ruins another open proxy or relay for every other spammer that may find it later.
  
  You might consider using those "never valid/previous owner" accounts as spam traps. Anything coming to them now is obviously worthless, so why not make them suffer for trying?
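
For anyone wondering what "quarantine any host that mails the dead domain" could look like, here is a rough Python sketch. Everything in it is an assumption for illustration: the names TRAP_DOMAINS, QUARANTINE_SECONDS and SpamTrap are hypothetical, the one-week expiry is made up, and a real mail server would hook this into its own policy interface rather than use this class.

```python
import time

# Hypothetical names throughout: TRAP_DOMAINS, QUARANTINE_SECONDS and
# SpamTrap are illustrative, not part of any real mail server.
TRAP_DOMAINS = {"long-dead.example"}
QUARANTINE_SECONDS = 7 * 24 * 3600  # assumed one-week quarantine


class SpamTrap:
    def __init__(self, trap_domains=TRAP_DOMAINS, ttl=QUARANTINE_SECONDS):
        self.trap_domains = {d.lower() for d in trap_domains}
        self.ttl = ttl
        self.quarantined = {}  # client IP -> expiry timestamp

    def is_quarantined(self, client_ip, now=None):
        now = time.time() if now is None else now
        expiry = self.quarantined.get(client_ip)
        return expiry is not None and expiry > now

    def check_recipient(self, client_ip, rcpt_to, now=None):
        """Return True if mail from this host should be accepted."""
        now = time.time() if now is None else now
        domain = rcpt_to.rsplit("@", 1)[-1].lower()
        if domain in self.trap_domains:
            # Any attempt to reach the dead domain quarantines the sender.
            self.quarantined[client_ip] = now + self.ttl
            return False
        return not self.is_quarantined(client_ip, now)
```

The point of the design is that the trap domain has no legitimate mail at all, so a single hit is enough evidence to refuse the host for a while.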
Re:No such thing by wavecoder · 2003-08-10 05:32 · Score: 2, Informative

there is no 'fake' spam

Not true; several times I have received spams so carefully put together that they looked like they came from one of my addresses. For example, I used to have an address like me@school.edu; it's been inactive for some time, but once in a while I'll get a message claiming to be from that address, complete with perfectly spoofed headers. Tricky, but entirely possible.

--
Web Design & Software Development
DDoS with IFRAMEs by The+Famous+Brett+Wat · 2003-08-10 05:50 · Score: 4, Informative

The problems with spam-based DDoS are bad enough already. Many HTML mail readers honour IFRAME tags, so if you want to DDoS someone, then just combine a Joe Job (fake their identity, advertise their site) with an HTML mail that contains N IFRAMEs, each one pixel high and pointing to a large page on the victim's site. Anyone who reads the spam in an incautious HTML-capable mail client (of which there are still way too many) will subsequently attempt to fetch the specified page N times, unless you're lucky with intermediate caching proxies or the user hitting the stop button.
Such an attack on Nutters.org forced me to stop doing my own hosting on a DSL line, since it got utterly swamped and cost way too much in bandwidth. Amusingly, it has forced me into using a much cheaper and higher bandwidth service -- one where such attacks are no longer my problem. The rules of the game have changed for me, though: I no longer consider it viable to host a website on a low-bandwidth leaf node like a single DSL, even where normal usage would make it seem acceptable, since it makes you a sitting duck for this kind of attack. I still can't imagine why anyone would want to target Nutters.org; being small and unworthy of attack doesn't seem to be a good defense anymore.
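
On the defensive side, a filter or mail client could flag HTML mail that carries IFRAMEs before anything is rendered. The following is only a sketch using Python's standard html.parser; the names IframeCounter, iframe_sources and looks_like_iframe_flood are hypothetical, and the zero-IFRAME policy is an assumption, not something any particular client does.

```python
from html.parser import HTMLParser


class IframeCounter(HTMLParser):
    """Collects the src of every IFRAME in an HTML mail body."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so "IFRAME SRC=..." arrives as "iframe" / "src".
        if tag == "iframe":
            self.sources.append(dict(attrs).get("src"))


def iframe_sources(html_body):
    parser = IframeCounter()
    parser.feed(html_body)
    parser.close()
    return parser.sources


def looks_like_iframe_flood(html_body, max_iframes=0):
    # Assumed policy: legitimate mail rarely needs any IFRAME at all.
    return len(iframe_sources(html_body)) > max_iframes
```

A client that refuses to fetch anything for a flagged message never generates the N requests in the first place.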

--
proof, n. A demonstration that a conclusion is implied by certain premises and axioms.
Re:Choosing A Bayesian Filter by wavecoder · 2003-08-10 06:22 · Score: 2, Informative
First of all, these are not apples to apples. Popfile is a multi-purpose classifier; CRM114 is a multi-purpose filter; the others are sole-purpose filters, to my knowledge. So, it depends on:
1. whether you have more than one use (spam filtering) for it,
2. how much of a geek you are (do you really want to have to compile it yourself, or does that give you thrills?),
3. OS - this determines more than you might expect,
4. the stats that are out there (there's little doubt that CRM114 is the best at what it does, but there are plenty of others in the very high 90's)
Besides, the more the merrier - the more algorithms out there and the more spam corpora that exist, the harder it is to get ANY spam through.

-Ed
--
Web Design & Software Development
New Spamming Technique : Trickle Spam. by androse · 2003-08-10 06:34 · Score: 4, Informative
I'm all for the idea, and as a matter of fact, I suggested it a couple of months ago.

If individual spam victims start repeatedly downloading the spammer's website, this could push the spammer to change the way he sends spam, from the current big-bang technique to a small, continuous trickle. The spammer would send a single spam run over several weeks instead of a few hours. He would parallelize the process.

I see two possible counter-attacks to this:
- content-based blacklisting (like Vipul's Razor, etc.), i.e. a central database of links that are currently being used in spam.
- high aggressiveness from the victims: if everyone loads the URI 50, 100, or 300 times, then the "trickle method" would probably fail. You should of course change the HTTP User-Agent string for each request, and randomize the timing to stop any filtering on the web server.
Feel the rage !
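
To make the first option concrete, here is a minimal Python sketch of checking a message's links against a list of hosts currently seen in spam. It assumes a plain local set standing in for the central database; the function names and the URL regex are mine, and this is not a description of how Razor or any existing service actually works.

```python
import re
from urllib.parse import urlsplit

# Deliberately loose pattern: good enough to pull http(s) links out of
# plain-text or HTML mail for a sketch, not a full URL grammar.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_hosts(message_text):
    """Return the set of lowercased hostnames linked from a message."""
    hosts = set()
    for url in URL_RE.findall(message_text):
        host = urlsplit(url).hostname  # already lowercased by urlsplit
        if host:
            hosts.add(host)
    return hosts


def blacklisted_hosts(message_text, blacklist):
    """Hosts in the message that also appear in the shared blacklist."""
    return linked_hosts(message_text) & {h.lower() for h in blacklist}
```

Since the lookup keys on the link rather than on when the mail arrived, a trickle run is caught as soon as the first few copies are reported.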
Re:Thoughts on active countermeasures and relays.. by hankaholic · 2003-08-10 07:30 · Score: 2, Informative
Answers:
1. If this caught on in a big way, almost certainly less load than spam imposes on its own, assuming that this was run on the servers. However, since Bayesian filters are best left to the individual to personalize to their own specific preferences, the load would likely be distributed across the clients (such as Mozilla), as opposed to the servers.
  
  Graham did mention users with broadband connections, implying that this would be something that the client would pull down.
2. Fetching a page over HTTP and parsing the returned text carries no more security risk than automatically parsing text which is sent to you via email. As long as the software is designed sensibly, there shouldn't be any additional security problems.
3. This is difficult to say, but one benefit of the proposed system is that it only loads pages linked from messages which are not obvious in their classification. What is questionable in one person's inbox may not be questionable in another's. This reduces the chance that a concocted email will create such a DDOS attack -- it would have to be created in such a way as to be tagged as "possibly, but not definitely, spam" by many different programs given the unique corpora of those running the software.
4. This is really the big issue -- making sure that an implementation is widespread enough to make a real difference in the habits of spammers and the networks which support them. Reaching this critical mass may take a while, but the point of the article is that by also parsing the links in the email, you get a better idea of how relevant the message may or may not be.
  
  In other words, you get a more accurate filter which takes into account more than the message itself -- it also considers the content which the message is trying to put across.
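
The "only fetch links for messages that are not obvious" rule from answer 3 could be sketched like this in Python. The function name and the 0.1 / 0.9 cutoffs are assumptions picked for illustration; the article does not give specific thresholds.

```python
def should_follow_links(spam_probability, low=0.1, high=0.9):
    """True only for messages the filter cannot classify confidently.

    low and high are assumed cutoffs: below low the message is treated
    as clearly legitimate, above high as clearly spam, and in neither
    case is there any reason to fetch the linked pages.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    return low < spam_probability < high
```

Because each user's filter is trained on a different corpus, the same concocted message would get a different score in each inbox, so an attacker cannot count on it landing in the uncertain band everywhere.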
--
Somebody get that guy an ambulance!
Re:Another idea by hankaholic · 2003-08-10 07:51 · Score: 2, Informative

Could it work?
Define "work".

What you're proposing is that you send a message in response to every message you receive. Furthermore, you're suggesting that the message you send in response have an invalid (random) return address.

How is this a good idea?

Okay, say machine scott@b.com is sending to larry@a.com. Assume that all machines are running your "callback" software.

B connects to A. A holds the connection open, as you proposed, and sends a message to scott@b.com, with a forged header so that it looks as though it came from "random1928@c.com".

Okay, B has a pending connection to A. A has an open connection to B, and B tries to deliver the mail to C.

So the user scott@b.com has now gotten spam from random1928@c.com. The operator of c.com isn't happy, because it looks like he's sending spam. The guy at b.com isn't happy, because every message he sends to a.com results in a spam for him.

If the sites involved had catchall aliases (which would accept mail to any address at that domain), the number of connections would increase continually, and nothing would ever actually be confirmed, until a connection or DNS lookup failed somewhere, in which case every pending connection would fail.

SMTP already includes a command for address verification -- it's called VRFY. Most sites shut it off, though, because instead of spamming tons of random addresses, one could just VRFY tons of random addresses. This would make spammers' jobs easier -- they would be able to ensure that each address to which they send mail represents an actual mailbox.

Getting back to your suggestion, though -- this is a truly bad idea. Try it on paper if you don't believe me. Assume that most or all of the hosts are running the software which you propose. Keep in mind that you may suggest inserting headers so that servers can communicate to each other and keep track of which messages are in response to other messages, but headers can be (and are!) forged.

--
Somebody get that guy an ambulance!
Bayesian filters by dtfinch · 2003-08-10 07:52 · Score: 2, Informative

It seems like the need for other anti-spam techniques will decrease as these become more popular. Things like IP banning or automated server hacking just hurt more non-spammers.

I installed a free one called K9 (though I donated $20 to the author), and over my last 573 emails (392 spam) it has only made one mistake, making it over 99.8% accurate after its initial training (141 messages). I've only been using it for a few weeks. It's about a 60k download and is very flexible and well behaved. The downside is that it's closed source and built for win32. I don't know if it works under Wine.
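
For anyone checking the arithmetic, the figure follows from one mistake in 573 messages. A small Python sketch (the accuracy helper is just for illustration):

```python
def accuracy(total_messages, mistakes):
    """Fraction of messages classified correctly."""
    if total_messages <= 0 or not 0 <= mistakes <= total_messages:
        raise ValueError("invalid counts")
    return (total_messages - mistakes) / total_messages


rate = accuracy(573, 1)   # 572 correct out of 573
print(f"{rate:.3%}")      # prints 99.825%
```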

The one spam that got through was disguised as a typical personal message, except that it was offering a business relationship and contained a personalized image link to determine if I viewed the message.

I tried Mozilla's built-in Bayesian filter for a few months. It had about 90% accuracy, even though I corrected every single mistake it made. Something's not working there, so it probably shouldn't be used to judge the accuracy of Bayesian filters in general.

I've tried PopFile as well. It seems to have good accuracy, but it's like swatting a fly with a sledgehammer. It's like a full-fledged anti-spam server and is best installed on a dedicated server, but it is not well suited for multi-user environments, and it's not easy to correct old mistakes or rebuild the word database. It does have the benefit of being cross-platform though, and it supports multiple buckets, not just spam and not spam.