Stopping SpamBots With Apache

Posted by timothy on Thursday October 18, 2001 @04:59PM from the do-not-pass-go-do-not-collect-200-email-addresses dept.

primetyme writes: "Sick of email-harvesting spam robots cruising your Apache-based site? Here's an in-depth article that shows one way you can configure a base Apache installation to keep those nasty bots off your site - and the spam out of your Inbox." Anything that helps annoy spammers is a good thing.

It won't work long by anothernobody · 2001-10-18 17:09 · Score: 3, Insightful

Checking the user agent won't work for long - how hard will it be for the spammers to change the user agent to "Mozilla..."?

Using some client-side JavaScript would be harder for them to deal with (although if your browser can view it, they will be able to as well).

I guess graphics would be next...

--
Surfing slowly, in the Bandwidth Ghetto
You can't win an arms race by CmdrTroll · 2001-10-18 17:16 · Score: 5, Insightful
The premise behind this article is patently ridiculous. Spambots are voluntarily identifying themselves, and any spambot author with an ounce of common sense will simply change their user-agent string to the standard "Mozilla 4.0 (Microsoft Internet Explorer 5.5)" string that every Windows client uses. A well-designed spambot is indistinguishable from a valid user, or Google, or ht://dig.
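To make the objection concrete, here is a minimal sketch of a user-agent blocklist and the one-line change that defeats it. The blocked substrings are illustrative placeholders, not the list from the article, and the function name is hypothetical.

```python
# Illustrative blocklist; these substrings are assumptions, not the article's list.
BLOCKED_AGENT_SUBSTRINGS = ("emailsiphon", "emailwolf")


def is_blocked_agent(user_agent):
    """Return True if the User-Agent header contains a blocked substring."""
    ua = user_agent.lower()
    return any(fragment in ua for fragment in BLOCKED_AGENT_SUBSTRINGS)


# A harvester that announces itself is caught...
honest_bot = "EmailSiphon"
# ...but the same harvester sending a browser-style string is not.
spoofed_bot = "Mozilla 4.0 (Microsoft Internet Explorer 5.5)"

print(is_blocked_agent(honest_bot))
print(is_blocked_agent(spoofed_bot))
```

The check only ever sees what the client chooses to send, which is why it cannot separate a spoofing harvester from a real browser.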
On the other hand, there are ways to fight spambots; they just don't rely on trusting the user. Here's one way:
- Buy a domain.
- Set up a cgi that generates a unique email address @ that domain for every visitor. Log the address used, the date/time of visit, the visitor's IP, and other characteristics (user-agent?) of the visitor.
- Use the logged data to block the user when spam mail gets sent to one of the random accounts.
- Use the logged data as evidence to present to the offender's ISP, to get their fast connection pulled.
- Find a way to automate this on a large scale, then get a bunch of sysadmins together to sue and prosecute the spammer for abuse of resources.
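The first three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the domain `trap.example.com`, the secret key, the in-memory `visit_log`, and both function names are hypothetical, and a real CGI would write the log to disk or a database.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"          # hypothetical secret, keeps tokens unguessable
TRAP_DOMAIN = "trap.example.com"   # hypothetical domain bought for this purpose

# In-memory stand-in for the visit log: local part -> details of the visit.
visit_log = {}


def make_trap_address(ip, user_agent, now=None):
    """Generate a unique address for this visit and log who received it."""
    now = time.time() if now is None else now
    message = f"{ip}|{user_agent}|{now}".encode()
    token = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()[:16]
    visit_log[token] = {"ip": ip, "user_agent": user_agent, "time": now}
    return f"{token}@{TRAP_DOMAIN}"


def lookup_visitor(recipient):
    """Given the recipient of a spam message, return the logged visit or None."""
    local_part, _, domain = recipient.partition("@")
    if domain.lower() != TRAP_DOMAIN:
        return None
    return visit_log.get(local_part.lower())
```

When spam arrives at one of these addresses, `lookup_visitor` returns the IP, user agent, and timestamp of the visit that harvested it, which is the evidence the later steps rely on.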
There are good ways to deal with spammers but this isn't one of them. It *might* work on a small scale and it definitely won't work on a medium or large scale. It's about as useful as the Sendmail "MX/domain validation" trick that Eric Raymond and the rest of the Sendmail team thought would stop spammers dead in their tracks. (It didn't.) Instead he was "surprised by spam."
-CT
1. Re:You can't win an arms race by primetyme · 2001-10-19 02:33 · Score: 3, Insightful
  
  That's pretty much what I do in the Hook, line, and sinker section of the article.. By capturing the user agents and IPs of the Spiderts that *blatantly* disregard the robots.txt file, it's like shootin' fish in a barrel..
  In the next installment of this article, I'm working on a script that grabs the NetBlock of a bot that goes against the robots.txt file, does an ARIN lookup on that block, and emails the administrator of that block with the prob.. Comments have been made that any bot can switch their user-agent string, which is true. If a Spidert does that though, they're more than likely also going to run through the parts of a site that you *specifically* tell them they can't go in the robots.txt file. When they do that, it's a lot easier to block their user agent, email the admin of their netblock, or block their class C IP block altogether.
  It's like a honeypot for black-hats if you think about it.. And that's one of the *best* ways to find the problem Spiderts and block them out, without blocking any good-natured bot :)
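A rough sketch of the honeypot idea described above: scan an Apache access log for requests to a path that robots.txt disallows, and collect each offender's IP, class C (/24) block, and user agent. It assumes the combined log format with IPv4 addresses, and the `/trap/` path and function name are hypothetical. The ARIN lookup and the email to the netblock administrator are not shown.

```python
import ipaddress
import re

TRAP_PREFIX = "/trap/"  # hypothetical path listed under Disallow in robots.txt

# Combined log format: host ident user [time] "METHOD path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def find_offenders(log_lines, trap_prefix=TRAP_PREFIX):
    """Return (ip, /24 block, user agent) for every request into the trap path."""
    offenders = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a combined-format line
        ip, path, agent = match.groups()
        if not path.startswith(trap_prefix):
            continue  # ordinary request, nothing to report
        try:
            block = ipaddress.ip_network(f"{ip}/24", strict=False)
        except ValueError:
            continue  # logged host is a name, not an address
        offenders.append((ip, str(block), agent))
    return offenders
```

Because the trap path is only ever mentioned in robots.txt as off limits, a well-behaved crawler never shows up in the result, whatever user agent the offenders claim.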
Re:WebPoison anyone? by Anonymous Coward · 2001-10-19 08:38 · Score: 2, Insightful

It's not exactly what you mean, but something similar is The Book of Infinity. It doesn't generate email addresses, but it does generate an infinite website.
Re:WebPoison anyone? by asackett · 2001-10-29 07:40 · Score: 2, Insightful

It's called wpoison, and it's found at http://www.monkeys.com/wpoison/. The problem is that it's very easy to detect -- note the lack of punctuation marks, scarcity of two- and three-letter words, capital letters, verbs... and the fact that there's a four-second pause in the same place, page after page... in short, it would be easy enough to spot a wpoison-generated page.

I've coded up an alternative that suffers none of those obvious defects, and instead of throwing out bogus email addresses, it throws out valid spamcatcher addresses. Any SMTP host that sends a message to one of those addresses is blocked (via DJB's rbldns) for a month from sending mail into my domain. The blocklist is self-maintaining, so I never need to mess with it.

It's been in place for about three months now, and my blocklist contains 125 entries right now -- five of which are netblocks I've manually added. The URL, sure to catch a bucketful of bad spiders thanks to this link, is http://www.artsackett.com/personnel/ and it is intentionally as slow as the rectification of sin.
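For readers wondering what "self-maintaining" might look like, here is a minimal sketch of a blocklist whose entries expire on their own. It is not the code described above: the class name, the 30-day approximation of "a month", and the in-memory storage are all assumptions, and feeding the result into rbldns is left out.

```python
import time

BLOCK_SECONDS = 30 * 24 * 3600  # "a month", approximated here as 30 days


class ExpiringBlocklist:
    """Block any host that mails a spamcatcher address; entries lapse by themselves."""

    def __init__(self, trap_addresses, ttl=BLOCK_SECONDS):
        self.trap_addresses = {address.lower() for address in trap_addresses}
        self.ttl = ttl
        self.expires = {}  # sender IP -> time at which the block ends

    def record_delivery(self, sender_ip, recipient, now=None):
        """Call for each incoming message; returns True if the sender was blocked."""
        now = time.time() if now is None else now
        if recipient.lower() in self.trap_addresses:
            self.expires[sender_ip] = now + self.ttl
            return True
        return False

    def is_blocked(self, sender_ip, now=None):
        """Check a sender, dropping the entry if its block has run out."""
        now = time.time() if now is None else now
        expiry = self.expires.get(sender_ip)
        if expiry is None:
            return False
        if now >= expiry:
            del self.expires[sender_ip]
            return False
        return True
```

Since expiry is checked at lookup time, nobody has to prune the list by hand, and a host that hits a spamcatcher address again simply has its block renewed.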

--
Warning: This signature may offend some viewers.
Re:Good for the goose by Snootch · 2001-10-30 07:45 · Score: 3, Insightful

One big difference - MSN discriminated against valid browsers that were just people trying to view their website. The user agent IDs here (with a coupla exceptions - *cough* wget *cough*) are all things that are only ever used for spam purposes. There is a difference between blocking people because they don't use your software and blocking spam robots.