Robotcop: It's the Law

← Back to Stories (view on slashdot.org)

Posted by michael on Monday March 11, 2002 @10:27AM from the you-have-ten-seconds-to-comply dept.

Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop and it's ready for action. We believe that it's the best solution to protecting Apache webservers from spiders currently available. Install it and help us make life hell for e-mail harvesting software!"

3 of 54 comments (clear)

Min score:

Reason:

Sort:

What about spoofing spiders? by regen · 2002-03-11 10:52 · Score: 3, Interesting

It's an interesting idea, but it looks like the spiders have to be well behaved to get caught. If the spider never reads the robots.txt file and it claims to be a friendly user agent (not a spider) it seems the only way it could get caught is if it falls into a trap directory. This doesn't seem likely.
How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider to get around the restrictions imposed by robotcop.
Am I missing something?

--
The Economics of Website Security
Re:Arms Races by J'raxis · 2002-03-12 07:22 · Score: 2, Interesting

What about a bot set to change user-agents on the fly? Just collect the few most-popular UAs from other peoples website logs, and use each one at random. Add in a list of open proxies to bounce through and you have a nearly undetectable spider at work. I believe I can do this in about a dozen lines of Perl.

Maybe you could thwart this by seeing if there are traversal patterns coming from all over the place ("GET /a" from 1.2.3.4, "GET /a/a.html" from 6.7.8.9, "GET /b" from 45.56.56.67, and so on, but that seems like a lot of work and could again be defeated by some randomization.

--
Liberty in your lifetime
good for FTP sites by eufaula · 2002-03-12 07:35 · Score: 2, Interesting

on my network we have a http and ftp mirror of the Linux Kernel Archives (ftp3.us.kernel.org/www1.us.kernel.org), OpenBSD, Project Gutenberg, and ProFTPD. we have several distro's in both ISO's and loose files, and all told, over 100gb of data. these damn webbots crawl our site and index it, which takes DAYS, and seriously interferes with our ability to provide a useful http mirror. at one point recently, an altavista webbot was using about 10% of our T-1 and filled up /var with access_logs in a few hours. its only gotten worse. i started blocking the bots based on their browser match type and it has helped a ton. but the only problem with this method is that i have to go through it every day or 2 to keep current. this module looks to do exactly what i need. it wont be foolproof, but will save me and countless others a ton of grunt work.

if you do happen to visit the site, the stupidspider and dumbbot dir/file are part of my currrent spider trap. just thought that i'd warn you.