Slashdot Mirror


Robotcop: It's the Law

Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop, and it's ready for action. We believe it's the best solution currently available for protecting Apache webservers from spiders. Install it and help us make life hell for e-mail harvesting software!"

3 of 54 comments

  1. What about spoofing spiders? by regen · Score: 3, Interesting
    It's an interesting idea, but it looks like spiders have to be well behaved to get caught. If a spider never reads the robots.txt file and claims to be a friendly user agent (not a spider), it seems the only way it could get caught is by falling into a trap directory. That doesn't seem likely.

    How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider that gets around the restrictions imposed by Robotcop.

    Am I missing something?
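
The trap-directory mechanism discussed in this comment can be sketched roughly as follows. This is an illustration in Python, not Robotcop's actual code (Robotcop is an Apache module): the `/trap/` path, the `TrapTracker` class, and the block-by-IP policy are all assumptions made for the example. It shows that a spider claiming a browser user agent is caught only once it requests a path that robots.txt disallows.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every client is told to stay out of /trap/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


class TrapTracker:
    """Blocks any IP that requests a path robots.txt disallows (assumed policy)."""

    def __init__(self, robots_txt: str) -> None:
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.blocked: set[str] = set()

    def handle(self, ip: str, user_agent: str, path: str) -> bool:
        """Return True if the request should be served, False if refused."""
        if ip in self.blocked:
            return False
        # The claimed user agent does not help the client here: the rule
        # above applies to "*", so any visit to /trap/ springs the trap.
        if not self.parser.can_fetch(user_agent, path):
            self.blocked.add(ip)
            return False
        return True


tracker = TrapTracker(ROBOTS_TXT)
tracker.handle("1.2.3.4", "Mozilla/5.0", "/index.html")      # served
tracker.handle("1.2.3.4", "Mozilla/5.0", "/trap/page.html")  # refused, IP blocked
tracker.handle("1.2.3.4", "Mozilla/5.0", "/index.html")      # refused from now on
```

Note that the sketch would block a human who followed a visible link into `/trap/` just as readily, which is the concern raised in this comment, and a spider that simply never requests `/trap/` is never caught.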

  2. Re:Arms Races by J'raxis · Score: 2, Interesting

    What about a bot set to change user-agents on the fly? Just collect the few most popular UAs from other people's website logs, and use each one at random. Add in a list of open proxies to bounce through and you have a nearly undetectable spider at work. I believe I can do this in about a dozen lines of Perl.

    Maybe you could thwart this by seeing if there are traversal patterns coming from all over the place ("GET /a" from 1.2.3.4, "GET /a/a.html" from 6.7.8.9, "GET /b" from 45.56.56.67, and so on), but that seems like a lot of work and could again be defeated by some randomization.
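
One way to picture the traversal-pattern check suggested here is the rough Python sketch below. It counts "handoffs": requests whose parent path was fetched earlier, but only by other IPs. The `count_handoffs` name, the use of the parent directory as a stand-in for "the page that linked here", and the idea that a high count is suspicious are all assumptions for illustration, not a tested detection method.

```python
import posixpath
from collections import defaultdict


def count_handoffs(requests):
    """Count requests that continue a crawl started by a different IP.

    `requests` is an iterable of (ip, path) pairs in arrival order.
    """
    seen_by = defaultdict(set)  # normalized path -> IPs that fetched it
    handoffs = 0
    for ip, path in requests:
        norm = path.rstrip("/") or "/"
        parent = posixpath.dirname(norm)
        if parent != norm:  # the root path has no parent to compare against
            fetchers = seen_by.get(parent)
            # Parent was fetched before, but never by this IP.
            if fetchers and ip not in fetchers:
                handoffs += 1
        seen_by[norm].add(ip)
    return handoffs


# The request sequence from the comment above.
log = [
    ("1.2.3.4", "/a"),
    ("6.7.8.9", "/a/a.html"),
    ("45.56.56.67", "/b"),
]
print(count_handoffs(log))
```

As the comment says, this is easy to defeat: a spider that randomizes its crawl order, or a site with many real visitors arriving through deep links, would blur the signal.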

  3. good for FTP sites by eufaula · Score: 2, Interesting

    on my network we have an http and ftp mirror of the Linux Kernel Archives (ftp3.us.kernel.org/www1.us.kernel.org), OpenBSD, Project Gutenberg, and ProFTPD. we have several distros as both ISOs and loose files, and all told, over 100 GB of data. these damn webbots crawl our site and index it, which takes DAYS, and seriously interferes with our ability to provide a useful http mirror. at one point recently, an altavista webbot was using about 10% of our T-1 and filled up /var with access_logs in a few hours. it's only gotten worse. i started blocking the bots based on their browser match type and it has helped a ton. the only problem with this method is that i have to go through it every day or 2 to keep the list current. this module looks to do exactly what i need. it won't be foolproof, but it will save me and countless others a ton of grunt work.
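
The kind of user-agent blocking described here might look roughly like the Python sketch below. The pattern names (`examplebot`, `exampleharvester`) are made-up placeholders, not real crawler names, and `is_blocked` is a hypothetical helper; the poster's actual setup is not shown in the comment.

```python
import re

# Hypothetical blocklist of user-agent substrings, matched case-insensitively.
# A list like this has to be edited by hand whenever a new bot shows up.
BLOCKED_UA = re.compile(r"examplebot|exampleharvester", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the claimed user agent matches the blocklist."""
    return bool(BLOCKED_UA.search(user_agent))


print(is_blocked("ExampleBot/2.1 (+http://bot.example/info)"))
print(is_blocked("Mozilla/5.0 (X11; Linux i686)"))
```

The check only sees what the client claims to be, so the list needs the constant upkeep described above and does nothing against a bot that sends a browser's user-agent string.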

    if you do happen to visit the site, the stupidspider and dumbbot dir/file are part of my current spider trap. just thought that i'd warn you.