Robotcop: It's the Law
Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop and it's ready for action. We believe that it's the best solution to protecting Apache webservers from spiders currently available. Install it and help us make life hell for e-mail harvesting software!"
How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider to get around the restrictions imposed by robotcop.
Am I missing something?
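On the spider side of that question, a trap directory is normally listed in robots.txt, so a spider that obeys the file never requests it and only the ones that ignore it get caught. A minimal sketch of that check, assuming a hypothetical /trap/ path and made-up robots.txt contents (this is not Robotcop's actual configuration, and it says nothing about how human visitors are kept out):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the /trap/ path is an assumption
# made for illustration, not taken from Robotcop.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def may_fetch(path, user_agent="ExampleBot"):
    """Return True if a spider that obeys robots.txt may request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/trap/index.html"):
        print(path, may_fetch(path))
```

A spider written to get around the module would presumably just run the same check and skip anything disallowed, which is the behavior the site owner wanted in the first place.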
The Economics of Website Security
What about a bot set to change user-agents on the fly? Just collect the few most-popular UAs from other people's website logs, and use each one at random. Add in a list of open proxies to bounce through and you have a nearly undetectable spider at work. I believe I can do this in about a dozen lines of Perl.
Maybe you could thwart this by seeing if there are traversal patterns coming from all over the place ("GET /a" from 1.2.3.4, "GET /a/a.html" from 6.7.8.9, "GET /b" from 45.56.56.67, and so on), but that seems like a lot of work and could again be defeated by some randomization.
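A rough sketch of what that traversal check might look like, assuming the log has already been parsed into (ip, path) pairs in request order. The function name, the window size, and the "parent path seen recently from a different IP" rule are all assumptions chosen for illustration; ordinary visitors would trip it too, and a spider that randomizes its crawl order would slip past, as noted above.

```python
import posixpath
from collections import deque


def cross_ip_descents(requests, window=20):
    """Count requests whose parent path was fetched recently by another IP.

    `requests` is an iterable of (ip, path) pairs in log order.
    `window` is how many previous requests to remember (an arbitrary default).
    """
    recent = deque(maxlen=window)  # most recent (ip, normalized path) pairs
    hits = 0
    for ip, path in requests:
        normalized = path.rstrip("/") or "/"
        parent = posixpath.dirname(normalized) or "/"
        # A hit means some *other* address just asked for this path's parent.
        if any(p == parent and other != ip for other, p in recent):
            hits += 1
        recent.append((ip, normalized))
    return hits


if __name__ == "__main__":
    log = [
        ("1.2.3.4", "/a"),
        ("6.7.8.9", "/a/a.html"),
        ("45.56.56.67", "/b"),
    ]
    print(cross_ip_descents(log))
```

A high count relative to total traffic would only be a hint that one crawler is spread across many addresses, not proof of it.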
Liberty in your lifetime
On my network we have an http and ftp mirror of the Linux Kernel Archives (ftp3.us.kernel.org/www1.us.kernel.org), OpenBSD, Project Gutenberg, and ProFTPD. We have several distros in both ISOs and loose files, and all told, over 100 GB of data. These damn webbots crawl our site and index it, which takes DAYS, and seriously interferes with our ability to provide a useful http mirror. At one point recently, an AltaVista webbot was using about 10% of our T-1 and filled up /var with access_logs in a few hours. It's only gotten worse. I started blocking the bots based on their browser match type and it has helped a ton. The only problem with this method is that I have to go through it every day or two to keep current. This module looks to do exactly what I need. It won't be foolproof, but it will save me and countless others a ton of grunt work.
If you do happen to visit the site, the stupidspider and dumbbot dir/file are part of my current spider trap. Just thought that I'd warn you.
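For anyone curious what the manual user-agent blocking amounts to, here is a minimal sketch, assuming a hand-maintained list of substrings. The names in the list are placeholders, not the bots actually blocked on this mirror, and a real setup would do the match in the web server configuration rather than in a script.

```python
# Hypothetical substrings; a real list would come from reviewing access logs
# every day or two, which is exactly the grunt work described above.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot", "emailharvester")


def is_blocked(user_agent, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Return True if the User-Agent header contains a blocked substring."""
    ua = (user_agent or "").lower()  # a missing header counts as empty
    return any(fragment in ua for fragment in blocked)


if __name__ == "__main__":
    for agent in (
        "Mozilla/4.0 (compatible; ExampleBot/1.0)",
        "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)",
    ):
        print(agent, is_blocked(agent))
```

The weakness is the one raised earlier in the thread: a bot that sends a browser's user-agent string passes straight through, so the list only stops spiders that identify themselves.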