Robotcop: It's the Law
Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop and it's ready for action. We believe that it's the best solution to protecting Apache webservers from spiders currently available. Install it and help us make life hell for e-mail harvesting software!"
How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider to get around the restrictions imposed by Robotcop.
Am I missing something?
The Economics of Website Security
The project says that it feeds malignant spiders poisoned addresses. Don't people check their lists for addresses that don't deliver? Is this useful? I like the teergrube idea better. Can you modify Apache to do this?
I'm a conscientious
Unfortunately some good robots have been known to ignore robots.txt. Fast has in the past fallen into my test honeypot, and I would hate to accidentally block someone like Google.
*shudders*
no sig.
Spiders that follow the rules, of course, can be detected, so what you need is something more to stop spiders that don't, or those that know how not to get stuck in tarpits, and those spoofing other clients and not reading robots.txt. The easiest way I can think to do this would be to count how many hits a particular IP has to your server in relation to individual pages. The more unique pages it pulls in a minute, the slower (geometrically) the connection should get. That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute and only see a 100ms slowdown, probably not even noticeable, but a spider pulling 100 pages will see a 1000ms slowdown, and pulling 200 pages will result in a 10000ms slowdown per page. Sure they can eventually download all the pages, but make it take a week to do it. Combine that with what you already have and it will make for a very unpleasant spidering experience.
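A minimal sketch of that geometric slowdown, in Python rather than as an Apache module. The 60-second window and the formula (100 ms base, tenfold per 100 unique pages) are assumptions fitted to the 100-page and 200-page figures above; with this curve the 30-40 link case comes out at roughly 200-250 ms rather than 100 ms. The class name `Throttle` and its method are hypothetical.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0       # assumption: "in a minute" is a sliding window
BASE_DELAY_MS = 100.0       # assumption: baseline delay
PAGES_PER_TENFOLD = 100.0   # assumption: delay grows 10x per 100 unique pages


class Throttle:
    """Tracks unique pages fetched per IP and suggests a delay."""

    def __init__(self):
        # ip -> deque of (timestamp, path), oldest first
        self._hits = defaultdict(deque)

    def delay_ms(self, ip, path, now=None):
        """Record one request and return the suggested delay in ms."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        hits.append((now, path))
        # Drop requests that have fallen out of the window.
        while hits and now - hits[0][0] > WINDOW_SECONDS:
            hits.popleft()
        unique_pages = len({p for _, p in hits})
        return BASE_DELAY_MS * 10 ** (unique_pages / PAGES_PER_TENFOLD)
```

The server would sleep for the returned number of milliseconds before answering. Repeated hits on the same page do not raise the delay, only new pages do.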
"Your superior intellect is no match for our puny weapons!"
They really seem to catch some weird things that I never thought might be wandering around on my website. I recommend lifting the ban on anyone after a while, though, because you can (almost) never be too certain what you've banned.
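One way to lift bans automatically, sketched in Python. The 24-hour default and the `BanList` name are assumptions for illustration, not anything Robotcop is known to do.

```python
import time

DEFAULT_BAN_SECONDS = 24 * 60 * 60  # assumption: bans last one day


class BanList:
    """Bans that expire on their own after a fixed time."""

    def __init__(self, ban_seconds=DEFAULT_BAN_SECONDS):
        self._ban_seconds = ban_seconds
        self._expires = {}  # ip -> time at which the ban ends

    def ban(self, ip, now=None):
        now = time.time() if now is None else now
        self._expires[ip] = now + self._ban_seconds

    def is_banned(self, ip, now=None):
        now = time.time() if now is None else now
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            # Ban has run out: forget it so the table does not grow forever.
            del self._expires[ip]
            return False
        return True
```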
looking over the technical review and the readme, a few initial, random, and sporadic thoughts:
the blocking of valid users seems rather annoying (NAT users, some proxy users) and a bad spider could get around the short interval by increasing its sleep time.
IPv6 could screw your implementation. If i have access to a huge number of IP addresses then i could access your website through any one of those addresses. A spider could run an initial probe of a few million websites through one ip, change ips, then grab a second page from all those websites, change ips, grab webpage, etc etc.
if i know a website is running robotcop, can i screw over valid users by forging my ip address, accessing robots.txt, then accessing a honeypot dir? can i screw over all users by cycling through all ips and doing this (yeah that's time consuming, maybe i could just screw over users from one range?)?
The main problems i see with the robotcop approach are that it assumes everyone who accesses robots.txt is a robot and it assumes valid users will not follow certain paths through the website.
This is different for email poisoners b/c if i'm a user and i get to page with a bunch of (invalid) email addresses, it doesn't matter. i click back and continue on my way. but for something that actually *blocks* users, it's a bit different.
As it stands now, i could go to an internet cafe (often they use nat) and block every other user from seeing any site protected by robotcop.
How about tying both User-Agent and IP address to form valid/invalid users? that way a bad user behind NAT might get blocked while a good user could go on. The more information you can tie to one particular thread of access, the more likely you are to single out one particular user.
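A rough sketch of that idea in Python: the block key is the (IP, User-Agent) pair rather than the IP alone. The `ClientBlocklist` name and the sample values are made up for illustration.

```python
class ClientBlocklist:
    """Blocks clients by (IP, User-Agent) pair instead of by IP alone."""

    def __init__(self):
        self._blocked = set()  # set of (ip, user_agent) tuples

    def block(self, ip, user_agent):
        self._blocked.add((ip, user_agent))

    def is_blocked(self, ip, user_agent):
        return (ip, user_agent) in self._blocked


blocklist = ClientBlocklist()
# A bad spider behind a NAT gateway trips the trap...
blocklist.block("10.0.0.1", "EvilHarvester/1.0")
# ...but a browser sharing the same address is still let through.
same_nat_browser_blocked = blocklist.is_blocked("10.0.0.1", "Mozilla/5.0")
```

A spider that sends a common browser's User-Agent string would still share a key with real users of that browser behind the same address, so this only narrows the collateral damage.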
Instead of only blocking ips that seem to be bad spiders, why not feed them specific information? that way if it is a user you can let them go on - "if you are a valid user, enter the word in the graphic below in this text field and click 'ok'!"
It really seems that whatever you do, it is possible to work around. Set cookies? i write a bot that keeps track of cookies. hidden webbugs/urls? my bot avoids these.
I can see robotcop as working in small cases, like for a limited number of servers on the internet, b/c then it is not worth the bot writer's time to implement workarounds. But once it becomes worth their time, you have a game of evolution.
Not that that's bad; keep a small enough base of users and you probably won't need to update methods all that often.
-f
www.blackant.net
What about a bot set to change user-agents on the fly? Just collect the few most-popular UAs from other people's website logs, and use each one at random. Add in a list of open proxies to bounce through and you have a nearly undetectable spider at work. I believe I can do this in about a dozen lines of Perl.
Maybe you could thwart this by seeing if there are traversal patterns coming from all over the place ("GET /a" from 1.2.3.4, "GET /a/a.html" from 6.7.8.9, "GET /b" from 45.56.56.67, and so on), but that seems like a lot of work and could again be defeated by some randomization.
Liberty in your lifetime
on my network we have an http and ftp mirror of the Linux Kernel Archives (ftp3.us.kernel.org/www1.us.kernel.org), OpenBSD, Project Gutenberg, and ProFTPD. we have several distros in both ISOs and loose files, and all told, over 100GB of data. these damn webbots crawl our site and index it, which takes DAYS, and seriously interferes with our ability to provide a useful http mirror. at one point recently, an altavista webbot was using about 10% of our T-1 and filled up /var with access_logs in a few hours. it's only gotten worse. i started blocking the bots based on their browser match type and it has helped a ton. but the only problem with this method is that i have to go through it every day or 2 to keep current. this module looks to do exactly what i need. it won't be foolproof, but will save me and countless others a ton of grunt work.
if you do happen to visit the site, the stupidspider and dumbbot dir/file are part of my current spider trap. just thought that i'd warn you.
There are infinite variations on this theme, like the transparent gif you mentioned, which makes it very difficult for evil spiders to avoid them. Just make sure you test with lynx/w3m first.
Just make sure that it takes at least two steps to get from content to the honeypot. This way, it becomes much more difficult to accidentally tab to a link and activate it, shutting off an entire ISP's proxied access to the web server.
Will I retire or break 10K?
That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute
What about over 20 Million Members on one ISP's proxy? A story circulating around several tech news sites (about the high likelihood of AOL 8 using Mozilla's Gecko engine) places AOL's U.S. market share at about 30%. Do you really want to drive away 30% of your audience? What about the billion-plus people behind China's NAT?
Will I retire or break 10K?
The Robotcop download page states that no binaries are available for versions of Apache HTTP Server designed for M$ Windows, and the binaries that do exist (for Red Hat Linux x86 and FreeBSD x86) aren't very compatible with mod_ssl.
"So compile it yourself!" For one thing, according to the compilation instructions, those who want to compile Robotcop for Windows will have to wait a year (estimated) until Apache 2.0 is no longer beta but Released. For another, not everybody can afford a license for M$ Visual Studio, which is required to build Apache HTTP Server; apparently, this popular Win32 version of GCC doesn't cut it.
In other words, Robotcop won't work for consumers who serve web pages from their home workstation that runs Windows.
Will I retire or break 10K?