Robotcop: It's the Law

Posted by michael on Monday March 11, 2002 @10:27AM from the you-have-ten-seconds-to-comply dept.

Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop and it's ready for action. We believe that it's the best solution currently available for protecting Apache webservers from spiders. Install it and help us make life hell for e-mail harvesting software!"

What about spoofing spiders? by regen · 2002-03-11 10:52 · Score: 3, Interesting

It's an interesting idea, but it looks like the spiders have to be well behaved to get caught. If the spider never reads the robots.txt file and claims to be a friendly user agent (not a spider), it seems the only way it could get caught is by falling into a trap directory. That doesn't seem likely.
How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider to get around the restrictions imposed by robotcop.
Am I missing something?

--
The Economics of Website Security
1. Re:What about spoofing spiders? by Voivod · 2002-03-11 11:03 · Score: 4, Informative
  
  You just put hidden links in your HTML which only a spider's HTML parser would notice and follow. This technique is already widely used by wpoison, which is a Perl CGI solution to the spider problem.
  
  Check out the robotcop.org site. It has examples of how to set all this up.
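
A minimal sketch of the hidden-link trap idea described above, written in Python rather than as Apache configuration. The `/trap/` path and every name in it are hypothetical; this is not Robotcop's or wpoison's actual setup, for which robotcop.org is the place to look.

```python
from html.parser import HTMLParser

# Hypothetical trap path, chosen only for this example.
TRAP_PATH = "/trap/"


def robots_txt() -> str:
    """Tell well-behaved robots to stay out of the trap directory."""
    return f"User-agent: *\nDisallow: {TRAP_PATH}\n"


def hidden_trap_link() -> str:
    """An anchor with no link text: a browser shows nothing to click,
    but a naive HTML parser still sees the href."""
    return f'<a href="{TRAP_PATH}"></a>'


def is_trap_hit(request_path: str) -> bool:
    """True when a request lands inside the trap directory."""
    return request_path.startswith(TRAP_PATH)


class LinkCollector(HTMLParser):
    """Stand-in for a spider's parser: collects every href it finds."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


page = "<p>Welcome.</p>" + hidden_trap_link()
spider = LinkCollector()
spider.feed(page)
trapped = [link for link in spider.links if is_trap_hit(link)]
```

A spider that honors `robots.txt` never requests the trap path, so only clients that ignore the exclusion and follow every parsed link end up in `trapped`.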
Re:Arms Races by friscolr · 2002-03-12 06:15 · Score: 4, Insightful

If you can think of ways to circumvent how Robotcop works, please point them out so we can figure out a solution!
looking over the technical review and the readme, a few initial, random, and sporadic thoughts:
the blocking of valid users seems rather annoying (NAT users, some proxy users) and a bad spider could get around the short interval by increasing its sleep time.
IPv6 could screw your implementation. If i have access to a huge number of IP addresses then i could access your website through any one of those addresses. A spider could run an initial probe of a few million websites through one ip, change ips, then grab a second page from all those websites, change ips, grab webpage, etc etc.
if i know a website is running robotcop, can i screw over valid users by forging my ip address, accessing robots.txt, then accessing a honeypot dir? can i screw over all users by cycling through all ips and doing this (yeah that's time consuming, maybe i could just screw over users from one range?)?
The main problem i see with the robotcop approach is that it assumes everyone who accesses robots.txt is a robot, and it assumes valid users will not follow certain paths through the website.
This is different for email poisoners b/c if i'm a user and i get to a page with a bunch of (invalid) email addresses, it doesn't matter. i click back and continue on my way. but for something that actually *blocks* users, it's a bit different.
As it stands now, i could go to an internet cafe (often they use nat) and block every other user from seeing any site protected by robotcop.
How about tying both User-Agent and IP address to form valid/invalid users? that way a bad user behind NAT might get blocked while a good user could go on. The more information you can tie to one particular thread of access, the more likely you are to single out one particular user.
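
A rough sketch of the suggestion above, keying the block list on the (IP address, User-Agent) pair instead of the IP alone. `Client` and `Blocklist` are hypothetical names invented for this example, not part of Robotcop.

```python
from typing import NamedTuple


class Client(NamedTuple):
    """One 'thread of access': the address plus the claimed browser."""
    ip: str
    user_agent: str


class Blocklist:
    """Blocks (ip, user_agent) pairs rather than whole IP addresses."""

    def __init__(self) -> None:
        self._blocked = set()  # set of Client tuples

    def block(self, ip: str, user_agent: str) -> None:
        self._blocked.add(Client(ip, user_agent))

    def is_blocked(self, ip: str, user_agent: str) -> bool:
        return Client(ip, user_agent) in self._blocked


blocklist = Blocklist()
# A bad spider behind a shared NAT address gets blocked...
blocklist.block("203.0.113.7", "EvilBot/1.0")
# ...while a browser user behind the same address is unaffected.
browser_blocked = blocklist.is_blocked("203.0.113.7", "Mozilla/4.0")
```

The User-Agent header is supplied by the client, so a spider that changes it on every request would slip past this check; the pair only narrows collateral damage for users sharing an address.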
Instead of only blocking ips that seem to be bad spiders, why not feed them specific information? that way if it is a user you can let them go on - "if you are a valid user, enter the word in the graphic below in this text field and click 'ok'!"
It really seems that whatever you do, it is possible to work around. Set cookies? i write a bot that keeps track of cookies. hidden webbugs/urls? my bot avoids these.
I can see robotcop as working in small cases, like for a limited number of servers on the internet, b/c then it is not worth the bot writer's time to implement workarounds. But once it becomes worth their time, you have a game of evolution.
Not that that's bad; keep a small enough base of users and you probably won't need to update methods all that often.

--
-f
www.blackant.net
Re:But aren't poisoned addresses just stupid? by friscolr · 2002-03-12 18:45 · Score: 3, Informative

3 things...
1- why not add valid addresses that get sent to /dev/null? e.g. aaaaaa@example.com through zzzzzz@example.com. you'd get a substantial amount of traffic for that many addresses, but you could adjust the number of addresses to whatever your bandwidth/server could handle.
2- what are the legalities involved in creating a webpage with a specific email address, perhaps send.mail.here.to.be.charged@example.com, and placing it *only* on a webpage with a blatant notice of "If you mail this address you allow me to charge you $100/byte sent to this address" or a more specific terms of use (in order to encompass selling the address to others) and then charge once you get mail to that address? could a terms of use be created that would make getting money legal?
3 - how many of you use Matt Wright's (*shudder* when you hear his name) formmail? how many of you use fake formmail scripts?
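
For point 1, a small sketch of how the aaaaaa@example.com through zzzzzz@example.com range could be enumerated. The function name `sink_addresses` and its parameters are made up for this example; the full six-letter range is 26^6 = 308,915,776 addresses, so the generator is lazy and the `length` parameter is the knob for scaling it down.

```python
import itertools
import string


def sink_addresses(length: int = 6, domain: str = "example.com"):
    """Yield every all-lowercase local part of the given length,
    in order from 'aaa...a' to 'zzz...z', at the given domain."""
    for letters in itertools.product(string.ascii_lowercase, repeat=length):
        yield "".join(letters) + "@" + domain


# Take only the first few; materializing the whole range would be huge.
first = list(itertools.islice(sink_addresses(), 3))

# A shorter local part gives a smaller pool: 26**2 == 676 addresses.
small_pool = sum(1 for _ in sink_addresses(length=2))
```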
for a while now i've been using a fake formmail script that only prints out a webpage saying "thank you for using this script" but doesn't actually send mail. Some people see that output (ignoring the html comment that says "I HATE YOU YOU STUPID PIECE OF SHIT"), think the script has worked, and run a program to submit spam to the script to "send" mail to a few thousand addresses.
so far my fake script has saved thousands of addresses from getting spam. some people test the script with their own address first and then don't come back when they don't get the mail, but i could modify the script to send out the first mail from an ip, but not the subsequent mail.
but i'm wondering, has anyone else done work on this or heard of work like this?
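
A sketch of the modification floated above: deliver only the first message from each IP and silently drop the rest, while always returning the same "success" page. This is not the poster's script; `FakeFormmail` and the `deliver` callback are hypothetical names for illustration.

```python
class FakeFormmail:
    """Pretends to be a mail form: the caller always sees success,
    but only the first submission per IP is actually delivered."""

    THANKS = "thank you for using this script"

    def __init__(self, deliver) -> None:
        self._deliver = deliver   # callable(message), supplied by the caller
        self._seen_ips = set()    # IPs that have already had one mail sent

    def handle(self, ip: str, message: str) -> str:
        if ip not in self._seen_ips:
            # First contact: let the test mail through so the script looks real.
            self._seen_ips.add(ip)
            self._deliver(message)
        # Later submissions are dropped, but the response is identical.
        return self.THANKS


sent = []
form = FakeFormmail(deliver=sent.append)
form.handle("198.51.100.1", "test to my own address")
form.handle("198.51.100.1", "spam run, message 1")
reply = form.handle("198.51.100.1", "spam run, message 2")
```

A real version would have to persist the seen-IP set between requests, since a CGI script starts fresh on every invocation.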
If you don't understand what i'm talking about in point 3- "Matt Wright" (is he a real person?) has a series of scripts, one is formmail.pl which allows mail to be sent to any address. some people search for servers with formmail.pl on them and use those scripts to pseudonymously send mail to other people. We had seen this quite a bit at work, which inspired me to create the fake formmail.pl.
are there any other common scripts like formmail.pl that could be faked in the same manner?

--
-f
www.blackant.net