Robotcop: It's the Law

Posted by michael on Monday March 11, 2002 @10:27AM from the you-have-ten-seconds-to-comply dept.

Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop and it's ready for action. We believe that it's the best solution currently available for protecting Apache webservers from spiders. Install it and help us make life hell for e-mail harvesting software!"

15 of 54 comments

What about spoofing spiders? by regen · 2002-03-11 10:52 · Score: 3, Interesting

It's an interesting idea, but it looks like the spiders have to be well behaved to get caught. If the spider never reads the robots.txt file and it claims to be a friendly user agent (not a spider), it seems the only way it could get caught is if it falls into a trap directory. This doesn't seem likely.
How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider to get around the restrictions imposed by robotcop.
Am I missing something?

--
The Economics of Website Security
1. Re:What about spoofing spiders? by Voivod · 2002-03-11 11:03 · Score: 4, Informative
  
  You just put hidden links in your HTML which only a spider's HTML parser would notice and follow. This technique is already widely used by wpoison which is a Perl CGI solution to the spider problem.
  
  Check out the robotcop.org site. It has examples of how to set all this up.
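
The hidden-link trap described above can be sketched in a few lines. This is only an illustration of the idea, not Robotcop's or wpoison's actual configuration: the `/trap/` path and all function names are hypothetical.

```python
from html import escape

# Hypothetical trap location; not a path taken from Robotcop's documentation.
TRAP_PREFIX = "/trap/"


def robots_txt() -> str:
    """Tell well-behaved robots to stay out of the trap directory."""
    return f"User-agent: *\nDisallow: {TRAP_PREFIX}\n"


def hidden_trap_link(target: str = TRAP_PREFIX + "index.html") -> str:
    """Return an anchor with no visible text.

    A person browsing the page has nothing to click on, while a spider
    that parses the HTML for href attributes will still find the link.
    """
    return f'<a href="{escape(target, quote=True)}"></a>'


def is_trap_hit(request_path: str) -> bool:
    """True when a request lands inside the trap directory."""
    return request_path.startswith(TRAP_PREFIX)
```

A client that requests a path for which `is_trap_hit` returns true either ignored robots.txt or never read it, which is the signal the trap relies on.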
2. Re:What about spoofing spiders? by regen · 2002-03-11 11:13 · Score: 2
  
  The idea of a hidden link that the spider would follow seems like a good idea, but I don't see an example on the robotcop site. Do you have a direct link?
  
  I guess I could see setting this up similar to a web bug: a 1x1 pixel image with a link the same color as the background. But you could then modify the spider not to follow those types of links.
  
  I think this will be better than nothing, but you start to enter a robotcop / spider arms race.
  
  --
  The Economics of Website Security
But aren't poisoned addresses just stupid? by gartogg · 2002-03-11 11:22 · Score: 2

The project says that it feeds malignant spiders poisoned addresses. Don't people check their lists for addresses that don't deliver? Is this useful? I like the teergrube idea better. Can you modify Apache to do this?

--
I'm a conscientious .sig objector.
1. Re:But aren't poisoned addresses just stupid? by friscolr · 2002-03-12 18:45 · Score: 3, Informative
  
  3 things...
  1- why not add valid addresses that get sent to /dev/null? e.g. aaaaaa@example.com through zzzzzz@example.com. you'd get a substantial amount of traffic for that many addresses, but you could adjust the number of addresses to whatever your bandwidth/server could handle.
  2- what are the legalities involved in creating a webpage with a specific email address, perhaps send.mail.here.to.be.charged@example.com, and placing it *only* on a webpage with a blatant notice of "If you mail this address you allow me to charge you $100/byte sent to this address" or a more specific terms of use (in order to encompass selling the address to others) and then charge once you get mail to that address? could a terms of use be created that would make getting money legal?
  3 - how many of you use Matt Wright's (*shudder* when you hear his name) formmail? how many of you use fake formmail scripts?
  for a while now i've been using a fake formmail script that only prints out a webpage saying "thank you for using this script" but doesn't actually send mail. Some people see that output (ignoring the html comment that says "I HATE YOU YOU STUPID PIECE OF SHIT"), think the script has worked, and run a program to submit spam to the script to "send" mail to a few thousand addresses.
  so far my fake script has saved thousands of addresses from getting spam. some people test the script with their address first and then don't come back when they don't get the mail, but i could modify the script to send out the first mail from an ip, but not the subsequent mail.
  but i'm wondering, has anyone else done work on this or heard of work like this?
  If you don't understand what i'm talking about in point 3- "Matt Wright" (is he a real person?) has a series of scripts, one is formmail.pl which allows mail to be sent to any address. some people search for servers with formmail.pl on them and use those scripts to pseudonymously send mail to other people. We had seen this quite a bit at work, which inspired me to create the fake formmail.pl.
  are there any other common scripts like formmail.pl that could be faked in the same manner?
  
  --
  -f
  www.blackant.net
2. Re:But aren't poisoned addresses just stupid? by ShaunC · 2002-03-22 18:55 · Score: 2
  
  >but im wondering, has anyone else done work on this or heard of work like this?
  
  I wrote one a few weeks ago that catches FormMail probes and mails a warning message to the person who's probing. Since putting the script in place across several domains, I've seen a significant decrease in repeat offenders. I used to get scanned by the same people day in and day out (i.e. the recipient value in the GET requests was the same), and I had a few who'd scan me weekly. Not anymore.
  
  FWIW, the script is here. It's written in PHP, so you'll have to either redirect requests for formmail.pl to the PHP version, or use a CGI wrapper (hence the shebang line at the top of a PHP script :)
  
  I also like the idea of a FormMail honeypot. Basically such a script would accept and deliver the first message received from any IP address; this would be the test message indicating that the probe was successful, so you'd want to make sure it was actually sent. Subsequent accesses to the script from the same /8 over the next 24 hours would generate log entries but not actually send mail; the spammer would be spamming into a black hole, [complete the honeypot analogy here]. I'd do it myself but I don't care for the idea of someone hammering me with thousands and thousands of requests. I'd love to know if someone else sets something like this up, though.
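
The deliver-once rule in the honeypot idea above could look roughly like this. It is a sketch under assumptions, not the PHP script mentioned earlier: the class and method names are hypothetical, it handles IPv4 only, and the caller is expected to do the actual logging and mail delivery.

```python
import ipaddress
import time

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour window from the comment


class FormMailHoneypot:
    """Decide whether a FormMail request should really send mail."""

    def __init__(self):
        # Maps each /8 network to the time its test message was delivered.
        self._first_seen = {}

    def should_deliver(self, client_ip, now=None):
        """Return True for the first message from a /8, False afterwards.

        Once the 24-hour window has passed, the next message from that
        /8 is treated as a fresh probe and delivered again.
        """
        now = time.time() if now is None else now
        # strict=False lets a host address stand in for its enclosing /8.
        block = ipaddress.ip_network(f"{client_ip}/8", strict=False)
        first = self._first_seen.get(block)
        if first is None or now - first >= WINDOW_SECONDS:
            self._first_seen[block] = now
            return True
        return False
```

When `should_deliver` returns False, the script would print the usual success page and write a log entry, so the sender cannot tell the difference.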
  
  Shaun
  
  --
  Thanks to the War on Drugs, it's easier to buy meth than it is to buy cold medicine!
From experience... by perlyking · 2002-03-11 11:41 · Score: 2

Unfortunately some good robots have been known to ignore robots.txt. Fast has in the past fallen into my test honeypot; I would hate to accidentally block someone like Google.

*shudders*

--
no sig.
Needs More by gnovos · 2002-03-11 13:03 · Score: 2

Spiders that follow the rules, of course, can be detected, so what you need is something more to stop spiders that don't, or those that know how not to get stuck in tarpits, and those spoofing other clients and not reading robots.txt. The easiest way I can think of to do this would be to count how many hits a particular IP makes on your server in relation to individual pages. The more unique pages it pulls in a minute, the slower (geometrically) the connection should get. That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute and only see a 100ms slowdown, probably not even noticeable, but a spider pulling 100 pages will see a 1000ms slowdown, and pulling 200 pages will result in a 10000ms slowdown per page. Sure they can eventually download all the pages, but make it take a week to do it. Combine that with what you already have and it will make for a very unpleasant spidering experience.
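
A rough sketch of that throttle, assuming a delay that grows tenfold for every 100 unique pages an IP pulls within a minute. The constants are chosen to reproduce the 1000ms and 10000ms figures at 100 and 200 pages; at 30-40 pages they give roughly 200 to 250 ms, somewhat more than the 100ms quoted above. All names are hypothetical.

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60        # count unique pages over one minute
BASE_DELAY_MS = 100.0      # assumed baseline delay
PAGES_PER_TENFOLD = 100    # assumed: every 100 pages multiplies the delay by 10


def delay_ms(unique_pages):
    """Geometric slowdown as a function of unique pages in the window."""
    return BASE_DELAY_MS * 10 ** (unique_pages / PAGES_PER_TENFOLD)


class Throttle:
    """Track unique pages per IP and report how long to stall each request."""

    def __init__(self):
        # ip -> {path: time of the most recent request for that path}
        self._hits = defaultdict(dict)

    def record(self, ip, path, now=None):
        """Register a request and return the delay in milliseconds."""
        now = time.time() if now is None else now
        pages = self._hits[ip]
        pages[path] = now
        # Forget pages last requested more than one window ago.
        for stale in [p for p, t in pages.items() if now - t > WINDOW_SECONDS]:
            del pages[stale]
        return delay_ms(len(pages))
```

The server would sleep for the returned number of milliseconds before answering. Counting unique paths rather than raw hits means reloading one page does not increase the penalty.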

--
"Your superior intellect is no match for our puny weapons!"
perl examples by po_boy · 2002-03-12 04:18 · Score: 2

Pretty cool idea, and useful, too. There are some mod_perl modules available, too. There's an example in the mod_perl Developer's Cookbook and I wrote a simple one here.

They really seem to catch some weird things that I never thought might be wandering around on my website. I recommend lifting the ban on anyone after a while, though, because you can (almost) never be too certain what you've banned.
Re:Arms Races by friscolr · 2002-03-12 06:15 · Score: 4, Insightful

If you can think of ways to circumvent how Robotcop works, please point them out so we can figure out a solution!
looking over the technical review and the readme, a few initial, random, and sporadic thoughts:
the blocking of valid users seems rather annoying (NAT users, some proxy users) and a bad spider could get around the short interval by increasing its sleep time.
IPv6 could screw your implementation. If i have access to a huge number of IP addresses then i could access your website through any one of those addresses. A spider could run an initial probe of a few million websites through one ip, change ips, then grab a second page from all those websites, change ips, grab webpage, etc etc.
if i know a website is running robotcop, can i screw over valid users by forging my ip address, accessing robots.txt, then accessing a honeypot dir? can i screw over all users by cycling through all ips and doing this (yeah that's time consuming, maybe i could just screw over users from one range?)?
The main problems i see with the robotcop approach are that it assumes everyone who accesses robots.txt is a robot and it assumes valid users will not follow certain paths through the website.
This is different for email poisoners b/c if i'm a user and i get to a page with a bunch of (invalid) email addresses, it doesn't matter. i click back and continue on my way. but for something that actually *blocks* users, it's a bit different.
As it stands now, i could go to an internet cafe (often they use nat) and block every other user from seeing any site protected by robotcop.
How about tying both User-Agent and IP address to form valid/invalid users? that way a bad user behind NAT might get blocked while a good user could go on. The more information you can tie to one particular thread of access, the more likely you are to single out one particular user.
Instead of only blocking ips that seem to be bad spiders, why not feed them specific information? that way if it is a user you can let them go on - "if you are a valid user, enter the word in the graphic below in this text field and click 'ok'!"
It really seems that whatever you do, it is possible to work around. Set cookies? i write a bot that keeps track of cookies. hidden webbugs/urls? my bot avoids these.
I can see robotcop as working in small cases, like for a limited number of servers on the internet, b/c then it is not worth the bot writer's time to implement workarounds. But once it becomes worth their time, you have a game of evolution.
Not that that's bad; keep a small enough base of users and you probably won't need to update methods all that often.
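
The suggestion above to tie User-Agent and IP address together could be sketched as a block list keyed on the pair. This is not how Robotcop stores its bans. The names are hypothetical, and the one-hour expiry is an added assumption, in line with the advice elsewhere in this discussion to lift bans after a while.

```python
import time

BAN_SECONDS = 3600  # hypothetical ban length


class BlockList:
    """Block (IP, User-Agent) pairs rather than whole IP addresses."""

    def __init__(self):
        # (ip, user_agent) -> time at which the ban ends
        self._banned_until = {}

    def ban(self, ip, user_agent, now=None):
        """Ban one client identity for BAN_SECONDS."""
        now = time.time() if now is None else now
        self._banned_until[(ip, user_agent)] = now + BAN_SECONDS

    def is_blocked(self, ip, user_agent, now=None):
        """True while a ban on this exact pair is still running."""
        now = time.time() if now is None else now
        key = (ip, user_agent)
        until = self._banned_until.get(key)
        if until is None:
            return False
        if now >= until:
            del self._banned_until[key]  # ban has lapsed, forget it
            return False
        return True
```

With this key, a spider behind a NAT gets blocked while a browser at the same address with a different User-Agent string carries on. A spider that changes its User-Agent between requests would not be caught by this alone.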

--
-f
www.blackant.net
Re:Arms Races by J'raxis · 2002-03-12 07:22 · Score: 2, Interesting

What about a bot set to change user-agents on the fly? Just collect the few most-popular UAs from other people's website logs, and use each one at random. Add in a list of open proxies to bounce through and you have a nearly undetectable spider at work. I believe I can do this in about a dozen lines of Perl.

Maybe you could thwart this by seeing if there are traversal patterns coming from all over the place ("GET /a" from 1.2.3.4, "GET /a/a.html" from 6.7.8.9, "GET /b" from 45.56.56.67, and so on), but that seems like a lot of work and could again be defeated by some randomization.

--
Liberty in your lifetime
good for FTP sites by eufaula · 2002-03-12 07:35 · Score: 2, Interesting

on my network we have a http and ftp mirror of the Linux Kernel Archives (ftp3.us.kernel.org/www1.us.kernel.org), OpenBSD, Project Gutenberg, and ProFTPD. we have several distros in both ISOs and loose files, and all told, over 100 GB of data. these damn webbots crawl our site and index it, which takes DAYS, and seriously interferes with our ability to provide a useful http mirror. at one point recently, an altavista webbot was using about 10% of our T-1 and filled up /var with access_logs in a few hours. it's only gotten worse. i started blocking the bots based on their browser match type and it has helped a ton. but the only problem with this method is that i have to go through it every day or 2 to keep current. this module looks to do exactly what i need. it won't be foolproof, but will save me and countless others a ton of grunt work.

if you do happen to visit the site, the stupidspider and dumbbot dir/file are part of my current spider trap. just thought that i'd warn you.
Protect lynx/links/w3m users by using two steps by yerricde · 2002-03-13 07:18 · Score: 2

There are infinite variations on this theme, like the transparent gif you mentioned, which makes it very difficult for evil spiders to avoid them. Just make sure you test with lynx/w3m first.

Just make sure that it takes at least two steps to get from content to the honeypot. This way, it becomes much more difficult to accidentally tab to a link and activate it, shutting off an entire ISP's proxied access to the web server.

--
Will I retire or break 10K?
Over 20 Million Members(tm) on one proxy by yerricde · 2002-03-13 07:31 · Score: 2

That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute

What about over 20 Million Members on one ISP's proxy? A story circulating around several tech news sites (about the high likelihood of AOL 8 using Mozilla's Gecko engine) places AOL's U.S. market share at about 30%. Do you really want to drive away 30% of your audience? What about the billion-plus people behind China's NAT?

--
Will I retire or break 10K?
Not compatible with Windows Apache by yerricde · 2002-03-13 07:45 · Score: 2

10 LET M$ = "Microsoft"
The Robotcop download page states that no binaries are available for versions of Apache HTTP Server designed for M$ Windows, and the binaries that do exist (for Red Hat Linux x86 and FreeBSD x86) aren't very compatible with mod_ssl.

"So compile it yourself!" For one thing, according to the compilation instructions, those who want to compile Robotcop for Windows will have to wait a year (estimated) until Apache 2.0 is no longer beta but released. For another, not everybody can afford a license for M$ Visual Studio, which is required to build Apache HTTP Server; apparently, this popular Win32 version of GCC doesn't cut it.

In other words, Robotcop won't work for consumers who serve web pages from their home workstation that runs Windows.

--
Will I retire or break 10K?