Robotcop: It's the Law
Voivod writes: "Inspired by the recent Slashdot and Evolt discussions about Blocking Bad Spiders, we set out to write an Apache module that solves this problem. The result is Robotcop and it's ready for action. We believe that it's the best solution to protecting Apache webservers from spiders currently available. Install it and help us make life hell for e-mail harvesting software!"
...and they'll just invent a better spider.
How do you prevent users from finding and falling into a trap directory? It seems like it wouldn't be that difficult to write a spider to get around the restrictions imposed by robotcop.
Am I missing something?
The Economics of Website Security
Perhaps not google per se but there are a lot of legit uses for spiders. Uses that are legal and good for your site and the internet in general. I would be concerned that this is going to cause them some undue issues.
What, you don't think we can win an arms race against the degenerates who write email harvesting software? :-)
Right now it provides pretty good protection, especially if the spider needs to get in and out of the site within a set period of time. If you can think of ways to circumvent how Robotcop works, please point them out so we can figure out a solution!
The project says that it feeds malignant spiders poisoned addresses. Don't people check their lists for addresses that don't deliver? Is this useful? I like the teergrube idea better. Can you modify Apache to do this?
I'm a conscientious
Unfortunately, some good robots have been known to ignore robots.txt. Fast has in the past fallen into my test honeypot, and I would hate to accidentally block someone like Google.
*shudders*
no sig.
Spiders that follow the rules, of course, can be detected, so what you need is something more to stop the spiders that don't: those that know how not to get stuck in tarpits, and those that spoof other clients and never read robots.txt. The easiest way I can think of to do this would be to count how many hits a particular IP makes to your server in relation to individual pages. The more unique pages it pulls in a minute, the slower (geometrically) the connection should get. That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute and only see a 100ms slowdown, probably not even noticeable, but a spider pulling 100 pages will see a 1000ms slowdown, and pulling 200 pages will result in a 10000ms slowdown per page. Sure, they can eventually download all the pages, but make it take a week to do it. Combine that with what you already have and it will make for a very unpleasant spidering experience.
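For what it's worth, the throttle described above might look something like the Python sketch below. This is only an illustration, not anything Robotcop does: the class name, the constants, and the exact curve are my own assumptions. The curve is a tenfold increase per 100 unique pages, which hits the 1000ms and 10000ms figures at 100 and 200 pages; at 30-40 pages it gives roughly 200-250ms, a bit more than the 100ms quoted above.

```python
import time
from collections import defaultdict, deque

# All constants are illustrative assumptions, not values from Robotcop.
WINDOW_SECONDS = 60.0      # "unique pages per minute"
BASE_DELAY_MS = 100.0      # delay when almost no pages have been pulled
PAGES_PER_TENFOLD = 100.0  # every 100 unique pages multiplies the delay by 10
MAX_PAGES_COUNTED = 300    # clamp so the delay (and the float) stays bounded


def delay_ms(unique_pages):
    """Geometric delay: 100 pages -> 1000 ms, 200 pages -> 10000 ms."""
    capped = min(unique_pages, MAX_PAGES_COUNTED)
    return BASE_DELAY_MS * 10 ** (capped / PAGES_PER_TENFOLD)


class UniquePageThrottle:
    """Tracks unique pages requested per IP inside a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self._hits = defaultdict(deque)  # ip -> deque of (timestamp, path)

    def record(self, ip, path, now=None):
        """Record one request and return the delay (ms) to impose on it."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        hits.append((now, path))
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0][0] > self.window:
            hits.popleft()
        unique_pages = len({p for _, p in hits})
        return delay_ms(unique_pages)
```

A server would sleep for the returned number of milliseconds before answering. Re-requesting the same page does not raise the count, since only unique paths are tallied.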
"Your superior intellect is no match for our puny weapons!"
They really seem to catch some weird things that I never thought might be wandering around on my website. I recommend lifting the ban on anyone after a while, though, because you can (almost) never be too certain what you've banned.
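Lifting bans after a while, as recommended above, can be as simple as storing an expiry time with each banned address. A rough Python sketch, with a made-up class name and an arbitrary one-hour default (both are assumptions for illustration only):

```python
import time


class ExpiringBanList:
    """Bans that lift themselves after a fixed period (hypothetical helper)."""

    def __init__(self, ban_seconds=3600.0):
        self.ban_seconds = ban_seconds
        self._banned_until = {}  # ip -> time at which the ban lapses

    def ban(self, ip, now=None):
        now = time.monotonic() if now is None else now
        self._banned_until[ip] = now + self.ban_seconds

    def is_banned(self, ip, now=None):
        now = time.monotonic() if now is None else now
        expires = self._banned_until.get(ip)
        if expires is None:
            return False
        if now >= expires:
            # Ban has lapsed: forget it so the table does not grow forever.
            del self._banned_until[ip]
            return False
        return True
```

That way a mistaken ban (a proxy, a legitimate crawler) heals itself without anyone having to notice it.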
Some large search engines have their spiders spread across multiple hosts. Google is one example. What would happen if crawler-01.nastybot.com grabs the robots.txt file, then crawler-02.nastybot.com violates it? I think with all the open proxies out there, spammers would easily adapt to this. Proxy through someone to grab robots.txt, and then through someone else to make use of the file. This would make IP tracking useless; you couldn't even match by subnet (like you could with *.nastybot.com), since the first request could be from 12.34.56.78 and the second from 31.3.3.7.
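To make the matching problem concrete, here is a small Python sketch of the two groupings mentioned above: by subnet and by parent domain. The function names, the /24 prefix, and the "last two labels" rule are my own simplifying assumptions (the domain rule is naive and would mishandle names like example.co.uk).

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix_len=24):
    """True if both addresses fall inside the same /prefix_len network."""
    net_a = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in net_a


def same_parent_domain(host_a, host_b, labels=2):
    """Naive check: compare the last `labels` dot-separated parts."""
    tail_a = host_a.lower().rstrip(".").split(".")[-labels:]
    tail_b = host_b.lower().rstrip(".").split(".")[-labels:]
    return tail_a == tail_b
```

The two crawler hostnames share a parent domain, so they could be treated as one visitor, but the two proxy addresses from the example share nothing, which is exactly the problem.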
Solution?
Liberty in your lifetime
on my network we have an http and ftp mirror of the Linux Kernel Archives (ftp3.us.kernel.org/www1.us.kernel.org), OpenBSD, Project Gutenberg, and ProFTPD. we have several distros in both ISOs and loose files, and all told, over 100GB of data. these damn webbots crawl our site and index it, which takes DAYS, and seriously interferes with our ability to provide a useful http mirror. at one point recently, an altavista webbot was using about 10% of our T-1 and filled up /var with access_logs in a few hours. it's only gotten worse. i started blocking the bots based on their browser match type and it has helped a ton. but the only problem with this method is that i have to go through it every day or two to keep current. this module looks to do exactly what i need. it won't be foolproof, but will save me and countless others a ton of grunt work.
if you do happen to visit the site, the stupidspider and dumbbot dir/file are part of my current spider trap. just thought that i'd warn you.
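The browser-match blocking described above boils down to checking the User-Agent header against a hand-maintained list. A rough Python sketch of the idea follows; the function name and the substrings are hypothetical placeholders, not the poster's actual rules, and a real list would come from your own access logs.

```python
# Hypothetical substrings, for illustration only.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot", "emailharvester")


def is_blocked_agent(user_agent, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Case-insensitive substring match against the User-Agent header."""
    agent = (user_agent or "").lower()
    return any(token in agent for token in blocked)
```

The weakness is the one already noted: the list has to be updated by hand, and a spider that lies about its User-Agent walks straight past it.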
Sometimes I wget -r to archive a site that I like...
What will be the result of those IBM ads featuring guys in those dorky space suits? Is IBM profit too high? Is that why they are running them?
This reminds me of a humorous day at work. I had done a bit of php programming in my spare time on a site where medical students could get sample problems (or entire quizzes) based on what class they were in. About a week after I handed it over to a medical student to maintain, he emailed me that some people had been trying to access php pages that were nonexistent (basically passing variables in the URL that weren't valid database entries). He was worried that somebody was trying to hack his system. He logged some offending IP addresses and sent them to me in an email.
This is where it gets sorta funny
So I headed over to network-tools.com and looked up the IPs. Each one of them resolved to a webcrawler for a search engine. So I emailed him back explaining that it was just the search engines indexing his pages. It took several more emails to convince him that it was harmless.
Now that I look back at this anecdote, it doesn't seem that humorous, but I guess at the time I was pretty amused by the fact that a medical student was panicking thinking that a webcrawler was 'hacking' his system (if you're wondering, btw, these online quizzes have absolutely no weight in the medical school courses -- it's just for practice).
-Sou|cuttr
There are infinite variations on this theme, like the transparent gif you mentioned, which makes it very difficult for evil spiders to avoid them. Just make sure you test with lynx/w3m first.
Just make sure that it takes at least two steps to get from content to the honeypot. This way, it becomes much more difficult to accidentally tab to a link and activate it, shutting off an entire ISP's proxied access to the web server.
Will I retire or break 10K?
That way even the most hardcore human reader (or group of more casual readers behind a NAT) can click on 30-40 links in a minute
What about over 20 Million Members on one ISP's proxy? A story circulating around several tech news sites (about the high likelihood of AOL 8 using Mozilla's Gecko engine) places AOL's U.S. market share at about 30%. Do you really want to drive away 30% of your audience? What about the billion-plus people behind China's NAT?
The Robotcop download page states that no binaries are available for versions of Apache HTTP Server designed for M$ Windows, and the binaries that do exist (for Red Hat Linux x86 and FreeBSD x86) aren't very compatible with mod_ssl.
"So compile it yourself!" For one thing, according to the compilation instructions, those who want to compile Robotcop for Windows will have to wait a year (estimated) until Apache 2.0 is no longer beta but Released. For another, not everybody can afford a license for M$ Visual Studio, which is required to build Apache HTTP Server; apparently, this popular Win32 version of GCC doesn't cut it.
In other words, Robotcop won't work for consumers who serve web pages from their home workstation that runs Windows.
P.S. Don't call people "consumers". Even if they are Windows users, it's not nice. :-)
Then what is the correct term for people who go into Best Buy, buy a PC, and use only the operating system that Microsoft forced the PC vendor to pre-install because the buyer doesn't know better? I used "consumer" to refer to those who use Windows on their home computers not by choice but by ignorance of other options or by lack of drivers for proprietary devices.
Bernard Shifman?
http://www.petemoss.com/spamflames/ShifmanIsAMoronSpammer.html
:-P
The problem has never been about being able to stop illegitimate programs, but rather to ONLY stop illegitimate programs, and not authentic ones as well. Let's name one that could find problems: Google. Okay, they wise up and allow accesses from Google (and some other select few) to go through. Problem 1: Smarter spiders can take advantage of this. Problem 2: Anyone who wants to start up a new and similar service would first have to make sure that it is registered so that it is not blocked out. This could turn into a bureaucratic nightmare, and may restrict competition by disallowing new, small, and innovative contenders.
Please don't cure the illness by killing the patient.
To hide our email addresses or hinder their harvesting is like trying not to be seen by the wrong people when you go to a party. It's useless; every girl can tell you that.
Instead just learn who is right for you and say 'no' to the others.
The effort to install and test that module is wasted, and would be better put into quick and effective spam-blocking techniques, backed up by proper site policies.