Who Isn't Paying Attention to ROBOTS.TXT?
Kickstart asks: "After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this or say what servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
The next question should be, "How do we make them regret their non-compliance?"
[o]_O
Most crawlers will obey. Spambot email harvesters will usually not. Generate a huge page of crap with loads of fake email addresses and put that in your robots.txt as uncrawlable and watch the spammers grab it.
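For anyone wondering what "obey" means in practice, here is a minimal Python sketch of the check a well-behaved crawler makes before fetching a URL. The /trap/ path, the example.com host and the bot name are made up for illustration; the parser is the one in Python's standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /trap/ is the bait page full of fake addresses.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed("PoliteBot", "http://example.com/index.html"))
    print(allowed("PoliteBot", "http://example.com/trap/addresses.html"))
```

Anything that requests the trap page anyway has either skipped this check or read robots.txt as a list of places to look.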
Trolling is a art,
Can you imagine a Beowulf cluster of these?!
1.2.3.Profit!
I *am* a spider, you insensitive clod!
Seriously. If you don't want it to get crawled, don't make it accessible by the outside. If you can't figure out how to do that, you get what you deserve.
How about Stopping Spambots?
RTFA and realize he's not talking about loss of "sensitive" data, but rather the DoS effect of extra traffic from rude robots.
All spiders are going to ignore your ROBOTS.TXT file. Instead, they look for a file called robots.txt.
If you don't want it on the public Internet, then don't link it into the public Internet.
Here's what would seem to work:
1. Create robots.txt, including references to the spam spider trap. Make sure that the legitimate references to normal pages are outnumbered by a large margin.
2. When pages that could only be referenced in the spam spider trap are accessed, note the IP address.
3. Slowly respond or block connections from the originating IP address.
Bad guys are punished. Good guys are not. Low impact on system resources.
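A rough Python sketch of steps 2 and 3, assuming the trap lives under a made-up /trap/ prefix and the server writes Common Log Format access logs. The names and the delay numbers are illustrative, not taken from any existing filter.

```python
import re
from collections import Counter

# Assumed trap prefix: listed under Disallow in robots.txt and linked nowhere else.
TRAP_PREFIX = "/trap/"

# Common Log Format: client IP first, request line in the first quoted field.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*"')

def trap_hits(log_lines, trap_prefix=TRAP_PREFIX):
    """Step 2: count requests per client IP that touched the trap path."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(3).startswith(trap_prefix):
            hits[match.group(1)] += 1
    return hits

def delay_for(ip, hits, step=2.0, cap=30.0):
    """Step 3: seconds to stall a response; zero for IPs that never hit the trap."""
    return min(cap, step * hits.get(ip, 0))
```

The delay grows with each trap hit and is capped, so a repeat offender gets slower and slower service while everyone else is untouched. Feeding the same IP list to a firewall rule instead would be the "block" variant.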
There's got to be a dozen filters out there that already do this. Anyone have experience using one?
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules.
No, you rather see some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this highly-documented robots.txt file. I'm personally most offended by idiots running this shit, since there is no single IP block to blacklist.
After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider
looksmart? they do that a lot around here. /me blocks them at the FW
There was a bunch of fsckwits called dir.com who had a real nasty spider crawling all over the place a few months ago. It blatantly ignored robots.txt, tried dictionary attacks to detect unlinked parts of the website, and may have been trying exploits to crack systems to discover secrets normally protected by passwords or logins. Honeypot email addresses fed to the spider would be spammed within days.
After too many complaints from clients about this nasty behaviour, a number of carriers started blackholing the prefixes of bad spiders at the border routers. Nice simple solution, and then you don't even see the spider traffic. Last I looked, about 20 major ISPs were blackholing prefixes of the worst spider/bot offenders.
Nobody would dare to blackhole google, but there are hundreds of google wannabes and a few of them are unethical enough to get blocked. And then they wonder why they can't see 75% of the internet.
the AC
Hemos is like... sci-fi fans; he thinks technology is cool, but he hasn't bothered to understand the science it's based on
The robots.txt file is a compatibility patch for T-* units. This filter enables them to more efficiently search for John Connor, or provide instructions to other T-* units. Spiders, arachnids, humans and other non-authorized users are not allowed to view the true values encoded in the file.
Just block the bot from your site, or write some simple PHP to restrict it from querying the pages you want, and the frequency....
I'd just block the "bad-bots" though, if they don't listen to you, don't give them contact.
Or, contact the owner of the domain and get mad at them for spidering without following proper spider rules. They are wasting *your* resources in exchange for *their* profit. Get mad, get even!
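The "restrict the frequency" part could look something like this, sketched in Python rather than PHP. The class name, the limits and the idea of keying on client IP are all assumptions for illustration.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most max_requests per window seconds for each client key."""

    def __init__(self, max_requests=10, window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.history = defaultdict(deque)  # client key -> recent request times

    def allow(self, client):
        now = self.clock()
        recent = self.history[client]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

A page handler would call allow(client_ip) first and answer with an error status when it returns False. Note that this only helps against a bot that keeps one address; it does nothing about a crawler spread over many IPs.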
Ah yes, the "write some simple PHP" solution!
http://www.fleiner.com/bots/
I found this site through some slashdotter website long back... I've forgotten where and when, but it lends itself nicely to the topic...
Also good is the way arxiv.org fights back.
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Oh, yeah, and to actually answer the OP's question, there are lists of known bad bots out there...
WPoison is a Perl script, as source (naturally).
WPoison is actually better from a technical standpoint, as it's a random page each time, not just a block of pages you download.
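To show what "a random page each time" means, here is a small Python sketch of the general idea. This is not WPoison's code (that is Perl, and I have not copied its logic); the function names, the /trap/ base path and the counts are invented, and the addresses use the reserved .invalid TLD so no real mailbox can be hit.

```python
import random
import string

def random_word(rng, low=4, high=9):
    """Return a random lowercase word of between low and high letters."""
    length = rng.randint(low, high)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def poison_page(rng=None, addresses=20, links=5, base="/trap/"):
    """Build an HTML page of made-up addresses plus links to more such pages."""
    if rng is None:
        rng = random.Random()
    lines = ["<html><body>"]
    for _ in range(addresses):
        addr = f"{random_word(rng)}@{random_word(rng)}.invalid"
        lines.append(f'<a href="mailto:{addr}">{addr}</a><br>')
    for _ in range(links):
        # Each link leads back into the generator, so the harvester never runs out.
        lines.append(f'<a href="{base}{random_word(rng)}.html">{random_word(rng)}</a><br>')
    lines.append("</body></html>")
    return "\n".join(lines)
```

Because every request yields fresh addresses and fresh links, a harvester cannot recognise the trap by downloading it once and comparing, which is the advantage over a fixed block of pages.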
I am, and always will be, an idiot. Karma: Coma (mostly effected by
i once had to deliver a solution for that problem to a friend. i made him a php script that detects the content directory and generates a javascript-website which links into the content directory with an encrypted javascript-link which cannot be used by spiders. the content directory is being renamed to some random name every hour. the error404 leads people to the entry-page, in case they surf the content dir while it is being renamed.
The whitehouse seems to take a "pre-emptive" approach. Just in case they ever put stuff on the internet that they might someday not want you to see (or that they might not want archived by google), they seem to cover all the bases in their 92KB robots.txt file.
My personal favorites:
Disallow: /911/iraq
Disallow: /911/patriotism/iraq
Disallow: /911/patriotism2/iraq
Disallow: /911/sept112002/iraq
[sic.]
There's a theme here. Can you spot it? I'd like to think it's intentional, but at 2255 lines, it may just be that all permutations of Republican buzzwords have been covered.
Not that I post on slashdot or anything.
There's nothing stopping someone coding a bot that groks script. What about people browsing without scripting?
That's right, you're a moron!
Alternately, use it to your advantage. Have a page of text that is nothing other than porn-related words, and have Apache return that when the bot comes looking. You're guaranteed to get a lot more visitors!
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
What is with requests for http://xxx.slashdot.org/ok.txt coming through on my webserver as if someone (Slashdot if you trace the IP) is trying to use it as a proxy?
...(continues, of course)
66.35.250.150 - - [29/Jan/2005:09:50:54 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [31/Jan/2005:23:24:04 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [04/Feb/2005:23:21:43 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [08/Feb/2005:21:55:18 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [11/Feb/2005:20:27:09 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [21/Feb/2005:20:02:05 -0500] "GET http://games.slashdot.org/ok.txt HTTP/1.0" 404 653 "-"
66.35.250.150 - - [02/Mar/2005:20:56:12 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 651 "-"
66.35.250.150 - - [08/Mar/2005:20:37:50 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 648 "-"
66.35.250.150 - - [12/Mar/2005:09:43:37 -0500] "GET http://yro.slashdot.org/ok.txt HTTP/1.0" 404 652 "-"
I know the article is about bad spiders, but why is slashdot doing this?
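If you want to see how often this happens, the giveaway in the lines above is a full URL in the request line where a normal request carries only a path. A minimal Python sketch that pulls such entries out of an access log, assuming the same log format as the excerpt (the function name is made up):

```python
import re
from collections import Counter

# Request line whose target is an absolute URL, as a proxy client would send.
PROXY_REQUEST = re.compile(r'^(\S+) .*?"[A-Z]+ (https?://[^/\s"]+)\S* HTTP/[\d.]+"')

def proxy_style_requests(log_lines):
    """Count (client IP, requested host) pairs for absolute-URL requests."""
    counts = Counter()
    for line in log_lines:
        match = PROXY_REQUEST.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Running it over the whole log shows which clients send these requests and which hosts they ask for.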