Who Isn't Paying Attention to ROBOTS.TXT?
Kickstart asks: "After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this, nor do they say what servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
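For anyone wanting to do the same log-wading less painfully, here is a minimal Python sketch of the check. It assumes you have already pulled (IP, user agent, path) out of your Apache logs; the robots.txt rules, bot names, and addresses below are all made up for illustration. It uses the standard library's `urllib.robotparser` to flag requests that a compliant spider would not have made.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

# Hypothetical requests already extracted from an access log:
# (client IP, user agent, requested path).
SAMPLE_REQUESTS = [
    ("192.0.2.10", "GoodBot/1.0", "/index.html"),
    ("192.0.2.66", "BadBot/0.1", "/private/secrets.html"),
    ("192.0.2.66", "BadBot/0.1", "/cgi-bin/form.pl"),
]


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def find_violations(parser, requests):
    """Return the requests that the robots.txt rules do not allow."""
    return [
        (ip, agent, path)
        for ip, agent, path in requests
        if not parser.can_fetch(agent, path)
    ]


if __name__ == "__main__":
    for violation in find_violations(build_parser(ROBOTS_TXT), SAMPLE_REQUESTS):
        print(violation)
```

This only tells you who fetched a disallowed path. A spider that never requested robots.txt in the first place needs a separate check.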
The next question should be, "How do we make them regret their non-compliance?"
[o]_O
How about Stopping Spambots?
There was a bunch of fsckwits called dir.com who had a real nasty spider crawling all over the place a few months ago. It blatantly ignored robots.txt, tried dictionary attacks to detect unlinked parts of the website, and may have been trying exploits to crack systems to discover secrets normally protected by passwords or logins. Honeypot email addresses fed to the spider would be spammed within days.
After too many complaints from clients about this nasty behaviour, a number of carriers started blackholing the prefixes of bad spiders at the border routers. Nice simple solution, and then you don't even see the spider traffic. Last I looked, about 20 major ISPs were blackholing prefixes of the worst spider/bot offenders.
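Blackholing itself happens in router configuration, which is not shown here, but the underlying test is just prefix matching. A rough Python sketch of that check with the standard `ipaddress` module, assuming a hypothetical blocklist (the prefixes are reserved documentation ranges, not real offenders):

```python
import ipaddress

# Hypothetical blocklist using reserved documentation prefixes.
BLACKHOLED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_blackholed(addr):
    """Return True if addr falls inside any blocked prefix."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in BLACKHOLED_PREFIXES)


if __name__ == "__main__":
    for candidate in ("192.0.2.77", "198.51.100.200", "203.0.113.5"):
        print(candidate, is_blackholed(candidate))
```

The same function could sit in front of a web application as a last line of defence, though dropping the traffic at the border is what keeps it out of your logs entirely.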
Nobody would dare to blackhole Google, but there are hundreds of Google wannabes and a few of them are unethical enough to get blocked. And then they wonder why they can't see 75% of the internet.
the AC
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
I always liked the way that arxiv.org dealt with this matter. It clearly says that it will initiate a seek and destroy against your site, if you visit a certain link.
:-)
If you do go there, it initiates a countdown.... I've never stuck around long enough to see what happens when the countdown finishes... I like my internet connection just a little too much for that...
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
It is better from a technical standpoint, but it could be worse from a practical one. Especially if WPoison-generated pages can be automatically detected.
this post contain no useful information, no need to mod it down
I hadn't considered that until this morning, but you can add to the source to do things like randomize meta tags, include text from other pages at random, etc., to make it less likely that a pattern will be detected.
If you're *really* serious about non-detection, then you should vary the amount of poison in the pages, so that some will be merely annoying or almost innocent, with links that are completely lethal.
If I was a perl hacker (instead of merely playing a sysadmin at work), I'd write this idea out, so if anyone here wants to have a go, post a link.
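A rough go at the varying-dose idea, in Python rather than Perl. This is only a sketch and not how WPoison itself works: the dose levels, the page template, and the `example.invalid` addresses are all assumptions made up for illustration.

```python
import random
import string

# Hypothetical dose levels: number of fake addresses per page.
# 0 and 1 give nearly innocent pages, 25 gives a saturated one.
DOSES = (0, 1, 5, 25)


def fake_address(rng):
    """Build a random address in the reserved .invalid domain."""
    length = rng.randint(5, 10)
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{user}@example.invalid"


def poison_page(rng):
    """Return (dose, html) for a page carrying a randomly chosen dose."""
    dose = rng.choice(DOSES)
    addresses = [fake_address(rng) for _ in range(dose)]
    links = [f'<a href="mailto:{a}">{a}</a>' for a in addresses]
    body = "<p>Nothing to see here.</p>\n" + "\n".join(links)
    return dose, f"<html><body>\n{body}\n</body></html>"


if __name__ == "__main__":
    rng = random.Random()
    dose, html = poison_page(rng)
    print(dose)
    print(html)
```

Randomizing the filler text and meta tags as well would matter more for avoiding detection than the dose alone; that part is left out here.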
I am, and always will be, an idiot. Karma: Coma (mostly effected by
The whitehouse seems to take a "pre-emptive" approach. Just in case they ever put stuff on the internet that they might someday not want you to see (or that they might not want archived by google), they seem to cover all the bases in their 92KB robots.txt file.
My personal favorites:
Disallow: /911/iraq
Disallow: /911/patriotism/iraq
Disallow: /911/patriotism2/iraq
Disallow: /911/sept112002/iraq
[sic.]
There's a theme here. Can you spot it? I'd like to think it's intentional, but at 2255 lines, it may just be that all permutations of Republican buzzwords have been covered.
Not that I post on slashdot or anything.
What is with requests for http://xxx.slashdot.org/ok.txt coming through on my webserver as if someone (Slashdot if you trace the IP) is trying to use it as a proxy?
...(continues, of course)
66.35.250.150 - - [29/Jan/2005:09:50:54 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [31/Jan/2005:23:24:04 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [04/Feb/2005:23:21:43 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [08/Feb/2005:21:55:18 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [11/Feb/2005:20:27:09 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [21/Feb/2005:20:02:05 -0500] "GET http://games.slashdot.org/ok.txt HTTP/1.0" 404 653 "-"
66.35.250.150 - - [02/Mar/2005:20:56:12 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 651 "-"
66.35.250.150 - - [08/Mar/2005:20:37:50 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 648 "-"
66.35.250.150 - - [12/Mar/2005:09:43:37 -0500] "GET http://yro.slashdot.org/ok.txt HTTP/1.0" 404 652 "-"
I know the article is about bad spiders, but why is slashdot doing this?
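This doesn't answer the why, but if you want to tally such requests in your own logs, the telltale sign is a full URL in the request line where a normal request has only a path. A minimal Python sketch, assuming log lines shaped like the excerpt above; the regular expression and function names are my own, not anything Apache provides.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the leading fields of a log line shaped like the excerpt above:
# client IP, two ignored fields, [timestamp], "METHOD TARGET PROTOCOL", status.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) ')

SAMPLE_LINES = [
    '66.35.250.150 - - [29/Jan/2005:09:50:54 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"',
    '66.35.250.150 - - [31/Jan/2005:23:24:04 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"',
]


def proxy_probe(line):
    """Return (ip, requested_host, status) for proxy-style requests, else None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    ip, _method, target, status = match.groups()
    # An ordinary request carries only a path; a proxy request carries a full URL.
    if not target.lower().startswith(("http://", "https://")):
        return None
    return ip, urlsplit(target).hostname, int(status)


def tally_probes(lines):
    """Count proxy-style requests per (client IP, requested host)."""
    counts = Counter()
    for line in lines:
        probe = proxy_probe(line)
        if probe is not None:
            ip, host, _status = probe
            counts[(ip, host)] += 1
    return counts


if __name__ == "__main__":
    for key, count in tally_probes(SAMPLE_LINES).items():
        print(key, count)
```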