Who Isn't Paying Attention to ROBOTS.TXT?

Posted by Cliff on Thursday June 9, 2005 @10:10AM from the bad-spider-no-donut dept.

Kickstart asks: "After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this and do not say which servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"

zerg by Lord+Omlette · 2005-06-09 10:12 · Score: 3, Interesting

The next question should be, "How do we make them regret their non-compliance?"

--
[o]_O
1. Re:zerg by Intron · 2005-06-09 10:16 · Score: 3, Informative
  
  zerg?
  
  --
  Intron: the portion of DNA which expresses nothing useful.
2. Re:zerg by Eric+Giguere · 2005-06-09 10:21 · Score: 4, Interesting
  
  Start returning 500 errors... Or 302s that redirect them back to themselves...
  Eric
  PS: Is there some kind of bot storm going on, I'm getting all kinds of weird accesses to my site today, they're all fetching just the home page and leaving, and the referrer tag is null for everyone... They may be committing click fraud through my site, which makes me mad...
3. Re:zerg by dasunt · 2005-06-09 10:35 · Score: 2, Informative
  
  The next question should be, "How do we make them regret their non-compliance?"
  
  robots.txt:
  
  User-agent: *
  Disallow: /the-site-that-never-ends/
  
  It's trivial to write a script that links back to itself to generate millions of bogus pages. If you include URL rewriting, it won't even appear to be a script.
  
  The only downside is that while you are wasting their CPU and bandwidth, you are also wasting your own resources. If your CPU is mostly idle, then it's mostly a waste of bandwidth.
4. Re:zerg by BrynM · 2005-06-09 12:44 · Score: 3, Informative
  
  From the WebPoison site:
  "WebPoison.org is an open source project..." And then, at the bottom of the page: "*Technically speaking, webpoison.org is not 'open source' because the source code may never be made public - doing so would undermine the project's central goal."
  Sorry, but it rubs me wrong when a project claims to be OSS on the first line of their about page only to tell me they lied in the fine print at the bottom. They may be doing a good thing, but they should be blunt and honest about it.
  
  --
  US Democracy:The best person for the job (among These pre-selected choices...)
5. Re:zerg by Nagus · 2005-06-09 19:00 · Score: 2, Interesting
  
  The next question should be, "How do we make them regret their non-compliance?"
  
  Tarpit them! Bonus points if you feed them bogus data at the same time.
  
  Tarpitting unwelcome spiders not only limits the damage (in terms of bandwidth) they can do to you, but also the damage they can do to everyone else.
  
  Software for this is available, for example Peachpit.
  
  --
  Wenn ist das Nunstruck git und Slotermeyer? Ja!... Beiherhund das Oder die Flipperwaldt gersput!
6. Re:zerg by Avian+visitor · 2005-06-09 23:41 · Score: 2, Informative
  
  PS: Is there some kind of bot storm going on, I'm getting all kinds of weird accesses to my site today, they're all fetching just the home page and leaving, and the referrer tag is null for everyone... They may be committing click fraud through my site, which makes me mad...
  
  See this discussion on SecurityFocus
  
  http://www.securityfocus.com/archive/75/401729/30/0/threaded
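
For anyone wanting to try dasunt's never-ending site together with Nagus's tarpit, here is a minimal sketch using Python's standard http.server (ThreadingHTTPServer needs Python 3.7 or later). The /the-site-that-never-ends/ prefix is taken from the robots.txt above; DELAY_SECONDS, LINKS_PER_PAGE, trap_page and port 8080 are assumptions made up for illustration, not anything the posters specified.

```python
import hashlib
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

TRAP_PREFIX = "/the-site-that-never-ends/"  # the path disallowed in robots.txt above
DELAY_SECONDS = 5    # hypothetical tarpit delay per request
LINKS_PER_PAGE = 5   # hypothetical number of bogus links per page


def trap_page(path):
    """Return HTML whose links all point back into the trap.

    Link names come from a hash of the requested path, so different
    paths get different links and the server keeps no state.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    links = []
    for i in range(LINKS_PER_PAGE):
        name = digest[i * 8:(i + 1) * 8]
        links.append('<a href="%s%s/">%s</a>' % (TRAP_PREFIX, name, name))
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"


class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith(TRAP_PREFIX):
            self.send_error(404)
            return
        time.sleep(DELAY_SECONDS)  # tarpit: make every bogus page slow to fetch
        body = trap_page(self.path).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Local test only; the address and port are arbitrary choices.
    ThreadingHTTPServer(("127.0.0.1", 8080), TrapHandler).serve_forever()
```

Only clients that ignore the Disallow rule should ever end up in here. As dasunt points out, each page served still costs you bandwidth, and each sleeping request holds a thread open, so the delay and the number of links would need tuning.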
Spammers are bad (of course) by grub · 2005-06-09 10:14 · Score: 4, Insightful

Does anyone have information on who follows this standard and who doesn't?
Most crawlers will obey. Spambot email harvesters will usually not. Generate a huge page of crap with loads of fake email addresses and put that in your robots.txt as uncrawlable and watch the spammers grab it.

--
Trolling is a art,
1. Re:Spammers are bad (of course) by Dancing+Primate · 2005-06-09 11:13 · Score: 2, Funny
  
  You mean
  User-agent: *
  Disallow: /
  yes?
2. Re:Spammers are bad (of course) by timothv · 2005-06-09 12:22 · Score: 2, Funny
  
  HAHAHA! Apparently not. See his own robots.txt
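
grub's page of fake addresses could be generated along these lines. This is only a sketch: the function names are invented here, and the addresses use the example.com, example.net and example.org domains that RFC 2606 reserves for documentation, an assumption chosen so that no real mailbox is put on the page.

```python
import random
import string

# Domains reserved for documentation by RFC 2606 (an assumption of this sketch).
RESERVED_DOMAINS = ["example.com", "example.net", "example.org"]


def fake_addresses(count, seed=None):
    """Return `count` random-looking addresses at reserved domains."""
    rng = random.Random(seed)
    addresses = []
    for _ in range(count):
        length = rng.randint(5, 10)
        user = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        addresses.append("%s@%s" % (user, rng.choice(RESERVED_DOMAINS)))
    return addresses


def honeypot_page(count, seed=None):
    """Wrap the fake addresses in mailto links, one per line."""
    links = ['<a href="mailto:%s">%s</a><br>' % (a, a)
             for a in fake_addresses(count, seed)]
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"


if __name__ == "__main__":
    print(honeypot_page(1000))
```

The output would be served from a path that robots.txt disallows, so that any client fetching it has ignored the rules.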
Making them Pay by Kelson · 2005-06-09 10:30 · Score: 3, Interesting

How about Stopping Spambots?
I've got a better idea by Kelson · 2005-06-09 10:33 · Score: 2, Insightful

RTFA and realize he's not talking about loss of "sensitive" data, but rather the DOS effect of extra traffic from rude robots.
Re:Hey I've got an idea by etymxris · 2005-06-09 10:39 · Score: 2, Insightful

There are good reasons for robots.txt. I use it to keep crawlers from hitting "spam" forums on my website which is where all solicitations go. That way no google (or other search engine) rank is gained by spamming the site.

I could just delete it all. But I'm trying to avoid deleting any posts.
Here is your problem: by Neil+Blender · 2005-06-09 10:45 · Score: 5, Funny

All spiders are going to ignore your ROBOTS.TXT file. Instead, they look for a file called robots.txt.
1. Re:Here is your problem: by AndroidCat · 2005-06-09 22:30 · Score: 2, Funny
  
  What a lot of sites need is a slashdot.txt file.
  
  --
  One line blog. I hear that they're called Twitters now.
Big name != "real" by droleary · 2005-06-09 11:59 · Score: 4, Informative

I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules.

No, you rather see some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this highly-documented robots.txt file. I'm personally most offended by idiots running this shit, since there is no single IP block to blacklist.
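
One way to check for yourself which clients ignore the rules is to replay access-log entries against your own robots.txt. The sketch below assumes Apache's "combined" log format and uses Python's urllib.robotparser; LOG_RE and find_violations are names made up for this example.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def find_violations(robots_lines, log_lines):
    """Count requests per (host, user-agent) for paths robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    hits = Counter()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # not in the expected format; skip it
        if not parser.can_fetch(m.group("agent"), m.group("path")):
            hits[(m.group("host"), m.group("agent"))] += 1
    return hits


if __name__ == "__main__":
    # File names are placeholders for your own robots.txt and access log.
    with open("robots.txt") as robots, open("access.log") as log:
        counts = find_violations(robots.read().splitlines(), log)
    for (host, agent), n in counts.most_common(20):
        print(n, host, agent)
```

A browser driven by a person will also show up here if someone follows a link into a disallowed area, so the counts are a starting point rather than proof. robotparser also does its own user-agent matching, which may not line up with every crawler's full string.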
Re:Hey I've got an idea by jbplou · 2005-06-09 12:11 · Score: 4, Insightful

Well, you've got a poor app if a spider can run right through it without authenticating and insert, update, or delete your data.
Blackhole them at the border routers by anticypher · 2005-06-09 13:21 · Score: 2, Interesting

There was a bunch of fsckwits called dir.com who had a real nasty spider crawling all over the place a few months ago. It blatantly ignored robots.txt, tried dictionary attacks to detect unlinked parts of the website, and may have been trying exploits to crack systems to discover secrets normally protected by passwords or logins. Honeypot email addresses fed to the spider would be spammed within days.

After too many complaints from clients about this nasty behaviour, a number of carriers started blackholing the prefixes of bad spiders at the border routers. Nice simple solution, and then you don't even see the spider traffic. Last I looked, about 20 major ISPs were blackholing prefixes of the worst spider/bot offenders.

Nobody would dare to blackhole Google, but there are hundreds of Google wannabes and a few of them are unethical enough to get blocked. And then they wonder why they can't see 75% of the internet.

the AC

--
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
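
Blackholing as described above happens in router configuration, which is not shown here. For a single site with no access to border routers, a similar prefix check can be sketched at the application level with Python's ipaddress module. The prefixes below are documentation ranges (RFC 5737) standing in for real offenders, and BLOCKED_PREFIXES and is_blackholed are hypothetical names.

```python
import ipaddress

# Placeholder blocklist: documentation prefixes, not real spider networks.
BLOCKED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blackholed(address):
    """Return True if the client address falls inside any blocked prefix."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_PREFIXES)


if __name__ == "__main__":
    for client in ("192.0.2.77", "203.0.113.5"):
        print(client, "blocked" if is_blackholed(client) else "allowed")
```

Unlike a blackhole at the border, a check like this still lets the traffic reach your server; it only decides whether to answer.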
Re:Ideal solution... by stoborrobots · 2005-06-09 20:12 · Score: 2, Interesting

I always liked the way that arxiv.org dealt with this matter. It clearly says that it will initiate a seek and destroy against your site, if you visit a certain link.

If you do go there, it initiates a countdown.... I've never stuck around long enough to see what happens when the countdown finishes... I like my internet connection just a little too much for that... :-)

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
How to keep bad robots away by stoborrobots · 2005-06-09 20:15 · Score: 2, Informative

http://www.fleiner.com/bots/

I found this site through some slashdotter website long back... I've forgotten where and when, but it lends itself nicely to the topic...

Also good is the way arxiv.org fights back.

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Known Bad Bots by stoborrobots · 2005-06-09 20:34 · Score: 3, Informative

Oh, yeah, and to actually answer the OP's question, there are lists of known bad bots out there...

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Got Zerg Source? by Kalak · 2005-06-10 00:24 · Score: 2, Informative

WPoison is a Perl script, as source (naturally).

WPoison is actually better from a technical standpoint, as it's a random page each time, not just a block of pages you download.

--
I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
1. Re:Got Zerg Source? by paulatz · 2005-06-10 00:28 · Score: 2, Interesting
  
  It is better from a technical standpoint, but it could be worse from a practical one, especially if WPoison-generated pages can be automatically detected.
  
  --
  this post contain no useful information, no need to mod it down
2. Re:Got Zerg Source? by Kalak · 2005-06-10 01:10 · Score: 2, Interesting
  
  I hadn't considered that until this morning, but you can add to the source to do things like randomize meta tags, include text from other pages at random, etc. to make it less likely to detect a pattern.
  
  If you're *really* serious about non-detection, then you should vary the amount of poison in the pages, so that some will be merely annoying or almost innocent, with links that are completely lethal.
  
  If I was a perl hacker (instead of merely playing a sysadmin at work), I'd write this idea out, so if anyone here wants to have a go, post a link.
  
  --
  I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
whitehouse.gov/robots.txt by CommandoB · 2005-06-10 07:26 · Score: 5, Interesting

The whitehouse seems to take a "pre-emptive" approach. Just in case they ever put stuff on the internet that they might someday not want you to see (or that they might not want archived by google), they seem to cover all the bases in their 92KB robots.txt file.

My personal favorites:
Disallow: /911/iraq
Disallow: /911/patriotism/iraq
Disallow: /911/patriotism2/iraq
Disallow: /911/sept112002/iraq
[sic.]

There's a theme here. Can you spot it? I'd like to think it's intentional, but at 2255 lines, it may just be that all permutations of Republican buzzwords have been covered.

--
Not that I post on slashdot or anything.
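
Spotting the theme by eye across 2255 lines is tedious. Here is a rough sketch that tallies the last path segment of every Disallow rule. SAMPLE holds only the four rules quoted above, not the real file, and last_segments is a name made up for this example.

```python
from collections import Counter

# Only the four rules quoted in the comment above.
SAMPLE = """\
Disallow: /911/iraq
Disallow: /911/patriotism/iraq
Disallow: /911/patriotism2/iraq
Disallow: /911/sept112002/iraq
"""


def last_segments(robots_text):
    """Count how often each final path segment appears in Disallow rules."""
    counts = Counter()
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip().rstrip("/")
        if path:
            counts[path.rsplit("/", 1)[-1]] += 1
    return counts


if __name__ == "__main__":
    for segment, n in last_segments(SAMPLE).most_common(5):
        print(n, segment)
```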
On a similar note... by Transcendent · 2005-06-11 16:32 · Score: 2, Interesting

What is with requests for http://xxx.slashdot.org/ok.txt coming through on my webserver as if someone (Slashdot if you trace the IP) is trying to use it as a proxy?

66.35.250.150 - - [29/Jan/2005:09:50:54 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [31/Jan/2005:23:24:04 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [04/Feb/2005:23:21:43 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [08/Feb/2005:21:55:18 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [11/Feb/2005:20:27:09 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [21/Feb/2005:20:02:05 -0500] "GET http://games.slashdot.org/ok.txt HTTP/1.0" 404 653 "-"
66.35.250.150 - - [02/Mar/2005:20:56:12 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 651 "-"
66.35.250.150 - - [08/Mar/2005:20:37:50 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 648 "-"
66.35.250.150 - - [12/Mar/2005:09:43:37 -0500] "GET http://yro.slashdot.org/ok.txt HTTP/1.0" 404 652 "-"
...(continues, of course)

I know the article is about bad spiders, but why is slashdot doing this?
1. Re:On a similar note... by afidel · 2005-06-11 20:26 · Score: 3, Interesting
  
  I asked Rob and he said they check for DDoS's whenever someone tries to post anonymously from an address. I told him it was busted because no one posted anonymously from my IP, and furthermore it's bad netiquette to port scan someone just because they accessed your site. Don't think he cares.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
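
The entries Transcendent quotes differ from ordinary hits in one visible way: the request line carries a full http:// URL rather than a path, which is how a client addresses a proxy. Below is a minimal sketch for pulling such entries out of a log, assuming the Apache log layout shown in the comment; REQUEST_RE and proxy_probe are names invented here.

```python
import re
from urllib.parse import urlsplit

# host ident user [time] "METHOD target ...
REQUEST_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<target>\S+)'
)


def proxy_probe(line):
    """Return (client, requested host) for proxy-style requests, else None."""
    m = REQUEST_RE.match(line)
    if m is None:
        return None
    parts = urlsplit(m.group("target"))
    if parts.scheme in ("http", "https") and parts.netloc:
        return m.group("host"), parts.netloc
    return None


if __name__ == "__main__":
    # "access.log" is a placeholder for your own log file.
    with open("access.log") as log:
        for entry in log:
            found = proxy_probe(entry)
            if found:
                print("%s asked for %s" % found)
```

This only identifies the requests; it says nothing about why the client sent them.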