Slashdot Mirror


Who Isn't Paying Attention to ROBOTS.TXT?

Kickstart asks: "After being hit hard for three hours by a very unfriendly spider, I waded through the Apache logs, and I see that there appear to be real, legitimate search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this and do not say what servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
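One way to approach the question from your own logs is to replay each logged request against your robots.txt and count the clients that fetched disallowed paths. The following is a minimal sketch, not a definitive tool. It assumes Apache's combined log format; the robots.txt rule, the sample log lines, and the names `LOG_LINE` and `violations` are hypothetical and made up for illustration. It uses Python's standard `urllib.robotparser`.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumption: Apache "combined" log format (IP, identity, user, time,
# request line, status, size, referrer, user agent).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def violations(robots_txt, log_lines):
    """Count requests per (ip, agent) that robots.txt disallows for that agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if not parser.can_fetch(match.group("agent"), match.group("path")):
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts

# Hypothetical rule and log lines, for illustration only.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2005:00:00:01 +0000] "GET /private/a.html HTTP/1.0" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [01/Jan/2005:00:00:02 +0000] "GET /index.html HTTP/1.0" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.20 - - [01/Jan/2005:00:00:03 +0000] "GET /index.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    for (ip, agent), hits in violations(ROBOTS, SAMPLE).items():
        print(ip, agent, hits)
```

A human following a link into a disallowed path is counted too, so treat the result as a list of candidates to inspect rather than proof that a client is a misbehaving spider.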

12 of 85 comments

  1. zerg by Lord+Omlette · · Score: 3, Interesting

    The next question should be, "How do we make them regret their non-compliance?"

    --
    [o]_O
    1. Re:zerg by Intron · · Score: 3, Informative
      --
      Intron: the portion of DNA which expresses nothing useful.
    2. Re:zerg by Eric+Giguere · · Score: 4, Interesting

      Start returning 500 errors... Or 302s that redirect them back to themselves...

      Eric
      PS: Is there some kind of bot storm going on? I'm getting all kinds of weird accesses to my site today. They all fetch just the home page and leave, and the referrer is null for every one of them. They may be committing click fraud through my site, which makes me mad...
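The "start returning 500 errors" idea above could be sketched as a small WSGI wrapper. This is an illustration under assumptions, not a recommended configuration: the blocklist entry `BadBot`, the wrapper name `block_bad_bots`, and the toy `site` application are all hypothetical, and matching on the User-Agent header only catches bots that identify themselves honestly.

```python
# Hypothetical substrings identifying spiders you have decided to refuse.
BLOCKED_AGENTS = ("BadBot",)

def block_bad_bots(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app so that blocked user agents always get a 500."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name.lower() in agent for name in blocked):
            start_response(
                "500 Internal Server Error",
                [("Content-Type", "text/plain"), ("Content-Length", "0")],
            )
            return [b""]
        # Everyone else reaches the real application unchanged.
        return app(environ, start_response)
    return middleware

def site(environ, start_response):
    """Toy application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

application = block_bad_bots(site)
```

To try it locally, `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` from the standard library serves the wrapped app.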
    3. Re:zerg by BrynM · · Score: 3, Informative
      From the WebPoison site:
      "WebPoison.org is an open source project..." (and at the bottom of the page) "*Technically speaking, webpoison.org is not 'open source' because the source code may never be made public; doing so would undermine the project's central goal."
      Sorry, but it rubs me the wrong way when a project claims to be OSS on the first line of its about page, only to tell me in the fine print at the bottom that it lied. They may be doing a good thing, but they should be blunt and honest about it.
      --
      US Democracy:The best person for the job (among These pre-selected choices...)
  2. Spammers are bad (of course) by grub · · Score: 4, Insightful
    Does anyone have information on who follows this standard and who doesn't?

    Most crawlers will obey. Spambot email harvesters will usually not. Generate a huge page of crap with loads of fake email addresses and put that in your robots.txt as uncrawlable and watch the spammers grab it.

    --
    Trolling is a art,
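The trap described in the comment above could be sketched as follows: generate a page full of fake addresses and list its path as disallowed in robots.txt, so that only clients ignoring the file ever fetch it. This is a rough sketch under assumptions: the path `/trap/addresses.html` and the function names are hypothetical, and the addresses use the reserved `.invalid` top-level domain so they can never reach a real mailbox.

```python
import random
import string

# Hypothetical location of the trap page on your site.
TRAP_PATH = "/trap/addresses.html"

# The rule that tells well-behaved crawlers to stay away from the trap.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"

def fake_addresses(count, seed=0):
    """Return `count` random addresses under the reserved .invalid TLD."""
    rng = random.Random(seed)
    addresses = []
    for _ in range(count):
        user = "".join(rng.choices(string.ascii_lowercase, k=8))
        addresses.append(f"{user}@example.invalid")
    return addresses

def trap_page(count=1000):
    """Build an HTML page of mailto links for harvesters to collect."""
    links = "\n".join(
        f'<a href="mailto:{addr}">{addr}</a><br>' for addr in fake_addresses(count)
    )
    return f"<html><body>\n{links}\n</body></html>\n"

if __name__ == "__main__":
    print(ROBOTS_TXT)
    print(trap_page(3))
```

Any client that shows up in the access log requesting the trap path has either ignored robots.txt or never read it.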
  3. Making them Pay by Kelson · · Score: 3, Interesting

    How about Stopping Spambots?

  4. Here is your problem: by Neil+Blender · · Score: 5, Funny

    All spiders are going to ignore your ROBOTS.TXT file. Instead, they look for a file called robots.txt.

  5. Big name != "real" by droleary · · Score: 4, Informative

    I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules.

    No, you rather see some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this highly-documented robots.txt file. I'm personally most offended by idiots running this shit, since there is no single IP block to blacklist.

  6. Re:Hey I've got an idea by jbplou · · Score: 4, Insightful

    Well, you've got a poor app if a spider can run right through it without authenticating and insert, update, or delete your data.

  7. Known Bad Bots by stoborrobots · · Score: 3, Informative

    Oh, yeah, and to actually answer the OP's question, there are lists of known bad bots out there...

  8. whitehouse.gov/robots.txt by CommandoB · · Score: 5, Interesting

    The White House seems to take a "pre-emptive" approach. Just in case they ever put stuff on the internet that they might someday not want you to see (or that they might not want archived by Google), they seem to cover all the bases in their 92KB robots.txt file.

    My personal favorites:
    Disallow: /911/iraq
    Disallow: /911/patriotism/iraq
    Disallow: /911/patriotism2/iraq
    Disallow: /911/sept112002/iraq [sic.]

    There's a theme here. Can you spot it? I'd like to think it's intentional, but at 2255 lines, it may just be that all permutations of Republican buzzwords have been covered.

    --
    Not that I post on slashdot or anything.
  9. Re:On a similar note... by afidel · · Score: 3, Interesting

    I asked Rob and he said they check for DDoS's whenever someone tries to post anonymously from an address. I told him it was busted because no one had posted anonymously from my IP, and furthermore that it's bad netiquette to port scan someone just because they accessed your site. I don't think he cares.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.