Slashdot Mirror


Who Isn't Paying Attention to ROBOTS.TXT?

Kickstart asks: "After wading through the Apache logs, having been hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this and do not say which servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
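
One way to approach the question from your own logs is to replay each logged request against your robots.txt rules and count which clients asked for disallowed paths. The following is a minimal sketch, not a finished tool: the rules, the log lines, the bot names, and the helper `find_violations` are all made up for illustration, and it assumes Apache's Combined Log Format.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical rules and log lines, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2004:00:00:01 +0000] "GET /private/a.html HTTP/1.0" 200 512 "-" "BadBot/1.0"',
    '10.0.0.2 - - [01/Jan/2004:00:00:02 +0000] "GET /index.html HTTP/1.0" 200 1024 "-" "GoodBot/2.1"',
    '10.0.0.1 - - [01/Jan/2004:00:00:03 +0000] "GET /private/b.html HTTP/1.0" 200 256 "-" "BadBot/1.0"',
]

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def find_violations(robots_txt, log_lines):
    """Count requests per (host, user agent) that robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = Counter()
    for line in log_lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not look like Combined Log Format
        if not parser.can_fetch(m["agent"], m["path"]):
            hits[(m["host"], m["agent"])] += 1
    return hits


violations = find_violations(ROBOTS_TXT, LOG_LINES)
for (host, agent), count in violations.most_common():
    print(host, agent, count)
```

Note that this flags every client that requested a disallowed path, including ordinary browsers, and that user-agent strings are easy to fake, so the output is a list of candidates to investigate rather than proof of a misbehaving spider.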

2 of 85 comments

  1. Here is your problem: by Neil Blender · Score: 5, Funny

    All spiders are going to ignore your ROBOTS.TXT file. Instead, they look for a file called robots.txt.

  2. whitehouse.gov/robots.txt by CommandoB · Score: 5, Interesting

    The White House seems to take a "pre-emptive" approach. Just in case they ever put stuff on the internet that they might someday not want you to see (or that they might not want archived by Google), they seem to cover all the bases in their 92KB robots.txt file.

    My personal favorites:
    Disallow: /911/iraq
    Disallow: /911/patriotism/iraq
    Disallow: /911/patriotism2/iraq
    Disallow: /911/sept112002/iraq [sic]

    There's a theme here. Can you spot it? I'd like to think it's intentional, but at 2255 lines, it may just be that all permutations of Republican buzzwords have been covered.
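
For readers unfamiliar with how a compliant crawler reads such lines: each `Disallow` value is matched as a path prefix, so one rule covers everything beneath it. A small sketch using Python's standard `urllib.robotparser` follows. The `User-agent: *` line, the agent name `ExampleBot`, and the helper `allowed` are assumptions for illustration, since the comment does not show which agents the rules apply to.

```python
from urllib.robotparser import RobotFileParser

# The four rules quoted above. The "User-agent: *" line is an assumption;
# the quoted excerpt does not show which agents the rules target.
RULES = """\
User-agent: *
Disallow: /911/iraq
Disallow: /911/patriotism/iraq
Disallow: /911/patriotism2/iraq
Disallow: /911/sept112002/iraq
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a rule-following crawler may fetch this path."""
    return parser.can_fetch(agent, path)


# A page under a disallowed prefix is blocked; its parent directory is not.
print(allowed("/911/iraq/index.html"))
print(allowed("/911/"))
```

This only describes what a well-behaved crawler would do; nothing in the file stops a spider that chooses to ignore it.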

    --
    Not that I post on slashdot or anything.