Slashdot Mirror


Who Isn't Paying Attention to ROBOTS.TXT?

Kickstart asks: "After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this or say what servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
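
For anyone wanting to answer this from their own logs, here is a rough sketch of one way to do it: replay each logged request against the site's robots.txt and count the clients that fetched disallowed paths. It assumes Apache's combined log format, and the names `LOG_RE` and `violators` are made up for this example. The rule matching is whatever Python's `urllib.robotparser` implements, which may differ from how a given crawler reads the file.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumed layout (Apache "combined"):
# ip - - [time] "METHOD path proto" status size "referer" "agent"
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

def violators(log_lines, robots_txt):
    """Count requests per (ip, user agent) that robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    hits = Counter()
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        ip, path, agent = m.groups()
        if not rules.can_fetch(agent, path):
            hits[(ip, agent)] += 1
    return hits
```

Calling `violators(open("access.log"), open("robots.txt").read())` gives a counter keyed by client; sorting it by count shows who ignored the rules most often. User-agent strings can be forged, so the IP half of the key is usually the more trustworthy one.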

8 of 85 comments

  1. Re:zerg by Intron · · Score: 3, Informative
    --
    Intron: the portion of DNA which expresses nothing useful.
  2. Re:zerg by dasunt · · Score: 2, Informative
    The next question should be, "How do we make them regret their non-compliance?"

    robots.txt:

    User-agent: *
    Disallow: /the-site-that-never-ends/

    It's trivial to write a script that will link back to itself to make millions of bogus pages. If you include address rewriting, it won't even appear to be a script.

    The only downside is that while you are wasting their CPU and bandwidth, you are also wasting your own resources. If your CPU is mostly idle, then it's mostly a waste of bandwidth.
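
A minimal sketch of such a self-linking page generator, assuming the trap lives under the same path as the Disallow rule above. `TRAP_PREFIX` and `trap_page` are made-up names, and wiring this to a web server or rewrite rule is left out. Each link name is derived from a hash of the current path, so nothing is stored and the same URL always returns the same page.

```python
import hashlib

TRAP_PREFIX = "/the-site-that-never-ends/"  # same path as the Disallow rule

def trap_page(path, links=5):
    """Return HTML for `path` whose links all lead deeper into the trap."""
    items = []
    for i in range(links):
        # Hash the current path plus an index to get a stable child name.
        token = hashlib.sha1(f"{path}#{i}".encode()).hexdigest()[:10]
        child = f"{path.rstrip('/')}/{token}/"
        items.append(f'<li><a href="{child}">{token}</a></li>')
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"
```

Serving `trap_page(request_path)` for every URL under the prefix gives a spider that ignores robots.txt an endless tree to walk, while compliant crawlers never see it. Generating a page costs a handful of hashes, so bandwidth is likely the main expense, as noted above.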

  3. Big name != "real" by droleary · · Score: 4, Informative

    I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules.

    No, you rather see some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this highly-documented robots.txt file. I'm personally most offended by idiots running this shit, since there is no single IP block to blacklist.

  4. Re:zerg by BrynM · · Score: 3, Informative
    From the WebPoison site:
    "WebPoison.org is an open source project..." (at the bottom of the page) "*Technically speaking, webpoison.org is not 'open source' because the source code may never be made public- doing so would undermine the project's central goal."
    Sorry, but it rubs me wrong when a project claims to be OSS on the first line of their about page only to tell me they lied in the fine print at the bottom. They may be doing a good thing, but they should be blunt and honest about it.
    --
    US Democracy:The best person for the job (among These pre-selected choices...)
  5. How to keep bad robots away by stoborrobots · · Score: 2, Informative

    http://www.fleiner.com/bots/

    I found this site through some slashdotter website long back... I've forgotten where and when, but it lends itself nicely to the topic...

    Also good is the way arxiv.org fights back.

  6. Known Bad Bots by stoborrobots · · Score: 3, Informative

    Oh, yeah, and to actually answer the OP's question, there are lists of known bad bots out there...

  7. Re:zerg by Avian+visitor · · Score: 2, Informative

    PS: Is there some kind of bot storm going on, I'm getting all kinds of weird accesses to my site today, they're all fetching just the home page and leaving, and the referrer tag is null for everyone... They may be committing click fraud through my site, which makes me mad...

    See this discussion on SecurityFocus

    http://www.securityfocus.com/archive/75/401729/30/0/threaded

  8. Got Zerg Source? by Kalak · · Score: 2, Informative

    WPoison is a Perl script, as source (naturally).

    WPoison is actually better from a technical standpoint, as it's a random page each time, not just a block of pages you download.

    --
    I am, and always will be, an idiot. Karma: Coma (mostly affected by .hack)