Who Isn't Paying Attention to ROBOTS.TXT?
Kickstart asks: "After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this or say what servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
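For anyone wanting to do the same log check, here is a rough Python sketch that flags clients which requested paths your robots.txt disallows. It assumes the Apache Combined Log Format; the names find_violations and LOG_LINE are made up for illustration, and the regex is a simplification that will skip unusual log lines. It relies on the standard library's urllib.robotparser, which matches rules against the part of the user-agent string before the first "/", so browser-style agent strings will generally only match "User-agent: *" rules.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Simplified pattern for the Apache Combined Log Format (an assumption):
# host ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def find_violations(robots_txt, log_lines):
    """Count requests per (host, agent) for paths robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    violations = Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # not a line this simplified pattern understands
        if not parser.can_fetch(m["agent"], m["path"]):
            violations[(m["host"], m["agent"])] += 1
    return violations


if __name__ == "__main__":
    # Hypothetical rules and log entries, for demonstration only.
    robots = "User-agent: *\nDisallow: /private/\n"
    logs = [
        '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] '
        '"GET /private/a.html HTTP/1.0" 200 2326 "-" "BadBot/1.0"',
        '5.6.7.8 - - [10/Oct/2000:13:55:40 -0700] '
        '"GET /index.html HTTP/1.0" 200 1024 "-" "GoodBot/2.0"',
    ]
    for (host, agent), hits in find_violations(robots, logs).most_common():
        print(host, agent, hits)
```

A hit count here only shows that a client fetched a disallowed path. It does not prove the client is a search engine spider, since user-agent strings are easy to fake, so treat the output as a list of candidates to look into.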
zerg?
Intron: the portion of DNA which expresses nothing useful.
I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules.
No, you rather see some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this highly documented robots.txt file. I'm personally most offended by idiots running this shit, since there is no single IP block to blacklist.
US Democracy: The best person for the job (among these pre-selected choices...)
Oh, yeah, and to actually answer the OP's question, there are lists of known bad bots out there...
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco