Who Isn't Paying Attention to ROBOTS.TXT?
Kickstart asks: "After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider, I see that there appear to be real, legitimate search engines that do not follow robots.txt rules. Looking around, I see that some specialized search engines make no mention of their policy on this or say what servers their spiders come from. Does anyone have information on who follows this standard and who doesn't?"
Most crawlers will obey. Spambot email harvesters usually will not. Generate a huge page of crap with loads of fake email addresses, list it in your robots.txt as disallowed, and watch the spammers grab it.
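For anyone who wants to try the trap idea against their own logs, here is a rough sketch in Python. It assumes the Apache "combined" log format and a made-up trap directory, /trap/; the names ROBOTS_TXT, LOG_LINE and find_violators are just for illustration. It counts requests per (IP, User-Agent) pair for any path that robots.txt disallows.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /trap/ holds the page of fake addresses.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

# Assumes the Apache "combined" log format; adjust if your LogFormat differs.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def find_violators(log_lines, robots_txt=ROBOTS_TXT):
    """Count requests per (ip, user-agent) for paths robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        # Only the wildcard rule is used here; per-agent rules would need
        # the crawler's product token rather than the raw header value.
        if not parser.can_fetch(m["agent"], m["path"]):
            hits[(m["ip"], m["agent"])] += 1
    return hits

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as log:
        for (ip, agent), count in find_violators(log).most_common():
            print(count, ip, agent)
```

Run it with the path to an access log as the only argument. Anything it lists either ignored robots.txt or never fetched it, so treat the output as a starting point for investigation rather than proof.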
Trolling is a art,
RTFA and realize he's not talking about loss of "sensitive" data, but rather the DoS effect of extra traffic from rude robots.
There are good reasons for robots.txt. I use it to keep crawlers from hitting the "spam" forums on my website, which is where all solicitations go. That way no Google (or other search engine) rank is gained by spamming the site.
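A quick way to sanity-check a rule like that before deploying it is Python's standard urllib.robotparser. This is only a sketch: the /forum/spam/ path and the is_crawlable helper are invented for the example, not the poster's actual setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: solicitations get moved under /forum/spam/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/spam/
"""

def is_crawlable(path, agent="*", robots_txt=ROBOTS_TXT):
    """Return True if a crawler that honors robots.txt may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

if __name__ == "__main__":
    for path in ("/forum/general/thread-1", "/forum/spam/thread-2"):
        print(path, is_crawlable(path))
```

This only tells you what a well-behaved crawler would do; it says nothing about the rude ones the original question is about.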
I could just delete it all. But I'm trying to avoid deleting any posts.
Well, you've got a poor app if a spider can run right through it without authenticating and start inserting/updating/deleting your data.