Who Isn't Paying Attention to ROBOTS.TXT?

zerg by Lord Omlette · 2005-06-09 10:12 · Score: 3, Interesting

The next question should be, "How do we make them regret their non-compliance?"

--
[o]_O

Re:zerg by Intron · 2005-06-09 10:16 · Score: 3, Informative

zerg?

--
Intron: the portion of DNA which expresses nothing useful.
Re:zerg by Eric Giguere · 2005-06-09 10:21 · Score: 4, Interesting

Start returning 500 errors... Or 302s that redirect them back to themselves...
Eric
PS: Is there some kind of bot storm going on, I'm getting all kinds of weird accesses to my site today, they're all fetching just the home page and leaving, and the referrer tag is null for everyone... They may be committing click fraud through my site, which makes me mad...
Re:zerg by dasunt · 2005-06-09 10:35 · Score: 2, Informative

The next question should be, "How do we make them regret their non-compliance?"

robots.txt:

User-agent: *
Disallow: /the-site-that-never-ends/

It's trivial to write a script that will link back to itself to make millions of bogus pages. If you include address rewriting, it won't even appear to be a script.

The only downside is that while you are wasting their CPU and bandwidth, you are also wasting your own resources. If your CPU is mostly idle, then it's mostly a waste of bandwidth.
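
A minimal sketch of such a self-linking page generator, in Python. The function name, the link count, and the ".html" naming (standing in for what address rewriting would give you) are all assumptions for illustration; only the /the-site-that-never-ends/ path comes from the robots.txt above.

```python
import hashlib

# Path taken from the robots.txt above; everything else here is hypothetical.
TRAP_PREFIX = "/the-site-that-never-ends/"


def trap_page(path, links_per_page=5):
    """Return HTML for a trap page whose links all point deeper into the trap."""
    links = []
    for i in range(links_per_page):
        # Derive child names from the current path, so the same URL always
        # yields the same page without storing anything on disk.
        token = hashlib.sha1(f"{path}#{i}".encode()).hexdigest()[:10]
        links.append(f'<a href="{TRAP_PREFIX}{token}.html">{token}</a>')
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>"
```

Every generated URL looks like a static .html file, and each page costs only a few hash computations, so most of what you spend is bandwidth.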
Re:zerg by Kelson · 2005-06-09 10:45 · Score: 1

If your site uses PHP, you may be able to adapt Bad Behavior. The script was originally developed for WordPress and has already been ported to MediaWiki and Geeklog. It identifies known "bad" robots and robots that imitate real browsers based on the HTTP headers, then sends an access denied response.
Re:zerg by Eric Giguere · 2005-06-09 11:03 · Score: 1

Actually, I was more wondering if there was some zombie going around right now. Nobody's really accessing my blog, just the home page. Also, I see a lot of accesses from the CoDeeN project, so I wonder what's up....
Re:zerg by BrynM · 2005-06-09 12:44 · Score: 3, Informative

From the WebPoison site:
"WebPoison.org is an open source project..." (and at the bottom of the page) "*Technically speaking, webpoison.org is not "open source" because the source code may never be made public- doing so would undermine the project's central goal."
Sorry, but it rubs me wrong when a project claims to be OSS on the first line of their about page only to tell me they lied in the fine print at the bottom. They may be doing a good thing, but they should be blunt and honest about it.

--
US Democracy:The best person for the job (among These pre-selected choices...)
Re:zerg by Anonymous Coward · 2005-06-09 15:19 · Score: 0

I just realized I have a page on my website that does this... however I don't exclude it from the robots... /Wooo for mod_rewrite
Re:zerg by crazyphilman · 2005-06-09 17:11 · Score: 1

Here's the programmer solution:

1. Study your firewall logs, and try to determine some baseline criteria that identifies a spider. While you're at it, note the domains the spiders are coming from.

2. Create a small perl script that, when fed the IP address of a questionable domain, automatically does an add (creating a new "DROP" rule for that domain) and tacks the command onto your existing firewall script. This is your manual tool. Of course, you should debug it using your spider list from 1, above.

3. Once you've gotten better at identifying spiders, start your new hobby: tweaking a Perl script that peruses your firewall logs every now and then, and upon detecting a spider, runs the script from part 2, above.

4. Every now and then, browse your firewall script to make sure you haven't banned anything you'll want later. Then laugh the deep belly laugh of a vengeful sysadmin and go have a beer.
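
Steps 2 and 3 could be sketched roughly as below, in Python rather than Perl. None of the specifics come from the post: the sketch assumes netfilter-style log lines carrying a SRC=<ip> field, uses a plain hit-count threshold as the "baseline criteria", and assumes iptables is the firewall. It only returns the rule commands as text, so you can review them before tacking them onto the firewall script.

```python
import ipaddress
import re
from collections import Counter

# Assumption: log lines contain "SRC=<ip>" as netfilter LOG output does.
SRC_RE = re.compile(r"\bSRC=(\S+)")


def count_sources(log_lines):
    """Count how many log lines each source address appears in."""
    counts = Counter()
    for line in log_lines:
        match = SRC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def drop_rules(log_lines, threshold=100, whitelist=()):
    """Return iptables DROP commands for sources seen at least `threshold` times."""
    rules = []
    for ip, hits in sorted(count_sources(log_lines).items()):
        if hits < threshold or ip in whitelist:
            continue
        try:
            ipaddress.ip_address(ip)  # never emit a rule for a malformed address
        except ValueError:
            continue
        rules.append(f"iptables -A INPUT -s {ip} -j DROP")
    return rules
```

The whitelist is there for step 4: it is easier to keep addresses you care about out of the list than to find them in it later.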

--
Farewell! It's been a fine buncha years!
Re:zerg by Nagus · 2005-06-09 19:00 · Score: 2, Interesting

The next question should be, "How do we make them regret their non-compliance?"

Tarpit them! Bonus points if you feed them bogus data at the same time.

Tarpitting unwelcome spiders not only limits the damage (in terms of bandwidth) they can do to you, but also the damage they can do to everyone else.

Software for this is available, for example Peachpit.

--
Wenn ist das Nunstruck git und Slotermeyer? Ja!... Beiherhund das Oder die Flipperwaldt gersput!
Re:zerg by Avian visitor · 2005-06-09 23:41 · Score: 2, Informative

PS: Is there some kind of bot storm going on, I'm getting all kinds of weird accesses to my site today, they're all fetching just the home page and leaving, and the referrer tag is null for everyone... They may be committing click fraud through my site, which makes me mad...

See this discussion on SecurityFocus

http://www.securityfocus.com/archive/75/401729/30/0/threaded

Spammers are bad (of course) by grub · 2005-06-09 10:14 · Score: 4, Insightful

Does anyone have information on who follows this standard and who doesn't?

Most crawlers will obey. Spambot email harvesters will usually not. Generate a huge page of crap with loads of fake email addresses and put that in your robots.txt as uncrawlable and watch the spammers grab it.

--
Trolling is a art,

Re:Spammers are bad (of course) by khodsden · 2005-06-09 10:28 · Score: 1

However, even the big ones don't always. Yahoo, for example, crawls my site despite a robots.txt that says

User-agent *
Disallow: /

Emails to them, some of which have included threatening legal action, have done little good.
You'd think with all the sites clamouring to be in the search engine results, they'd honor requests to be out.
Re:Spammers are bad (of course) by JabberWokky · 2005-06-09 10:50 · Score: 1

Is it Yahoo or somebody claiming to be Yahoo? I've seen useragents that claim to be from Big Companies that come from dinky IP addresses that don't seem to make sense.
--
Evan

--
"$30 for the One True Ring. $10 each additional ring!" -- JRR "Bob" Tolkien
Re:Spammers are bad (of course) by Dancing Primate · 2005-06-09 11:13 · Score: 2, Funny

You mean
User-agent: *
Disallow: /
yes?
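
For anyone wondering why the colon matters, here is a small check using Python's standard-library robots.txt parser. It shows how one parser treats the two variants quoted above, not how Yahoo's crawler behaves. The one-directive-per-line layout, the helper name, and the "ExampleBot" agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_lines, agent="ExampleBot"):
    """Return True if these robots.txt lines forbid `agent` from fetching a page."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return not parser.can_fetch(agent, "/any/page.html")


# Missing colon: the User-agent line is not recognised, so the Disallow
# rule that follows has no record to belong to.
broken = ["User-agent *", "Disallow: /"]

# With the colon the record is well formed.
fixed = ["User-agent: *", "Disallow: /"]

print("broken blocks everything:", blocks_everything(broken))
print("fixed blocks everything:", blocks_everything(fixed))
```

With this parser the broken file ends up with no rules at all, so a crawler built on it would treat the whole site as allowed.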
Re:Spammers are bad (of course) by timothv · 2005-06-09 12:22 · Score: 2, Funny

HAHAHA! Apparently not. See his own robots.txt
Re:Spammers are bad (of course) by Anonymous Coward · 2005-06-09 15:28 · Score: 0

+5 FUNNY!!!
Re:Spammers are bad (of course) by timothv · 2005-06-09 18:09 · Score: 1

Looks like he just fixed it.
Re:Spammers are bad (of course) by GraemeDonaldson · 2005-06-10 01:15 · Score: 1

Pwned! And by a dancing primate, no less. You have made my Friday. :-)

--
I think, therefore I am. I think?
Re:Spammers are bad (of course) by Anonymous Coward · 2005-06-10 03:33 · Score: 0

Funniest thing on Slashdot in loooong time.
Re:Spammers are bad (of course) by Thrakkerzog · 2005-06-10 04:07 · Score: 1

damn, i missed it.

anyone save the contents?
Re:Spammers are bad (of course) by khodsden · 2005-06-10 04:20 · Score: 1

Whoo hoo! Slashdot as a help forum! Sweet!
Re:Spammers are bad (of course) by timmyf2371 · 2005-06-11 00:17 · Score: 1

Emails to them, some of which have included threatening legal action, have done little good.
Which law makes it illegal for a search engine to ignore robots.txt files? And would you even have a valid claim against them due to non-compliance with a web standard?

--

Backup not found: (A)bort (R)etry (P)anic
Re:Spammers are bad (of course) by Anonymous Coward · 2005-06-11 12:51 · Score: 0

http://yro.slashdot.org/article.pl?sid=05/05/27/1615217&from=rss

ok, let's get this out of the way... by glamslam · 2005-06-09 10:17 · Score: 0, Offtopic

In soviet russia, spiders ignore you!

Can you imagine a beowolf cluster of these?!

1.2.3.Profit!

I *am* a spider, you insensitive clod!

Re:ok, let's get this out of the way... by Exitar · 2005-06-09 10:26 · Score: 1

All your robots.txt are belong to us.
Re:ok, let's get this out of the way... by Anonymous Coward · 2005-06-09 15:56 · Score: 0

In Korea, only old people read robots.txt

Hey I've got an idea by Anonymous Coward · 2005-06-09 10:30 · Score: 1

Why don't you play a different game. Rather than play "whine about unenforceable standards" why don't you play "Don't put stuff on the internet you don't want people to see".

Seriously. If you don't want it to get crawled, don't make it accessible by the outside. If you can't figure out how to do that, you get what you deserve.

Re:Hey I've got an idea by etymxris · 2005-06-09 10:39 · Score: 2, Insightful

There are good reasons for robots.txt. I use it to keep crawlers from hitting the "spam" forums on my website, which is where all solicitations go. That way no Google (or other search engine) rank is gained by spamming the site.

I could just delete it all. But I'm trying to avoid deleting any posts.
Re:Hey I've got an idea by Furry Ice · 2005-06-09 12:00 · Score: 1

You don't want a spider crawling over a webapp and creating, deleting, updating data it knows nothing about. If there's a new wave of robots.txt ignoring spiders, there's going to be a lot of ugly side effects.
Re:Hey I've got an idea by jbplou · 2005-06-09 12:11 · Score: 4, Insightful

Well, you've got a poor app if a spider can run right through it without authenticating and insert, update, or delete your data.
Re:Hey I've got an idea by mabinogi · 2005-06-09 23:31 · Score: 1

why the hell are you making GET requests modify data? That's what POST is for.

GET should only do just that, and a user agent should be allowed to reload a page that is the result of a GET request without fear of side effects.

You'll have trouble from more than bots if you've got an app written like that - you'll have users hitting the back and forward buttons on their browsers causing multiple entry.

[OT]
Slow Down Cowboy!

Slashdot requires you to wait 2 minutes between each successful posting of a comment to allow everyone a fair chance at posting a comment.

It's been 8 minutes since you last successfully posted a comment

Good to see slashcode living up to its usual high quality standards
[/OT]

--
Advanced users are users too!

Making them Pay by Kelson · 2005-06-09 10:30 · Score: 3, Interesting

How about Stopping Spambots?

I've got a better idea by Kelson · 2005-06-09 10:33 · Score: 2, Insightful

RTFA and realize he's not talking about loss of "sensitive" data, but rather the DOS effect of extra traffic from rude robots.

Here is your problem: by Neil Blender · 2005-06-09 10:45 · Score: 5, Funny

All spiders are going to ignore your ROBOTS.TXT file. Instead, they look for a file called robots.txt.

Re:Here is your problem: by dougmc · 2005-06-09 17:08 · Score: 1

All spiders are going to ignore your ROBOTS.TXT file. Instead, they look for a file called robots.txt.
I was going to say this (but in a better way, of course!) but suspected that somebody else might beat me to it, and indeed they did ...
However, there is a bit more to it. If he has a web server on his Windows or Mac OSX box, the odds are that the filesystem in use is case insensitive, so either robots.txt or ROBOTS.TXT will work, because either would be served up by the web server when one requested /robots.txt ...
Re:Here is your problem: by AndroidCat · 2005-06-09 22:30 · Score: 2, Funny

What a lot of sites need is a slashdot.txt file.

--
One line blog. I hear that they're called Twitters now.
Re:Here is your problem: by fredrikj · 2005-06-09 22:52 · Score: 1

How exactly is the distinction between Slashdotters and other web bots significant?
Re:Here is your problem: by Anonymous Coward · 2005-06-10 01:26 · Score: 0

How exactly is the distinction between Slashdotters and other web bots significant?

Bots are more intelligent than most /.'ers. :o)

Never liked robots.txt anyway. by Anonymous Coward · 2005-06-09 11:13 · Score: 0

If you don't want it on the public Internet, then don't link it into the public Internet.

Re:Never liked robots.txt anyway. by Anonymous Coward · 2005-06-09 16:56 · Score: 0

LOL, you dipshit.

Ideal solution... by Spoing · 2005-06-09 11:37 · Score: 1

Here's what would seem to work;

1. Create robots.txt, including references to the spam spider trap. Make sure that the legitimate references to normal pages are outnumbered by a large margin.

2. When pages that could only be referenced in the spam spider trap are accessed, note the IP address.

3. Slowly respond or block connections from the originating IP address.

Bad guys are punished. Good guys are not. Low impact on system resources.
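
Step 2 can be sketched as a pass over the web server's access log. This assumes Common Log Format lines and a trap directory named /spam-spider-trap/; both are placeholders, as is the function name. What you do with the resulting addresses (step 3) is left to your firewall or server configuration.

```python
import re

# Common Log Format: client, two ident fields, [timestamp], "METHOD path ..."
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*"')


def trapped_ips(log_lines, trap_prefix="/spam-spider-trap/"):
    """Return the set of client addresses that requested anything under the trap."""
    caught = set()
    for line in log_lines:
        match = LOG_RE.match(line)
        if match and match.group(3).startswith(trap_prefix):
            caught.add(match.group(1))
    return caught
```

Because the trap path is disallowed in robots.txt, a well-behaved crawler should never show up in the result.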

There's got to be a dozen filters out there that already do this. Anyone have experience using one?

--
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.

Re:Ideal solution... by TrebleJunkie · 2005-06-09 12:08 · Score: 1

The problem with this approach is this: If the spider doesn't bother to even read the robots.txt file, nothing gets trapped.

--
Ed R.Zahurak

You know, oblivion keeps looking better every day.
Re:Ideal solution... by Spoing · 2005-06-09 12:38 · Score: 1

The problem with this approach is this: If the spider doesn't bother to even read the robots.txt file, nothing gets trapped.
No, that's the point. When the spider ignores robots.txt, they pick up the poisoned pages. Then, because they are doing something wrong, punish them.

--
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Re:Ideal solution... by Issue9mm · 2005-06-09 14:13 · Score: 1

I think that you're overlooking the point.

If the poisoned pages are only findable from robots.txt, then if they ignore robots.txt, they won't be punished.

If they're findable via links, or whatnot, then you're punishing more than the robots (that means your users). We typically frown on people (**AA) that do that sort of thing, yes?

-9mm-
Re:Ideal solution... by Anonymous Coward · 2005-06-09 14:33 · Score: 0

No, I think you're overlooking the point.

You make a robots.txt listing the directories/pages they should not enter. You include links in pages that only a crawler would normally see. Only a crawler that ignored robots.txt or the exclusion in it would go to those trap pages and get banned.
Re:Ideal solution... by Spoing · 2005-06-09 14:48 · Score: 1

I think that you're overlooking the point.
Not at all. The AC has it right;

You make a robots.txt listing the directories/pages they should not enter. You include links in pages that only a crawler would normally see. Only a crawler that ignored robots.txt or the exclusion in it would go to those trap pages and get banned.

To make this very clear; the links on the legitimate pages are not normally visible or say things like "." or "," or "This is a trap for sp@mmer$"...whatever. Color the text white on a white background. Dinner is served.

--
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Re:Ideal solution... by X0563511 · 2005-06-09 17:06 · Score: 1

All that is fine and dandy until you get a curious user, like me, who either sees the ghost link (move the mouse over it and your cursor and status bar will reflect there being a link) or views the HTML for some reason and sees it. But then again, I can just pull out tor or something if I get banned.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Ideal solution... by Spoing · 2005-06-09 17:53 · Score: 1

All that is fine and dandy until you get a curious user, like me, who either sees the ghost link (move the mouse over it and your cursor and status bar will reflect there being a link) or views the HTML for some reason and sees it. But then again, I can just pull out tor or something if I get banned.
If someone is that inattentive, they get banned.

If you want implementation details -- and it looks like you indeed do -- I'll be glad to provide them to you for a fee. Are you that curious, or can you figure some of the basics out for yourself?

--
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Re:Ideal solution... by stoborrobots · 2005-06-09 20:12 · Score: 2, Interesting

I always liked the way that arxiv.org dealt with this matter. It clearly says that it will initiate a seek and destroy against your site, if you visit a certain link.

If you do go there, it initiates a countdown.... I've never stuck around long enough to see what happens when the countdown finishes... I like my internet connection just a little too much for that... :-)

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Re:Ideal solution... by ivan256 · 2005-06-10 06:41 · Score: 1

I've never stuck around long enough to see what happens when the countdown finishes...

Judging by the fact that I can still post this after trying it, I'd say little or nothing.
Re:Ideal solution... by Nate Eldredge · 2005-06-13 19:01 · Score: 1

Yeah, it's a total hoax. Just there to scare the naive. Cute, but kind of d
Re:Ideal solution... by hankwang · 2005-06-14 08:27 · Score: 1

I've never stuck around long enough to see what happens when the countdown finishes...
Tried it, nothing happened, could still access the site. Since the page is from 1996 or so, it might be sending a ping of death to your IP. Back then you could crash a Windows computer by sending it a nonstandard ping.

--
Avantslash: low-bandwidth mobile slashdot.

Big name != "real" by droleary · 2005-06-09 11:59 · Score: 4, Informative

I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules.

No, you rather see some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this highly-documented robots.txt file. I'm personally most offended by idiots running this shit, since there is no single IP block to blacklist.

Re:Big name != "real" by alienw · 2005-06-09 14:08 · Score: 1

Heh. I think that your attitude is part of the problem here. If you try to piss people off, they will try to piss you off too.
Re:Big name != "real" by droleary · 2005-06-09 17:08 · Score: 1

Heh. I think that your attitude is part of the problem here. If you try to piss people off, they will try to piss you off too.

I don't see your logic. By default, everyone gets to see the web site. In order to be singled out and disallowed, they must make the first effort to piss me off. If they go further and ignore robots.txt, I go further and ban by IP. At no time am I responsible for any escalation. If you think my "colorful" language is disturbing, I would say it is better for them to be able to see the "why" of the disallow so that, as if they actually cared, they could change their ways. Somehow, though, I doubt that they'll have a human look at it if they won't even bother to have their spiders look at it.
Re:Big name != "real" by IainHere · 2005-06-09 21:02 · Score: 1

For what it's worth (not much) I think your approach is perfect. And very funny. My favourite example for people who didn't follow the link:

# Another bot that ignores * disallows, even though they claim they follow the protocol.
# And what the hell is with Yahoo-VerticalCrawler-FormerWebCrawler in the agent? Pick a name!
# This may be the same bot that was listed as FAST above, but it gets a special list.
# Dirty, dirty bot. I kind of hope this is ignored so I get to block by IP.
# Update: It is! I do!
User-agent: fast
Disallow: /

looksmart? by Anonymous Coward · 2005-06-09 12:05 · Score: 0

After wading through the Apache logs, after being hit hard for three hours by a very unfriendly spider

looksmart? they do that a lot around here. /me blocks them at the FW

Blackhole them at the border routers by anticypher · 2005-06-09 13:21 · Score: 2, Interesting

There was a bunch of fsckwits called dir.com who had a real nasty spider crawling all over the place a few months ago. It blatantly ignored robots.txt, tried dictionary attacks to detect unlinked parts of the website, and may have been trying exploits to crack systems to discover secrets normally protected by passwords or logins. Honeypot email addresses fed to the spider would be spammed within days.

After too many complaints from clients about this nasty behaviour, a number of carriers started blackholing the prefixes of bad spiders at the border routers. Nice simple solution, and then you don't even see the spider traffic. Last I looked, about 20 major ISPs were blackholing prefixes of the worst spider/bot offenders.

Nobody would dare to blackhole google, but there are hundreds of google wannabes and a few of them are unethical enough to get blocked. And then they wonder why they can't see 75% of the internet.

the AC

--
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on

Incorrect Usage by Anonymous Coward · 2005-06-09 14:07 · Score: 0

The robots.txt file is a compatibility patch for T-* units. This filter enables them to more efficiently search for John Connor, or provide instructions to other T-* units. Spiders, arachnids, humans and other unauthorized users are not allowed to view the true values encoded in the file.

Re:Incorrect Usage by Anonymous Coward · 2005-06-09 17:07 · Score: 0

The spider is made of liquid metal.

Simple Solution by Refrozen · 2005-06-09 15:33 · Score: 1

Just block the bot from your site, or write some simple PHP to restrict it from querying the pages you want, and the frequency....

I'd just block the "bad-bots" though, if they don't listen to you, don't give them contact.

Or, contact the owner of the domain and get mad at them for spidering without following proper spider rules. They are wasting *your* resources in exchange for *their* profit. Get mad, get even!
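
The frequency restriction might look something like this sliding-window limiter, sketched in Python rather than PHP. The class name and the default limits are made up for illustration; in a real deployment the state would have to live somewhere shared between requests.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `max_requests` per client within a sliding time window."""

    def __init__(self, max_requests=30, window_seconds=60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client -> timestamps of recent requests

    def allow(self, client, now=None):
        """Record a request from `client` and report whether to serve it."""
        now = time.monotonic() if now is None else now
        recent = self.hits[client]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

Keying on IP address rather than user agent makes it a little harder for a bot to dodge the limit just by changing its name.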

Ah yes... by Anonymous Coward · 2005-06-09 15:56 · Score: 0

Ah yes, the "write some simple PHP" solution!

How to keep bad robots away by stoborrobots · 2005-06-09 20:15 · Score: 2, Informative

http://www.fleiner.com/bots/

I found this site through some slashdotter website long back... I've forgotten where and when, but it lends itself nicely to the topic...

Also good is the way arxiv.org fights back.

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco

Known Bad Bots by stoborrobots · 2005-06-09 20:34 · Score: 3, Informative

Oh, yeah, and to actually answer the OP's question, there are lists of known bad bots out there...

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco

Got Zerg Source? by Kalak · 2005-06-10 00:24 · Score: 2, Informative

WPoison is a Perl script, as source (naturally).

WPoison is actually better from a technical standpoint, as it's a random page each time, not just a block of pages you download.

--
I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)

Re:Got Zerg Source? by paulatz · 2005-06-10 00:28 · Score: 2, Interesting

It is better from a technical standpoint, but it could be worse from a practical one, especially if WPoison-generated pages can be automatically detected.

--
this post contain no useful information, no need to mod it down
Re:Got Zerg Source? by Kalak · 2005-06-10 01:10 · Score: 2, Interesting

I hadn't considered that until this morning, but you can add to the source to do things like randomize meta tags, include text from other pages at random, etc. to make it less likely to detect a pattern.

If you're *really* serious about non-detection, then you should vary the amount of poison in the pages, so that some will be merely annoying or almost innocent, with links that are completely lethal.

If I was a perl hacker (instead of merely playing a sysadmin at work), I'd write this idea out, so if anyone here wants to have a go, post a link.

--
I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
Re:Got Zerg Source? by Intron · 2005-06-10 04:40 · Score: 1

The Wpoison copyright requires you to put their logo on your website, which would be kind of a tipoff, right there. If I wrote a spider that did look at robots.txt I might not crawl a site with that logo. Some people just don't like spiders.

--
Intron: the portion of DNA which expresses nothing useful.
Re:Got Zerg Source? by Kalak · 2005-06-11 00:33 · Score: 1

Well, I guess I need to remove wpoison, since mine is a kid-friendly site and is definitely not a place where a logo like that is appropriate.

--
I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)

easy solution by griasr · 2005-06-10 03:49 · Score: 1

i once had to deliver a solution for that problem to a friend. i made him a php script that detects the content directory and generates a javascript-website which links into the content directory with an encrypted javascript-link which cannot be used by spiders. the content directory is being renamed to some random name every hour. the error404 leads people to the entry-page, in case they surf the content dir while it is being renamed.

whitehouse.gov/robots.txt by CommandoB · 2005-06-10 07:26 · Score: 5, Interesting

The whitehouse seems to take a "pre-emptive" approach. Just in case they ever put stuff on the internet that they might someday not want you to see (or that they might not want archived by google), they seem to cover all the bases in their 92KB robots.txt file.

My personal favorites:
Disallow: /911/iraq
Disallow: /911/patriotism/iraq
Disallow: /911/patriotism2/iraq
Disallow: /911/sept112002/iraq
[sic.]

There's a theme here. Can you spot it? I'd like to think it's intentional, but at 2255 lines, it may just be that all permutations of Republican buzzwords have been covered.

--
Not that I post on slashdot or anything.

Re:whitehouse.gov/robots.txt by RealSurreal · 2005-06-11 11:29 · Score: 1

i think i'm gonna puke. mod parent up.
Re:whitehouse.gov/robots.txt by illuminatedwax · 2005-06-13 11:46 · Score: 1

You missed the best one:
Disallow: /wmd/text

--
Did you ever notice that *nix doesn't even cover Linux?

Non-solution by Anonymous Coward · 2005-06-10 08:36 · Score: 0

There's nothing stopping someone coding a bot that groks script. What about people browsing without scripting?

That's right, you're a moron!

Re:Non-solution by griasr · 2005-06-13 02:04 · Score: 1

thank you, i run a snowboardcompany named MORON... and NOFX wrote a song for my brother and me "MORON BROS"... extend my non-solution with a flash entry instead of javascript to be more secured from javascript-able bots. to be exact, it was a solution since he was very happy with it and bots stayed in fact away from his site. sometimes you got to deliver solutions you find gay yourself. i hate any client side stuff, but sometimes you gotta suck dick in order to get your $$$.

Oh, they'll regret it if you try this. by jd · 2005-06-10 14:36 · Score: 1

Use Apache's SSI to detect browser type. If it is a known bot type, have Apache return the results of a PHP script that creates a valid header, then pipes /dev/urandom through uuencode (trimming off anything that makes it clear that it's UUencoded), so that their database then has to process a bunch of garbage.

Alternately, use it to your advantage. Have a page of text that is nothing other than porn-related words, and have Apache return that when the bot comes looking. You're guaranteed to get a lot more visitors!

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

On a similar note... by Transcendent · 2005-06-11 16:32 · Score: 2, Interesting

What is with requests for http://xxx.slashdot.org/ok.txt coming through on my webserver as if someone (Slashdot if you trace the IP) is trying to use it as a proxy?

66.35.250.150 - - [29/Jan/2005:09:50:54 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [31/Jan/2005:23:24:04 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [04/Feb/2005:23:21:43 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [08/Feb/2005:21:55:18 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 650 "-"
66.35.250.150 - - [11/Feb/2005:20:27:09 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 647 "-"
66.35.250.150 - - [21/Feb/2005:20:02:05 -0500] "GET http://games.slashdot.org/ok.txt HTTP/1.0" 404 653 "-"
66.35.250.150 - - [02/Mar/2005:20:56:12 -0500] "GET http://it.slashdot.org/ok.txt HTTP/1.0" 404 651 "-"
66.35.250.150 - - [08/Mar/2005:20:37:50 -0500] "GET http://slashdot.org/ok.txt HTTP/1.0" 404 648 "-"
66.35.250.150 - - [12/Mar/2005:09:43:37 -0500] "GET http://yro.slashdot.org/ok.txt HTTP/1.0" 404 652 "-"
...(continues, of course)

I know the article is about bad spiders, but why is slashdot doing this?

Re:On a similar note... by afidel · 2005-06-11 20:26 · Score: 3, Interesting

I asked Rob and he said they check for DDoS's whenever someone tries to post anonymously from an address. I told him it was busted because no one posted anonymously from my IP, and furthermore it's bad netiquette to port scan someone just because they accessed your site. Don't think he cares.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:On a similar note... by mcbridematt · 2005-06-15 23:02 · Score: 1

Slashdot does this to see if you are posting from an open proxy.
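
If you want to spot this kind of probe in your own logs, a rough Python sketch follows. It treats any request line whose target is a full http:// or https:// URL as a proxy-style request, which is how a client addresses a proxy instead of an origin server. The function name and the method list are assumptions; the log format matches the Apache lines quoted above.

```python
import re
from collections import Counter

# Client address, then the first quoted field: "METHOD target HTTP/x.y"
REQ_RE = re.compile(r'^(\S+) .*?"(?:GET|HEAD|POST|CONNECT) (\S+) HTTP/[\d.]+"')


def proxy_probes(log_lines):
    """Count proxy-style requests (absolute-URL targets) per client address."""
    probes = Counter()
    for line in log_lines:
        match = REQ_RE.match(line)
        if match and match.group(2).lower().startswith(("http://", "https://")):
            probes[match.group(1)] += 1
    return probes
```

A 404 on such a request, as in the log above, suggests the server declined to act as a proxy.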

Slashdot Mirror

85 comments