Domain: robotstxt.org
Stories and comments across the archive that link to robotstxt.org.
Comments · 108
-
Re:what! no Google?
"... we can find what we are looking for, whether those with the domains like it or not... "
Actually, those with the domains can prevent Google from indexing certain pages or even the entire site.
Search engines like Google use programs called robots. Search engine robots automatically scour the web, returning information about each page they visit to the search engine's indexing service. This is then cached and correlated in a database.
Robots can be blocked by using a robots.txt file in the root directory of the web site. So, if your domain is foo.com, the corresponding robots.txt file would be at http://foo.com/robots.txt
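As a rough illustration of the "root directory" rule above, here is a minimal Python sketch that derives the robots.txt location from any page URL. The helper name `robots_url` and the sample page path are made up for this example; only foo.com comes from the comment itself.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    robots_url is a hypothetical helper name used only for this sketch.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is invented for illustration.
print(robots_url("http://foo.com/some/deep/page.html?x=1"))
```

Whatever page a robot starts from, it looks for the file at the top of the same host, never in a subdirectory.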
Check out Web Server Administrator's Guide to the Robots Exclusion Protocol -
Re:I've got a solution!
s/secret.xml/robots.txt/g
http://www.robotstxt.org/wc/norobots.html -
Well Behaved Crawlers
...obey the Robot Exclusion Standard. This is not a big secret, and is linked to by all major search engines. Anyone wishing to exclude a well-behaved robot (like those of major search engines) can place a small file on their site which controls the behaviour of the robot. Don't want a robot in a particular directory? Then set your robots.txt up correctly.
P.S. Anyone keeping credit card info in a web directory that's accessible to the outside world should really think long and hard about getting out of business on the internet. -
Re:Proposal won't work: No incentive!
But if we could have kept search engines from returning it, that would have been even better. Since in our case the page was intended for internal use, we don't care whether anyone can find it from the Internet. Our real users know where to look for it.
http://www.robotstxt.org/wc/exclusion.html -
Isn't copying on the internet authorized?
Files don't get on the internet by accident! It is inherent in the medium that when you put your files under the control of a web server and have it listen for connections on a network and respond, copies will be made. If that isn't good enough, password-protect your website. Using a web server without access control is an opt-in system that clearly authorizes some copying. The only question is how much copying is authorized.
I fail to see any meaningful difference between infinite copying for free from the original site and transitive copying from a search engine. Since "deep linking" has been held to not be infringement, the argument that you aren't forced to see the whole page is bogus, since a URL can target the individual image file.
You can explicitly unauthorize search engines by using robots.txt, right? Any splitting of hairs about the scope of copying authorized by the act of putting your file in a web server can be fully accommodated by using robots.txt. Since this standard is publicly available and well known, doesn't placing your files on the web without restricting via this method constitute a grant of authority to everyone with access to your web server to copy? Now if these search engines ignore robots.txt, then that is another matter, but I doubt that is the case.
When you opt-in to copying by placing your files in a web server, but fail to subsequently explicitly opt-out after that, you have authorized copying, so tough.
The photographers say they might have to leave the net. Not so if they follow robots.txt. I don't generally think that forcing off people who won't learn how the net works is a bad thing. These groups are essentially trying to use the courts to create standards. The net already created its own standard for this in 1994. Perhaps we will have the first ruling that essentially says "RTFM". -
So say no to the robots :)
You can use a little thing called robots.txt - look it up here or here if you don't know what it is.
Allows really useful features like marking given directories, pages, or files off-limits to a specific robot or all robots in general. Boy... a technical solution to a technical problem instead of a new round of lawsuits?
Quickie examples (this is SO simple folks):
User-agent: *
Disallow: /
Boom! No more google telling that horrible world of pirates and thieves about your site. Not many visitors either though....
So maybe you want to exclude just googlebot from your images and image directory with the following:
User-agent: googlebot
Disallow: /image
This will still allow your main pages to be indexed according to your meta keywords, but will disallow any 'napsterization'. Of course, since it requires people running sites to do work and understand technology, lots of people will probably decide lawsuits are easier.
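For anyone who wants to see how a well-behaved robot would read the two examples above, here is a small sketch using Python's standard-library `urllib.robotparser`. The bot names "AnyBot" and "OtherBot" and the URLs under foo.com are invented for illustration, and this is only one parser's reading of the rules, not a statement of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The two rule sets quoted in the comment above.
block_everyone = load_rules("User-agent: *\nDisallow: /\n")
block_googlebot_images = load_rules("User-agent: googlebot\nDisallow: /image\n")

# "AnyBot", "OtherBot" and the paths are hypothetical.
print(block_everyone.can_fetch("AnyBot", "http://foo.com/index.html"))
print(block_googlebot_images.can_fetch("Googlebot", "http://foo.com/image/cat.jpg"))
print(block_googlebot_images.can_fetch("Googlebot", "http://foo.com/index.html"))
print(block_googlebot_images.can_fetch("OtherBot", "http://foo.com/image/cat.jpg"))
```

The first two checks come back disallowed and the last two allowed: the second file only restricts the robot it names, and only under the /image prefix.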
Robots.txt DOES require you to run your own domain. If you don't, try using meta tags in the head of the html code for a similar effect, but it is harder to implement (must be on each page rather than site wide) and less supported. Info here.
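To make the meta tag route a bit more concrete, here is a sketch of how a robot might pull the robots directives out of a page's head, using Python's standard `html.parser`. The class name, the sample page, and the "noindex, nofollow" value are illustrative assumptions, not something taken from the comment.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated; normalise case and whitespace.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page that asks robots not to index it or follow its links.
page = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body>family photos</body></html>"
)
directives = robots_directives(page)
print("noindex" in directives, "nofollow" in directives)
```

Because the tag lives in each page, a robot has to fetch the page before it can learn it was unwelcome, which is part of why this is the clumsier option.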
If you spend that much time on the images... spend 5 minutes making a robots.txt file to indicate you don't want them taken by bots. But always consider anything you put on the net as published, if something's private don't put it on the net. -
Especially since robots.txt lets you disallow this
A little thing called robots.txt - look it up here or here if you don't know what it is.
Allows really useful features like marking given directories, pages, or files off-limits to a specific robot or all robots in general. Boy... a technical solution to a technical problem? Who'd a thunk it?
Quickie examples (this is SO simple folks):
User-agent: *
Disallow: /
Boom! No more google telling that horrible world of pirates and thieves about your site. Not many visitors either though....
So maybe you want to exclude just googlebot from your images and image directory with the following:
User-agent: googlebot
Disallow: /image
If you want to do this for multiple directories, you add on more Disallow lines:
User-agent: *
Disallow: /image
Disallow: /cgi-bin/
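Under the original 1994 convention each Disallow value is just a path prefix, so the check a robot does is about this simple. The sketch below is a hand-rolled illustration in Python, not any real crawler's code; the function name and the test paths are made up.

```python
def is_disallowed(path: str, disallow_prefixes: list) -> bool:
    """Return True if path starts with any non-empty Disallow prefix.

    An empty Disallow value means "nothing is disallowed", so it is skipped.
    """
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)


# The two Disallow lines from the example above.
rules = ["/image", "/cgi-bin/"]

for path in ["/image/cat.png", "/images/cat.png", "/cgi-bin/search", "/about.html"]:
    print(path, "blocked" if is_disallowed(path, rules) else "allowed")
```

Note that "/image" without a trailing slash also catches "/images/...", while "/cgi-bin/" with the slash only catches things inside that directory.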
Now if you put
<meta name="robots" content="All,INDEX">
<meta name="revisit-after" content="5 days">
in your code to show up high on the search engines, you shouldn't be surprised or upset when you SHOW UP HIGH ON THE SEARCH ENGINES.
Not all robots follow the robots.txt standard, and there's no way of forcing them to. But google does, and that seems to be the big concern here.
A real-life example, slashdot's robots.txt file (at slashdot.org/robots.txt):
# robots.txt for Slashdot.org
User-agent: *
Disallow: /index.pl
Disallow: /article.pl
Disallow: /comments.pl
Disallow: /users.pl
Disallow: /search.pl
Disallow: /palm
Disallow: index.pl
Disallow: article.pl
Disallow: comments.pl
Disallow: users.pl
Disallow: search.pl
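If you want to try the file quoted above against a few URLs, here is a sketch with Python's standard-library `urllib.robotparser`. The bot name "AnyBot" and the candidate URLs are invented for the example, and the result is only what this one parser makes of the file as quoted.

```python
from urllib.robotparser import RobotFileParser

# The file exactly as quoted in the comment above.
SLASHDOT_ROBOTS_TXT = """\
# robots.txt for Slashdot.org
User-agent: *
Disallow: /index.pl
Disallow: /article.pl
Disallow: /comments.pl
Disallow: /users.pl
Disallow: /search.pl
Disallow: /palm
Disallow: index.pl
Disallow: article.pl
Disallow: comments.pl
Disallow: users.pl
Disallow: search.pl
"""

parser = RobotFileParser()
parser.parse(SLASHDOT_ROBOTS_TXT.splitlines())

# Hypothetical URLs a crawler might have queued up.
candidates = [
    "http://slashdot.org/",
    "http://slashdot.org/index.pl",
    "http://slashdot.org/comments.pl?sid=1",
    "http://slashdot.org/palm/",
]

allowed = [url for url in candidates if parser.can_fetch("AnyBot", url)]
print(allowed)
```

The dynamic .pl pages and the /palm tree get filtered out, while the front page stays fetchable.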