Domain: robotstxt.org
Stories and comments across the archive that link to robotstxt.org.
Comments · 108
-
It is already optional
Any news service not wanting Google to display their articles in Google News just needs to add robots.txt file to their website which asks Google not to index their site. Google will then not index the site, and they will not show up in any Google News article or web searches. That these news services don't do this with a simple robots.txt file tells you their true motivations.
The only reason this proposed law exists is because these news services want to force Google to index them, and also pay them. That is, they want the service Google is offering, but instead of paying for a desired service (or accepting it for free, which is what Google currently does) like everyone else does for something they want, they instead want Google to pay them for it.
It's like someone building a road to make it easier for people to reach a shopping mall. Then the stores in the shopping mall demand that the road owner pay them, because the road would not get the traffic it does if it weren't for the presence of the stores. The correct base level of comparison here is before the road was built. The road results in increasing traffic to the stores, so it is already a benefit to the stores (the road owner is already "paying" them via increased visitors). It's completely backwards from how economics is supposed to work. And the misguided belief only exists because these copyright holders have been living in a protected bubble provided by the monopoly copyright law gives them, which shields them from normal economic forces. -
Re:Simple solution for Google & Facebook
Right. But isn't this what robots.txt is for? Perhaps we need to update the RFC to indicate that the page(s) are okay for search results, but not okay for aggregators? Seems like a simple fix that doesn't involve lawyers.
-
Re:The upshot is
http://www.robotstxt.org/robot...
There are two important considerations when using /robots.txt:
- robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
- the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
So don't try to use /robots.txt to hide information. -
link and robots.txt comment
Here's the link: http://curia.europa.eu/juris/document/document.jsf?text=&docid=138782&pageIndex=0&doclang=EN&mode=lst&dir=&occ=first&part=1&cid=324003
I think that paragraph D.41 is important to remember:
41. Source web pages are kept on host servers connected to internet. The publisher of source web pages can make use of ‘exclusion codes’ (27) for the operation of the internet search engines. Exclusion codes advise search engines not to index or to store a source web page or to display it within the search results. (28) Their use indicates that the publisher does not want certain information on the source web page to be retrieved for dissemination through search engines.
And footnote 27 about those "exclusion codes" says:
27 – A typical current exclusion code (or robot exclusion protocol) is called ‘robots.txt’; see http://en.wikipedia.org/wiki/Robots.txt or http://www.robotstxt.org/.
Now I know that almost all Slashdotters already know about this, but if this passes then it means that (in the EU) it is written in law that a search engine spider isn't allowed to use stuff that you kept out with your robots.txt file.
-
robots.txt -- yet again
If the law passes, the search engines will go "fuck that" and only index free content or newspapers that specifically allow their stuff to be indexed for free. The other newspapers will lose their only remaining readers under fifty and die out along with that generation.
The supreme stupidity here is that a law is not needed. If the sites don't want to be indexed, it's dead simple to set up robots.txt to keep out Google and the others. But that's been pointed out thousands of times by now. So if they and the courts are not getting it by now, it is because they choose not to get it.
-
Re:What's it look like?
https://joindiaspora.com/robots.txt
---
# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-Agent: *
Disallow: /people/
Disallow: /u/
---
Which seems insane for a social networking site to do.
-
Re:Strange sense of morals
From http://www.robotstxt.org/robotstxt.html: Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
robots.txt is not a "forbidden list." It is simply a polite request to avoid a robot crawling things that should not be indexed. It is often used to avoid a bot pulling an ftp site published via http or crawling dynamically generated content.
Nothing illegal, immoral or fattening about manually accessing a file listed in a robots.txt file. It is rather normal and you likely do it every day without realizing it.
-
Re:GOOGLE BIGGEST PIRATE OF THEM ALL !!
Numpty.
robots.txt is just a plain text file - all you do is create the file in notepad, save it, and upload it to your website like you would any photo or background image. There's no 'Bigwheels'... even a kid can do it, even on the free hosts.
And yes, it does compensate you, by sending traffic to your site (how did you think search worked, by magic?).
Seriously : user guide for dummies
-
Re:I worry about robots
Thankfully we've been defending against this for years.
-
http://www.robotstxt.org/
Says it all really, no need for a melodramatic "article" trying to draw parallels between the non indexed and page ranked portion of the net, and kiddie porn.
Some of us just don't want google indexing our stuff on general principles.
FX, types "brain tumour" into google
up pops page full of links asking me if I want to buy a brain tumour on fleabay...
-
Re:Robots.txt
is there some written law that holds people to following robots.txt? if not, how is it even possible to call it a weakness?
Nope:
There is no law stating that /robots.txt must be obeyed, nor does it constitute a binding contract between site owner and user, but having a /robots.txt can be relevant in legal cases. (from here -
Re:i'm a little clueless here
On the other hand, that's an utterly asinine comment to have made (the one you quote, not yours). Of course they'll ignore it, why on Earth wouldn't they? It is in no way binding, and robots are free to ignore it, just as site owners are free to block connections from specific incoming IP addresses, the owners of those IPs are free to switch to new ones, and so on, ad infinitum.
-
CPAN webserver broken
The spec for robots.txt says that strings matched internally in the text file should be done in a case insensitive manner.
It would only make sense for a "reasonable person" to assume that any web fetches for a file name for 'robots.txt' should also match in a case insensitive manner.
This sounds like Microsoft being used to Uppercasing the first letter of words -- which looks aesthetically pleasing, and not having it make any real difference on 70% of the computers on the planet (running Microsoft) and (in my experience, on most webservers running apache). Never noticed any case sensitivity.
This looks like a case of the perl guys being at fault. They likely have a web-server written in perl and didn't do a case ignore when processing requests for 'robots.txt'. This violates the intent if not the letter of the spec.
Check out http://www.robotstxt.org/orig.html. It specifies that all of its strings should be matched in a case insensitive manner. It doesn't explicitly say that the filename 'robots.txt' should also be matched by the webserver in a case insensitive manner, but if it specifies that all of the web-addresses in the file should be handled in a case-insensitive manner, doesn't it make sense that the file name itself should also be case insensitive?
People should use a little common sense before going off and blaming Microsoft for doing something that is perfectly natural and perfectly understandable, while the supposed victims should be a bit more robust in the design of the web server.
At least, that's how it appears to me -- anyone care to show me a sound reasoning why it should be otherwise or why one would expect otherwise?
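For what it's worth, the two kinds of matching can be checked against one implementation. A minimal sketch using Python's standard urllib.robotparser (the robot name ExampleBot and the example.com URLs are made up for illustration); it shows only how that one parser behaves, not what the spec text requires:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named robot, one disallowed directory.
rules = """\
User-agent: ExampleBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The robot name is compared without regard to case, so the rule applies
# to "examplebot" and the fetch is refused.
agent_match = parser.can_fetch("examplebot", "http://example.com/private/page.html")

# The path prefix is compared exactly as written, so "/Private/" does not
# match "/private/" and the fetch is permitted.
path_match = parser.can_fetch("ExampleBot", "http://example.com/Private/page.html")

print(agent_match)
print(path_match)
```

In this parser, at least, robot names ignore case while URL paths do not.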
-
Re:Decisions, decisions.
"index" == robots.txt.
-
Re:What is going one here?
they're not legally obliged to do it - they do it to comply with standards, and I imagine it is to do with their "don't be evil" (not "do no evil" - doing and being have different implications) slogan, since ignoring robots.txt would be exploitative and antisocial.
-
Re:I don't get it
I don't understand how news sites are going to drop from Google.
google follows the robots.txt file, so if you have that file with the content "User-agent: *
Disallow: /" then google will not index any of that site. Those sites could then either sell an index to bing, or use the useragent string to detect Google's indexing requests and only send google the robots.txt file with the disallow. -
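Serving a different file per crawler may not even be necessary, since a single robots.txt can carry a separate record for each robot name. A rough sketch, checked with Python's urllib.robotparser (the news.example host and the OtherBot name are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# One file, two records: Googlebot is shut out, everyone else is let in.
robots_txt = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

story = "http://news.example/story.html"  # hypothetical URL
googlebot_ok = parser.can_fetch("Googlebot", story)
other_ok = parser.can_fetch("OtherBot", story)

print(googlebot_ok, other_ok)
```

An empty Disallow line means nothing is off limits for that record.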
Re:Um.
>> You can't voluntarily leave the Google Index.
wrong.
http://www.robotstxt.org/
http://en.wikipedia.org/wiki/Robots_exclusion_standard
A fine point, but not quite the same thing as leaving it.
-
Re:Um.
>> You can't voluntarily leave the Google Index.
wrong.
http://www.robotstxt.org/
http://en.wikipedia.org/wiki/Robots_exclusion_standard -
Re:Use a file?
Oh, please don't do that. Don't assume that we have rights to that directory. I already really really wish I could set robots.txt for just my subdirectory, but no can do since some semi-moron thought it would be a good idea to make me mail my school department's webmaster to exclude part of my directory.
You can do everything that you do with robots.txt via robots meta tags and streamline their inclusion with some server-side scripts if so desired.
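As a rough illustration of what a crawler has to do to honor a robots meta tag, here is a sketch using Python's standard html.parser. The class name RobotsMetaParser and the sample page are made up for the example; real crawlers may interpret more directives than the two checked here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated; compare them in lower case.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# Hypothetical page living in a user's own subdirectory.
page = '<html><head><meta name="robots" content="NOINDEX, NOFOLLOW"></head></html>'

meta = RobotsMetaParser()
meta.feed(page)
may_index = "noindex" not in meta.directives
may_follow = "nofollow" not in meta.directives
print(may_index, may_follow)
```

Unlike /robots.txt, the tag travels with the page, so no access to the server root is needed.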
-
Re:Shoot the messenger!
if you're not placing robots.txt files at the root of all pages having to do with payment, you're making a stupid mistake.
You're kidding, right? That's not how you use robots.txt files at all. http://www.robotstxt.org/robotstxt.html
-
if-modified-since
Crawl-delay directive
Several major crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server: [1] [2]
User-agent: *
Crawl-delay: 10
Further, not only do the Google crawlers obey the robots.txt described above (or other standards for robot exclusion), they also use HTTP's if-modified-since to make a conditional request. The file is only returned to the crawler if it has been changed. That saves a lot of time and bandwidth.
PC World will also lose out if double-dipping is allowed.
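Both mechanisms can be sketched from the crawler's side with Python's standard library. This is an assumption-laden illustration only: the robot name ExampleBot, the example.com URL and the epoch-zero timestamp are placeholders, and the request is built but never sent.

```python
from email.utils import formatdate
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Read the Crawl-delay value from the robots.txt lines quoted above.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 10"])
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests

# Build a conditional request. The timestamp stands for the time of the
# copy the crawler already holds (placeholder: epoch second 0).
last_fetch = formatdate(0, usegmt=True)
request = Request(
    "http://example.com/page.html",  # hypothetical URL
    headers={"If-Modified-Since": last_fetch},
)

print(delay)
print(request.get_header("If-modified-since"))
```

A server that supports conditional requests answers 304 Not Modified with no body when the page has not changed since that time.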
-
Re:ZXTM TrafficScript rule:
Don't forget robots.txt!
That kind of script will block useful things like Googlebots too... -
Re:What about register forms?
Does that mean I'll have to introduce methods that waste people's time in order to prevent google from registering on my site multiple times?
Yes, if you require all your human visitors to read your robots.txt, and then require them to check a checkbox to mean that they clearly read and understood the entire body of your robots.txt. Then yes, you'll have to introduce some sort of almost impossible-to-read translucent captcha written in classical Chinese. -
ROBOTS.TXT & CONTENT="NOINDEX", "NOFOLLOW"
http://www.robotstxt.org/
Dang, that was hard. Damn you, GOOGLE! Damn you to HELL! You blew it up! You finally blew up the web!
Or not. -
Re:I don't like that defense
Did they ask all the web site owners? No?
Yes. -
And one helpful standard:
Also helps to know the Robots Exclusion Standard, to keep the riff-raff out.
-
Re:Using customer logins?
Well, robots.txt is your friend.
-
How to exclude Search Engine Spiders
Tangential to the story, but if you want to exclude a search engine spider, you can use the robots exclusion protocol.
Appears it wouldn't have made too much of a difference here, but perhaps something useful to know.
-
Re:Posted notice?
Especially in the day of publishing on the web, where when you decide to stop publishing, it's gone. If you publish a book and sell it, whoever bought the book can come back to it over and over. If you remove your webpage it's gone -- unless some asshat corporation (non-profit or otherwise) comes along and decides to republish your content without your permission.
So if I write something like a web cache such as Coral, I'm violating your copyright if the information on your site disappears? I have the feeling that you just don't understand the way the internet works. People who "publish" content make it available, and people who "surf" the content pull down copies. Those copies are transmitted through several network systems on the way to their destination, and any network in the middle could copy the data for purposes of caching or otherwise.
It's just the way the internet and web work. Your little opinions don't really matter. If the web didn't work this way, it would not be reliable, and it would not be useful. Search engines and archives and caches are services without which we would get significantly less benefit from the information put out on the web.
If you don't want to make information publicly available, don't put it on the internet, and don't advertise its presence by linking your content. Sue people who violate your copyright if they put it on the internet. If you want to make information publicly available to a private community, then use appropriate technologies like Bulletin Board software with accounts and passwords.
Copyright is not violated by computers. Copyright is violated by people. You can't sue computers. Large organizations with lots of money (big targets for lawsuits) will in general try not to violate copyright, but they aren't in general responsible for the copyright violations that the masses do through their services, just like how the FSF isn't responsible if you use their software to violate copyright or Microsoft isn't responsible if you use their software to violate copyright. Wouldn't it suck if Microsoft Word didn't let you "Copy and Paste", because "God forbid you might be copying/pasting something that was copyrighted by someone else, and Word can't make the distinction". Being indexed should be opt-in. Just like being spammed. Robots.txt can be used to advertise where a crawler should index as well as where a crawler should not. It is both opt-in and opt-out at the same time. Before you write another uninformed word about it, you should read more about it. You should also read this google blog post, this google blog post, The main robots.txt site, and the RFC.
There are crawlers which "violate robots.txt" (usually those crawlers are just poor implementations by people learning to write a crawler or unfinished programs - people who write real crawlers in general understand that you probably have good reasons for not wanting them crawling those pages of your site, and they don't want to waste their bandwidth on them). -
Re:Posted notice?
Maybe, but robots.txt (aka the robots exclusion standard) is one of the most misunderstood standards on the intarweb, so it doesn't surprise me that people are misguided enough to think that robots.txt turns copyright into an opt-in law.
Robots.txt is a hint to robots so that they don't fall into bottomless pits, of which there are many on the web, start posting/deleting/modding comments or try to archive huge files which they can't handle. It is not a way of forbidding access. If you don't believe me, get the info straight from the horse's mouth: http://www.robotstxt.org/wc/exclusion.html (note how that page is full of "should"s and "indicate"s) -
Re:Big deal?
I believe that most people would agree that opt-out does a service to the vast majority of site publishers who are unaware of robots.txt. That's my argument.
Opt-in would educate those site publishers pretty quickly.
The bigger issue though, is that traditionally robots.txt files are designed for opt-out. There is a draft RFC (written in 1996) that allows fine-grain control with both Allow and Disallow commands, which could be further extended in the google age to approve or deny additional options like caching or thumbnailing, etc.
Misbehaving bots using the file specifically to find things that shouldn't be spidered could be whacked pretty easily by starting the file with a dummy agent name that allows and denies scripts that add visitors IP addresses to a ban list. -
Re:Posted notice?
If she didn't bother to post the notice correctly, the case should be just thrown out.
While I agree with you in principle, the law suggests that she did post the notice correctly, since (from what I gathered FTFA) the law doesn't make any distinctions between a human eyeball and a robot eyeball. ...attorney John Ottaviani, ..., says the issue is "whether there was 'an adequate notice of the existence of the terms' and a 'meaningful opportunity to review' the terms."
I'd suggest that not using robots.txt & putting the contract terms at the bottom of the page does not provide a bot with "'an adequate notice of the existence of the terms'" since the bot has to grab the entire page before getting to the notice... which is why robots.txt gets checked before anything is spidered.
The Judge may disagree & say that spiders will just have to learn to read.
There's also a meta 'robots' tag, which bots may or may not recognize. -
robots.txt
It's called a robots.txt file, and all reputable webcrawlers/search engines respect it. We don't yet have the technology to read arbitrary exclusion notices posted in English, but we do have a widely used machine-readable standard for requesting that your pages not be indexed. If the person in question was so adamant about her site not being crawled, she should have spent the 5 minutes needed to read up on robots.txt.
-
/robots.txt and meta elements
Maybe Google should be kind enough to ask permission in the form of webmasters creating robots.txt
Only the administrator of a host has the authority to modify /robots.txt (note the initial slash). Web publishers who use, say, GeoCities or the web space included with most ISP plans do not have the authority to modify files in /, only files in /~tepples. Such publishers rely on meta elements, and when meta elements are not present, web copynorms and the default rules specified by 17 USC 512 and foreign counterparts should take precedence.
instead of just assuming that anyone who doesn't go out of their way to satisfy Google's policy is an open target?
The /robots.txt and meta element protocols have been around for nearly a decade. Specifying rules in a meta element is no more going out of one's way than specifying a stylesheet in a link element.
This is like you crossing a stranger's property... in all human decency it's normally to ask before crossing
Unless one is crossing the property on a sidewalk that runs through the property. -
Re:Google's in C++?
No Google bot is written in C++
http://www.robotstxt.org/wc/active/html/googlebot.html
When you need high performance, C++ is a better choice than any other language. Google (or Yahoo) won't have a single-language framework to run its platform. It will always be a combination of languages. From what I have read so far, Google's core search engine is in C++ and several C++ libraries are available as Python modules. Standalone products may be written in specific languages. Gmail and Google Calendar are written in Java. -
robots.txt?
Isn't it because of their robots.txt? http://talkorigins.org/robots.txt
# robots.txt for http://www.talkorigins.org/
# This document is to tell robots
# (sometimes called spiders) which are
# means of automatically grabbing our files
# what they can and cannot do. Robots are
# used by search engines, archivers
# (www.archive.org for example) and by spammers
# looking for email addresses
# This file must be in the root directory and called
# robots.txt
# This file can be validated at
# http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
# More info can be found at
# http://www.robotstxt.org/wc/exclusion-admin.html
# and many other places via Googling robots.txt
# User-agent '*' means any robot.
User-agent: *
Disallow: /faqs/comdesc/contact.html
Disallow: /faqs/comdesc/DLTtools.js
Disallow: /faqs/comdesc/drafts/
Disallow: /faqs/comdesc/ICsilly.html
Disallow: /origins/contact.html
Disallow: /cgi-bin/
Disallow: /scgi-bin/
Disallow: /work/
Disallow: /rss/test.xml -
change behaviour for bots
In your server, you can code the logic to take another action if the user agent is a bot.
Here you have a db of web robots. -
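A bare-bones sketch of that server-side branch, in Python. The token list and the function names is_bot and choose_response are hypothetical; a real deployment would draw its names from a maintained robots database rather than three hard-coded strings.

```python
# Hypothetical, deliberately short list of substrings seen in crawler
# User-Agent headers.
KNOWN_BOT_TOKENS = ("googlebot", "bingbot", "slurp")


def is_bot(user_agent):
    """Return True if the User-Agent header contains a known bot token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)


def choose_response(user_agent):
    """Pick which variant of the page to serve (the labels are placeholders)."""
    return "bot-version" if is_bot(user_agent) else "full-page"


print(choose_response("Mozilla/5.0 (compatible; Googlebot/2.1)"))
print(choose_response("Mozilla/5.0 (X11; Linux x86_64) Firefox/1.5"))
```

Keep in mind that the User-Agent header is supplied by the client, so anything can claim to be, or not to be, a robot.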
some points
- Don't forget to check and respect robots.txt. Python has a module that helps you parse that file
- Samie and its Python port Pamie are your friends. You can automate IE so your script is treated as a human and not discriminated against as a robot.
- I use such beasts to do one-click time reporting at work and one-click cartoon collecting in my favorite newspaper.
- And once I even repeatedly voted on an online poll and changed the course of history.
- Ah, yes, TFA was about building a spider on Linux. I didn't check if my one-click IE scripts work on IE/Wine/Linux.
- If I write a one-click script for online shopping, does it infringe the infamous Amazon patent?
- When will Firefox's automation capabilities match those of IE?
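The Python module mentioned in the first point is urllib.robotparser in Python 3 (it was a top-level robotparser module in Python 2). A minimal sketch of how a spider might use it to filter its queue; the helper allowed_urls, the robot name ExampleSpider and the example.com URLs are made up for illustration:

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_lines, agent, urls):
    """Return only the URLs that the given robots.txt lines let `agent` fetch."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in urls if parser.can_fetch(agent, url)]


# Rules supplied inline so the sketch runs without network access.
rules = ["User-agent: *", "Disallow: /cgi-bin/", "Disallow: /work/"]
candidates = [
    "http://example.com/index.html",
    "http://example.com/cgi-bin/vote.cgi",
    "http://example.com/work/draft.html",
]

to_crawl = allowed_urls(rules, "ExampleSpider", candidates)
print(to_crawl)
```

Against a live site the same parser can load the file itself: call set_url() with the site's /robots.txt address, then read(), before asking can_fetch().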
-
Re:I don't get it
What I don't get is why the newspaper in question didn't just throw up a robots.txt file, blocking Google's news spiders, and then ask politely for Google to remove all existing content from their indexes.
I guess they'd just rather flex some highly paid lawyer muscle and deal with the expenses of a court battle than get some web monkey sat in a broom cupboard somewhere to take 10 minutes out of their busy schedule and do this... -
Incompetence at work
Any competent web developer should know how to use The Robots Exclusion Protocol to prevent crawlers from crawling/indexing a web site. Why News Sites do not want to be visited by Google is really beyond me - it is free advertising! Visitors still have to visit the news sites if they want to read anything but a short article summary.
-
Re:The problem is Google Cache, I think
You probably want to read Google's Guide for Webmasters and the Robots Exclusion guides.
In my experience, Google no longer caches websites that haven't been indexable in some time. That is to say, if you remove the page or even better -- replace it with an empty one that links to an excluded page, Google should (and most likely will) remove the cache of the originally indexed page. I'd expect this to happen within a month or so (from my experience).
No guarantees. -
Re:I sense a little two-faced opinion here
Maybe you need to inform yourself of what Robot Exclusion is and isn't.
Its purpose is not to censor information but to avoid incidents caused by aggressive robots that could stress WWW servers (introduction in the first link).
HA action is revisionism. Like a politician yelling something then a few years later claiming he never said such a thing and threatening people with a piece of evidence to the contrary. -
Re: alternatives
My personal web site didn't exactly have this story in bold type off the main page or anything.
What you need to do is put a "robots.txt" file into the root of your web directory to prevent certain parts of your website from showing up on a search engine. (See this F.A.Q.) That's how I prevent the photographic images on my website from showing up in an image search. You need to do the same for the BBS section. Just Google for your name and see what comes up. You might be surprised. -
Did you support John Seigenthaler?
So you're saying that Google have a specific procedure for requests, and requests which aren't following their procedure are legally invalid?
I'm saying that the procedures recognized by Google (including the WebCrawler Robots Exclusion Standard) have been part of the copynorms for longer than I've been able to vote or use the Web.
I suspect a good old fashioned certified letter to their headquarters would be legally acceptable.
But why send a certified letter to the operator of each conforming search engine when you can more easily remove sensitive pages from all of them at once using well-known robots exclusion methods? Is it because your name is John Seigenthaler, who chose to make a big deal of things instead of just editing an alleged libel out of his own biography?
-
Re:Easily solved with software
A quick search utility
I think you meant a 'Web' Robot.
http://www.robotstxt.org/wc/faq.html
programmed correctly you can even assign the robot a login/pass to default to when asked :) and make sure the robot can search even those pages for info that shouldn't be available on the web that easily. -
robots.txt for books
Maybe in the near future we will see some sort of robots.txt page at the start of every book.
That would be a solution publishers could use. -
Let's not go to Canada. 'Tis a silly place.
- Any company that wants to put copyright material on their web site, but doesn't want it indexed, should learn about the robots.txt file.
- As stated in TFA, the law would make any search engine illegal. Given that hiding your site from all search engines makes it pretty much invisible to the rest of the internet, why bother to have a public web site anyway?
-
Re:You can't change history
If you check http://www.robotstxt.org/wc/robots.html, you'll note that there are no date range options to the robots.txt file. In other words, you can't specify that historical data is to be excluded.
Yes, the article is misleading. What the Internet Archive does is respect the user-agent disallow -- and if the crawler finds that it is disallowed, it will stop access to previously archived material. You can read about it here: http://www.archive.org/about/exclude.php
-
You can't change history
If you check http://www.robotstxt.org/wc/robots.html, you'll note that there are no date range options to the robots.txt file. In other words, you can't specify that historical data is to be excluded.
Aside from that, posting a robots.txt file after the lawsuit is like republishing source code under a different license. The new license does not affect the licensing of older copies of the code other people have saved away.
If you were allowed to expect caches to retroactively honor robots.txt, you could expect a flood of lawsuits from unscrupulous people adding robots.txt to their websites after they'd been added to archives.