Domain: searchengineworld.com
Stories and comments across the archive that link to searchengineworld.com.
Comments · 26
-
Re:Ummm, duh?
From another article:
"About 34 percent of Internet usage in China takes place in Internet cafes, which are more popular in rural areas, where they account for about 48 percent of Internet usage, according to the study, which also notes that Internet access both at home and at work is growing rapidly in China."
Whatever way you slice it, that's still fewer computers.
-
robots.txt?
Isn't it because of their robots.txt? http://talkorigins.org/robots.txt
# robots.txt for http://www.talkorigins.org/
# This document is to tell robots
# (sometimes called spiders) which are
# means of automatically grabbing our files
# what they can and cannot do. Robots are
# used by search engines, archivers
# (www.archive.org for example) and by spammers
# looking for email addresses
# This file must be in the root directory and called
# robots.txt
# This file can be validated at
# http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
# More info can be found at
# http://www.robotstxt.org/wc/exclusion-admin.html
# and many other places via Googling robots.txt
# User-agent '*' means any robot.
User-agent: *
Disallow: /faqs/comdesc/contact.html
Disallow: /faqs/comdesc/DLTtools.js
Disallow: /faqs/comdesc/drafts/
Disallow: /faqs/comdesc/ICsilly.html
Disallow: /origins/contact.html
Disallow: /cgi-bin/
Disallow: /scgi-bin/
Disallow: /work/
Disallow: /rss/test.xml
-
Re:Why do I get the feeling...
Their Robots.txt file is actually quite funny.
I did a quick Google search for "User-agent" and "Disallow" like this: http://www.google.co.uk/search?hl=en&q=User-agent+Disallow&btnG=Google+Search&meta=
I chose the first title, http://www.searchengineworld.com/robots/robots_tutorial.htm, and started reading:
"User-agent
The User-agent line specifies the robot. For example:
User-agent: googlebot"
-SNIP-
"Disallow:
The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example..."
-SNIP-
"If you leave the Disallow line blank, it indicates that ALL files may be retrieved." -SNIP-
So, guess who is to blame?
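The two quoted rules are easy to mix up, so here is a minimal sketch that checks them with Python's standard urllib.robotparser module. The helper name allowed, the bot name examplebot and the example.com URL are made up for illustration, and other crawlers may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# A blank Disallow line permits everything.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
# A single slash blocks the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

URL = "http://example.com/news/story.html"  # hypothetical URL

print(allowed(ALLOW_ALL, "examplebot", URL))
print(allowed(BLOCK_ALL, "examplebot", URL))
```

The only difference between the two files is the trailing slash, which is why a blank Disallow line is so easy to misread as "block everything".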
-
Re:"Shooting themselves in the foot" is right
"The funny thing is that the news companies just cant opt out of it...without making every other person that comes to their site sign up for an account or something similar that would effectively close the news site off to pretty much everyone and everything on the web."
robots.txt
http://www.searchengineworld.com/robots/robots_tutorial.htm
Are we claiming that Google does not obey robots.txt?
all the best,
drew
------
http://www.ourmedia.org/node/58805
-
Re:Google does as paper does
I have been around (on the net) since before AltaVista and I would guess that web crawlers have been operating before many news sites got on the web.
Surely they know how these things work. Just put a robots.txt or a password login on their site and put the kibosh on Google. What is the problem?
http://www.searchengineworld.com/robots/robots_tutorial.htm
Just a clue.
Or hey, block Google's IP addresses.
all the best,
drew
-
robots.txt
If a newspaper doesn't want to be indexed, then they can update their robots.txt file.
-
26 steps guide, recommended reading
Another good resource is this old but still very applicable guide, 26 steps to 15k a Day.
-
Re:Publisher's Have a Bug Up Their Ass
"This is an opt-OUT program. Fundamentally, this is flawed. I mean even webpage search engines are opt-in. Your website doesn't get indexed unless you submit it, yet Google are using the webpage parallel as an example of why they should be allowed to proceed with the Print program."
This is actually not true. A site will be indexed without being submitted to Google. All that's needed is for a link to it to be found by Google's crawlers somewhere on the web. Webpage indexing is in fact opt-out. You opt out by using the robots.txt file.
-
Robots.txt?
It should be widely known by any webmaster that you can simply place a robots.txt that disallows everything in your site's root directory, and Google, archive.org, or any major archiving service will leave your whole site alone. No questions asked. Going after Google over something that would take only 5 minutes on your part seems a little overboard.
-
We need new directives in robots.txt
It sounds like we need new directives in robots.txt to categorize what's permitted and what's not permitted regarding caching, archiving, display of contextual text around a search hit, time-to-live, copyright jurisdiction, etc.
If some countries want to implement asinine policies, at least sites should have the decency (and the means) to let global information services, such as Google, respect those policies.
-
Excuse me but...
First and foremost, the existence of a robots.txt does not constitute a contract between the client (a web surfer/browser agent) and the server (the site hosting the content proper). Repeat that over and over. There is nothing stating that a robots.txt on your server must be requested by my crawler or spider.
It's preferred, but not required. Even so, I am free to ignore it if I want, and parse whatever links I see fit to grab. If you make the content public and I want to read that content, I'm going to get it, whether you have robots.txt in place or not.
Secondly, has anyone taken the time to validate the robots.txt file found on the site in question? Note too that they just changed robots.txt on July 8th of this year. Did the previous version validate? Are they trying to rewrite history again? What did the old version look like?
If there is even so much as one error, robots/crawlers are free to ignore/parse/merge/break it as they see fit. It happens all the time, and even when robots.txt is perfectly valid, many robots and crawlers ignore it anyway (msnbot and Yahoo's crawlers are two of the worst offenders here).
But back to the first point, robots.txt is a guideline, not a rule, not a contract, and certainly not something that can be enforced. Does lack of a robots.txt file constitute the legal right to publicly redistribute the content? Or store it for later review and retrieval? How do you know any of your former employees from 1996 haven't stored your entire website on floppy, one page at a time? Did they adhere to robots.txt? Did ANYONE adhere to robots.txt in 1996? It seems that the Robots Exclusion Standard was being evaluated in 1996, but was everyone using it? Not likely.
Microsoft Internet Explorer will certainly store the entire website for "reading offline" if you ask it to do so when bookmarking it. They don't parse robots.txt to exclude pages that shouldn't be stored locally.
It's too bad that people need to try to erase history to prevail in litigation. This isn't George Orwell's 1984... well, at least not yet anyway.
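For anyone who does want to check a robots.txt file for errors, a rough lint pass might look like the sketch below. It is only an illustration: the function name lint_robots and its messages are hypothetical, and it assumes the original two-field format (User-agent and Disallow), so later extensions such as Allow or Sitemap would be reported as unknown fields.

```python
KNOWN_FIELDS = {"user-agent", "disallow"}  # assumption: original two-field format only

def lint_robots(text):
    """Return (line_number, message) pairs for lines that look malformed."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, _value = line.partition(":")
        field = field.strip().lower()
        if not sep:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, "unknown field " + repr(field)))
        elif field == "user-agent":
            seen_agent = True
        elif not seen_agent:
            problems.append((number, "Disallow before any User-agent"))
    return problems

print(lint_robots("User-agent: *\nDisalow: /work/\n"))
```

A clean file yields an empty list. Whether a given crawler ignores a file with such errors or tries to recover from them is up to that crawler, which is the point made above.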
-
Before anyone starts talking about fair use...
You don't even need to get that far to see that Google will win. Here are four reasons why:
- If you don't want a search engine spidering your pictures and news stories, don't put them on the web. If AFP were paper only, Google could not violate their copyright. It saves AFP money to stay offline.
- If AFP decides to pay to go online to make money, they should know the rules of the Internet. First rule about search engines like Google: robots.txt. If they don't want Google to spider them, any half-decent Internet expert they hire would be able to keep Google out of their webspace in the time it takes to type
User-agent: *
Disallow: /
AFP didn't do their homework, and that's a poor way to protect any investment.
- Speaking of investments, even if they somehow managed to stay completely ignorant of search engine operation, anyone who wants to sell something online needs to protect it. This is as easy as adding password accounts. Other online news services do just that.
- Copyright protects the rights of authors so that they can make money. Why should we give them the benefit of governmental protection when it's obvious they don't care about protecting the content themselves enough to use basic measures to do so?
To sum up: AFP, of their own volition, paid to get on the web. They completely ignored RFCs. They ignored standard practices by established companies in their business sector. They wait until $17M in damages accrue, which doesn't happen overnight. Only then do they cry foul, and sue using copyright law to protect something they won't protect themselves when they have the chance. If you were a judge, which way would you rule?
Notice that I didn't even need to talk about fair use rights. France doesn't use the US Constitution. My arguments are purely economic, and I'm fairly sure the French understand money. If any lawyers at Google are reading this, please fight this suit. AFP are being unreasonable, and need to be taught a lesson.
-
I wonder
Have they tried applying a robots.txt file properly first? Wouldn't it be cheaper?
-
Aye, Robots
I hope everyone knows that Google (and other spiders) can be blocked rather easily.
See the URL below for a robots.txt tutorial:
http://www.searchengineworld.com/robots/robots_tutorial.htm
It is still possible to share files on a web server without search engine exposure.
-
Re:Hmm. slashdot's robots.txt
"Disallow: " with nothing after it doesn't disallow everything, but rather disallows nothing. I'm pretty sure it says this somewhere right on the Google site itself, but after much searching, I was only able to find http://www.searchengineworld.com/robots/robots_tutorial.htm and http://www.robotstxt.org/wc/norobots.html
That, and the adsense thing that someone already mentioned.
-
What surprised me most... [OT]
What surprised me most was the URL of the story!
I remember using Excite as my search of choice for full-text searches, back before Yahoo! started charging for everything, including directory listings. Then, there was Webcrawler, once the home of the canonical robots.txt standard.
I even remember back in the day, when not all AltaVistas were created equal.
Then came Google's PigeonRank system, and it's been downhill (or uphill, whichever you see as a positive metaphor) ever since.
So the Excite.com link was a trip down memory lane. Not that I'm expecting the Good Old Days to return; when I tried to access the home page with my Opera browser, I got an error message: "The browser you're using is not allowing you to sign in to Excite." Don't worry, Excite.com... I won't be trying again.
-
Re:Deleting pages won't work
It's easy to remove pages from archive.org. If your domain has a robots.txt file that blocks spiders, then archive.org will not allow people to view those pages either, even if they have been cached in the past.
-
Re:Hey look!
Both of them are identical in that both are opt-out systems that rely on the honor of the implementer to be effective.
Are you suggesting that spammers have less honor than search engine companies? Where's my Pocket Lawyer(tm)... ;)
But seriously, the problem with the robots.txt file is not that search engines don't honor it...they do. The problem is that it's not simple enough for the average person posting a personal website to use properly. To some, it's downright daunting. Until using robots.txt is as easy as clicking an opt-out link or dialing a telephone number, it's really not an effective method of securing personal privacy.
-
Re:who cares?
This subject is well covered at the excellent Search Engine World. In particular this article provides a detailed list of things to do to build a site that will rank well on Google. The short summary though is exactly what you say: real content is critical to top Google rankings.
-
Best Opera Resource
SearchEngineWorld's Gigantic Opera Resource
Notable is their Opera 7 wishlist, which includes a wish for configurable keyboard shortcuts. (yes please)
-
Re:How does Google get away with it?
This hasn't been challenged yet, to my knowledge, but it is at least disputed.
-
Have you tried the new Opera 6?
It leaves IE for dead in many ways.
And now that MS has dumped Netscape plugins, it's even more compatible. Plus it has its own mail, news and ICQ clients built in.
It also gives you the choice of SDI and MDI GUIs.
Only in a couple of small areas does IE do better.
But an ActiveX Netscape plugin is being developed as we speak, so soon Opera will be ActiveX compatible via its Netscape plugin facility.
I admit that Opera 4 was as iffy as hell, but Opera has to be the most improved browser in the last year or so.
Here's the Opera homepage.
This is a great Opera resources FAQs & tips site.
Opera is very configurable, here's how I have it configured
Here's what it looks like without the ad
-
So say no to the robots :)
You can use a little thing called robots.txt - look it up here or here if you don't know what it is.
Allows really useful features like marking given directories, pages, or files off-limits to a specific robot or all robots in general. Boy... a technical solution to a technical problem instead of a new round of lawsuits?
Quickie examples (this is SO simple folks):
User-agent: *
Disallow: /
Boom! No more google telling that horrible world of pirates and thieves about your site. Not many visitors either though....
So maybe you want to exclude just googlebot from your images and image directory with the following:
User-agent: googlebot
Disallow: /image
This will still allow your main pages to be indexed according to your meta keywords, but will disallow any 'napsterization'. Of course, since it requires people running sites to do work and understand technology, lots of people will probably decide lawsuits are easier.
Robots.txt DOES require you to run your own domain. If you don't, try using meta tags in the head of the html code for a similar effect, but it is harder to implement (must be on each page rather than site wide) and less supported. Info here.
If you spend that much time on the images... spend 5 minutes making a robots.txt file to indicate you don't want them taken by bots. But always consider anything you put on the net as published; if something's private, don't put it on the net.
-
Especially since robots.txt lets you disallow this
A little thing called robots.txt - look it up here or here if you don't know what it is.
Allows really useful features like marking given directories, pages, or files off-limits to a specific robot or all robots in general. Boy... a technical solution to a technical problem? Who'd a thunk it?
Quickie examples (this is SO simple folks):
User-agent: *
Disallow: /
Boom! No more google telling that horrible world of pirates and thieves about your site. Not many visitors either though....
So maybe you want to exclude just googlebot from your images and image directory with the following:
User-agent: googlebot
Disallow: /image
If you want to do this for multiple directories, you add on more Disallow lines:
User-agent: *
Disallow: /image
Disallow: /cgi-bin/
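To see how per-robot records and multiple Disallow lines behave, here is a small sketch using Python's standard urllib.robotparser. It combines the two examples above into one googlebot record; the bot name otherbot and the example.com host are invented for the test, and real crawlers may match agent names or paths somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# One record aimed at googlebot only, with two Disallow lines.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /image
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def check(agent, path):
    """Report whether agent may fetch path on a hypothetical host."""
    return parser.can_fetch(agent, "http://example.com" + path)

print(check("googlebot", "/images/cat.jpg"))  # "/image" is a prefix of "/images/"
print(check("googlebot", "/index.html"))
print(check("otherbot", "/images/cat.jpg"))   # no record applies to this bot
```

Disallow values are matched as path prefixes, so "/image" also covers "/images/" and "/image.html", while a robot that no record names is not restricted at all.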
Now if you put
<meta name="robots" content="All,INDEX">
<meta name="revisit-after" content="5 days">
in your code to show up high on the search engines, you shouldn't be surprised or upset when you SHOW UP HIGH ON THE SEARCH ENGINES.
Not all robots follow the robots.txt standard, and there's no way of forcing them to. But Google does, and that seems to be the big concern here.
A real life example, Slashdot's robots.txt file (at slashdot.org/robots.txt):
# robots.txt for Slashdot.org
User-agent: *
Disallow: /index.pl
Disallow: /article.pl
Disallow: /comments.pl
Disallow: /users.pl
Disallow: /search.pl
Disallow: /palm
Disallow: index.pl
Disallow: article.pl
Disallow: comments.pl
Disallow: users.pl
Disallow: search.pl
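One detail worth noticing in that file is that the last five Disallow lines have no leading slash. As a hedged illustration, the sketch below feeds one rule of each kind to Python's standard urllib.robotparser; the bot name examplebot and the query string are made up, and other crawlers may interpret slash-less rules differently.

```python
from urllib.robotparser import RobotFileParser

# One rule with a leading slash and one without, in the style of the file above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /article.pl
Disallow: search.pl
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def blocked(url):
    """True when the hypothetical examplebot may not fetch url."""
    return not parser.can_fetch("examplebot", url)

print(blocked("http://slashdot.org/article.pl?sid=1"))
print(blocked("http://slashdot.org/search.pl"))
print(blocked("http://slashdot.org/"))
```

Under this parser, request paths always begin with "/", so the rule "search.pl" never matches anything, while "/article.pl" also covers the same script with a query string attached. The front page stays crawlable either way.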