Follow Up on Google Favoring Yahoo


Posted by ryuzaki0 on Thursday September 14, 2000 @06:44AM from the heres-the-skinny-dept dept.

After yesterday's story about Google favoring Yahoo links, I got word from Sergey Brin of Google. He says the reason the site tested showed so poorly is that a robots.txt file prevented Google's crawler from fully indexing the site. The robots.txt file has since disappeared, and the next index should show a change in the rankings.

How Google indexed even the excluded parts by yerricde · 2000-09-14 05:44 · Score: 3
If robots.txt was there, how did Google index the site at all (instead of just poorly)?
- The presence of robots.txt doesn't automatically exclude everything, only the directories specified in the file.
- Google can index even robots-excluded pages by looking at the 50 or so characters on either side of the link on pages that point to the excluded pages. That's why Google sometimes gives URLs without any content.
GNOME vs. KDE: the game!
--
Will I retire or break 10K?
Don't know what robots.txt is? by dale@redhat.com · 2000-09-14 02:02 · Score: 3

If you don't know what robots.txt is, look at A Method for Web Robots Control Internet RFC...

-- A hundred thousand lemmings can't be wrong!
Morons, all of 'em. by Xzzy · 2000-09-14 02:08 · Score: 5

Gee, I wonder how many problems in the world could be solved if people put a little bit of effort into communicating with each other. Rather than asking Google what's up, the guys in the story yesterday put MONTHS of effort into proving how they're getting shafted by Google's search engine. They make accusations.

Google hears about it via Slashdot, and in less than 24 hours, the real reason is revealed.

Kinda makes me wonder at humanity, when we're all so locked into our own little shells that we occupy ourselves trying to prove something that five minutes of talking could solve. Sort of like how most Americans never say hello to their neighbor, and can live next to them for years without ever exchanging niceties.
Robot Exclusion Protocol by Anonymous Coward · 2000-09-14 02:10 · Score: 3

The robot exclusion protocol (http://info.webcrawler.com/mak/projects/robots/norobots.html) is a way for websites to tell robots what they shouldn't be crawling. When a robot wants to crawl http://foo.bar.com/ it will first fetch http://foo.bar.com/robots.txt. If that file does NOT exist, that is taken to mean implicit permission to crawl anything it can find on that site. If it does exist, then the patterns contained in it are used to restrict what portions of that site are crawled. Every site has its own robots.txt (or lack thereof). To look at Yahoo's robots.txt, just point your browser to http://www.yahoo.com/robots.txt.
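
The check described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration, not how Google's crawler works: the rules text, the ExampleBot agent name and the may_crawl helper are made up for the example, and the rules are parsed from a string instead of being fetched over the network. An empty rules string stands in for a site that restricts nothing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for http://foo.bar.com/ (not a real site's file).
RULES = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if a compliant robot named `agent` may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the lines of the file; nothing is downloaded here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://foo.bar.com/index.html",
                "http://foo.bar.com/private/notes.html"):
        print(url, may_crawl(RULES, "ExampleBot", url))
```

Only the /private/ directory is off limits here; everything else on the site stays crawlable.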

If a site has a robots.txt that is telling the robots not to crawl, they have no business yelling at search engines when their pages don't show up.
ok now say you're sorry everyone by Emugamer · 2000-09-14 01:49 · Score: 3

After reading all those great flames from yesterday I think this is a good time to apologize. Simple mistake, no conspiracy. Now show them that you are human and admit you were in error!

----------
Geeks make mistakes to!
Re:robots.txt ? by don_carnage · 2000-09-14 01:50 · Score: 5

What's interesting is that sometimes people look for the robots.txt file to find hidden directories on a server. Hmmm... /journal, /naked_school_girls, /personal_finances...

--
Wooden armaments to battle your imaginary foes!
Reasonable expectation of privacy? by FallLine · 2000-09-14 02:23 · Score: 4

Do you really believe you have a reasonable expectation of privacy? You put it online for the world to see. Just because some parties are a little more interested than others doesn't mean they're violating your privacy.

As for searching beyond what a robots.txt requests and _really aggressively_ searching, that strikes me as being something of a different issue. It seems to me that robots.txt is more of a practical and protective measure than it is one of privacy. It's more of a request not to bother you than it is a request for privacy, at least in my opinion. Also, failure to adequately process and obey robots.txt can easily be the fault of programming error or ignorance, not necessarily a willful or particularly unreasonable act; one need not necessarily take special measures to circumvent its intention.

This is not to say that I can't sympathize with parties that get hammered by such spiders, but I don't believe the privacy argument per se holds any water. I see legitimate complaints on both sides of the issue. For instance, let's say you're a software company and you find a LINKED and self-proclaimed warez page, but the hosting site doesn't allow spidering. Is that still so criminal? Even if the desire is to simply catalogue and document all of it?
Partial retraction from MedWebPlus by Frac · 2000-09-14 02:23 · Score: 5

Here's a partial retraction from MedWebPlus (they admit they now know why their rankings dropped, but they still question why Yahoo is on the rise):
http://www.lib.uiowa.edu/hardin/md/notes7a.html
Explanation why robots.txt file affects ordering by bkosse · 2000-09-14 02:26 · Score: 4

It's actually pretty simple, really. The reason the site in question would have plummeted is that as Google is updating its stats, it probably makes some allowances for screwups and inability to reach a given site. However, after a time, the fact that Google was not allowed to search the page must have some sort of impact, and probably an exponential one. "OK, not here, probably a screw up, but we can't verify the search terms will be there" happened at the beginning, and eventually, as the page aged out of relevance, it became "Well, lots of people think this page is good, but it's just not there!" from Google's perspective.
That makes sense.
Now, we know Google weights other sites by the weight of the site that links to them. As the original directory started sliding, anything it linked to started sliding as well, which means Yahoo! fills the void, particularly in such a specialized example where your likelihood of getting a good match is based on a few key sites.

--
Ben Kosse
Remember Ed Curry!
Re:robots.txt by baywulf · 2000-09-14 01:53 · Score: 3

A robots.txt file is used to control web page indexing done by autonomous search engines. It states which search engines are allowed and what they may index. It is somewhat advisory in nature: a rogue search engine may disregard that information and do as it pleases, but it may suffer the wrath of the owners of that website or others if this is done too often.
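
To make the "which search engines are allowed" part concrete, here is a rough sketch of per-robot rules, again using Python's standard urllib.robotparser. The GoodBot and OtherBot names and the example.com URL are invented for illustration; an empty Disallow line means that robot may index everything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named robot is welcome, every other robot is shut out.
PER_AGENT_RULES = """\
User-agent: GoodBot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(PER_AGENT_RULES.splitlines())

PAGE = "http://example.com/page.html"
good_allowed = parser.can_fetch("GoodBot", PAGE)    # matched by its own group
other_allowed = parser.can_fetch("OtherBot", PAGE)  # falls through to the * group

if __name__ == "__main__":
    print("GoodBot:", good_allowed, "OtherBot:", other_allowed)
```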
The Implications being... by Christopher+B.+Brown · 2000-09-14 02:29 · Score: 3
- ... The real findings of such a research project would be answers to the question: Which search engines ignore the robots.txt file? Any "inclusions" would then represent either:
  - Search engines that don't respider very often, thus providing obsolete data, or
  - Search engines that ignore requests not to spider, and that thus are bad Internet "citizens."
- ... That this was a very successful "troll" for discussion on the part of both the research group as well as the operators of Slashdot.
  After all, if there was no "crime" to complain about, and any "damage" was done by themselves to themselves, this never merited one story let alone two.
  Since no lawyers were involved, it's not a case where "the lawyers won" (as is often seen in big, bloody trials); instead, it could be said that "the journalists won," as they got a bunch of blather out of no real story.
--
If you're not part of the solution, you're part of the precipitate.
Re:So what about yahoo? by kaphka · 2000-09-14 02:42 · Score: 3

Considering that Yahoo! is compiled by humans, not robots, it would be kind of insulting to expect them all to "parse" robots.txt.

--
MSK
Re:robots.txt by don_carnage · 2000-09-14 01:56 · Score: 3

The robots.txt file is used at the web-root to prevent search engines from indexing certain parts of your website -- not the whole site altogether.
See this link for more information.

--
Wooden armaments to battle your imaginary foes!
what good is a robots.txt nowadays... by 2quam4 · 2000-09-14 02:00 · Score: 4

Even with a robots.txt containing:
User-agent: *
Disallow: /
I continue to receive spidering from companies such as NetCurrents and Cyveillance because it is easy to ignore robots.txt. Rude, yes -- but easy. It is also easy for me to deny access via Apache, but bots from companies such as those mentioned above continue aggressive spidering.
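
Why ignoring it is so easy: the check lives entirely in the crawler. A rough sketch in Python, using the standard urllib.robotparser module (the polite_targets helper, the AnyBot name and the example.com URLs are hypothetical), shows that a compliant robot filters its own URL list, and nothing on the server side forces a rude one to do the same.

```python
from urllib.robotparser import RobotFileParser

# The two-line file quoted above: every robot, whole site off limits.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def polite_targets(robots_txt, agent, urls):
    """Return only the URLs a robots.txt-respecting crawler would fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(agent, u)]


if __name__ == "__main__":
    wanted = ["http://example.com/", "http://example.com/journal/"]
    # A polite robot ends up with nothing to fetch; a rude one just skips this step.
    print(polite_targets(BLOCK_ALL, "AnyBot", wanted))
```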

It seems that standards (such as those for robots.txt) are useless, particularly for companies who spider the Net in search of copyright/trademark violations.

Granted, some companies have an interest in policing their products, but when do they go too far? Wouldn't deliberate/aggressive spidering into areas of my site on which I have instituted restrictions/blocking constitute some sort of invasion of privacy? If a government entity is doing the spidering, wouldn't a search warrant be required?