Follow Up on Google Favoring Yahoo
After yesterday's story about Google favoring Yahoo links, I got word from Sergey Brin of Google. He says the tested site ranked so poorly because a robots.txt file prevented Google's crawler from fully indexing it. The robots.txt file has since disappeared, and the next index should show a change in the rankings.
If robots.txt was there, how did Google index the site at all (instead of just poorly)?
<O
( \
X GNOME vs. KDE: the game!
Will I retire or break 10K?
If you don't know what robots.txt is, look at A Method for Web Robots Control Internet RFC...
-- A hundred thousand lemmings can't be wrong!
Gee, I wonder how many problems in the world could be solved if people put a little bit of effort into communicating with each other. Rather than asking Google what's up, the guys in the story yesterday put MONTHS of effort into proving how they're getting shafted by Google's search engine. They make accusations.
Google hears about it via Slashdot, and in less than 24 hours, the real reason is revealed.
Kinda makes me wonder at humanity, when we're all so locked into our own little shells that we occupy ourselves trying to prove something that five minutes of talking could solve. Sort of like how most Americans never say hello to their neighbor, and can live next to them for years without ever exchanging niceties.
The robot exclusion protocol (http://info.webcrawler.com/mak/projects/robots/norobots.html) is a way for websites to tell robots what they shouldn't be crawling. When a robot wants to crawl http://foo.bar.com/ it will first fetch http://foo.bar.com/robots.txt. If that file does NOT exist, that is taken to mean implicit permission to crawl anything it can find on that site. If it does exist, then the patterns contained in it are used to restrict what portions of that site are crawled. Every site has its own robots.txt (or lack thereof). To look at Yahoo's robots.txt, just point your browser to http://www.yahoo.com/robots.txt.
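For anyone who wants to see that decision spelled out, here is a rough sketch in Python using the standard library's urllib.robotparser. The rules text, the helper name allowed() and the crawler name ExampleBot are all made up for illustration; a real crawler would fetch the file from the site instead of parsing a hardcoded string, and an empty string stands in here for a missing robots.txt.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper: an empty string plays the role of "no robots.txt".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Made-up rules: every robot is asked to stay out of /private/.
RULES = """\
User-agent: *
Disallow: /private/
"""

# No robots.txt at all: implicit permission to crawl.
print(allowed("", "ExampleBot", "http://foo.bar.com/index.html"))         # True
# robots.txt exists, but this path is not restricted.
print(allowed(RULES, "ExampleBot", "http://foo.bar.com/index.html"))      # True
# robots.txt exists and the path matches a Disallow pattern.
print(allowed(RULES, "ExampleBot", "http://foo.bar.com/private/a.html"))  # False
```

Note that the check happens entirely on the robot's side, so it only works if the robot bothers to run it.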
If a site has a robots.txt that is telling the robots not to crawl, they have no business yelling at search engines when their pages don't show up.
After reading all those great flames from yesterday I think this is a good time to apologize. Simple mistake, no conspiracy. Now show them that you are human and admit you were in error!
----------
Geeks make mistakes to!
What's interesting is that sometimes people look for the robots.txt file to find hidden directories on a server. Hmmm... /journal, /naked_school_girls, /personal_finances...
--
Wooden armaments to battle your imaginary foes!
Do you really believe you have a reasonable expectation of privacy? You put it online for the world to see. Just because some parties are a little more interested than others doesn't mean they're violating your privacy.
As for searching beyond the request of robots.txt files and _really aggressively_ searching, that strikes me as being something of a different issue. It seems to me that robots.txt is more of a practical and protective issue than one of privacy. It's more of a request not to bother you than a request for privacy, at least in my opinion. Also, failure to adequately process and obey robots.txt can easily be the fault of programming error or ignorance, not necessarily a willful or particularly unreasonable act. One need not necessarily take special measures to circumvent its intention.
This is not to say that I can't sympathize with parties that get hammered by such spiders, but I don't believe the privacy argument per se holds any water. I see legitimate complaints on both sides of the issue. For instance, let's say you're a software company and you find a LINKED and self-proclaimed warez page, but the hosting site doesn't allow spidering. Is that still so criminal? Even if the desire is to simply catalogue and document all of it?
http://www.lib.uiowa.edu/hardin/md/notes7a.html
It's actually pretty simple, really. The reason the site in question would have plummeted is that as Google is updating its stats, it probably makes some allowances for screwups and inability to reach a given site. However, after a time, the fact that Google was not allowed to search the page must have some sort of impact, and probably an exponential one. "OK, not here, probably a screw up, but we can't verify the search terms will be there" happened at the beginning, and eventually, as it aged out of relevance, it became "Well, lots of people think this page is good, but it's just not there!" from Google's perspective.
That makes sense.
Now, we know Google weights other sites by the weight of the sites that link to them. As the original directory started sliding, anything it linked to started sliding as well. Which means Yahoo! fills the void. Particularly in such a specialized example where your likelihood of getting a good match is based on a few key sites.
--
Ben Kosse
Remember Ed Curry!
A robots.txt file is used to control web page indexing done by autonomous search engines. It states which search engines are allowed and what they may index. It is somewhat advisory in nature, in that a rogue search engine may disregard that information and do what it pleases, but it may suffer the wrath of the owners of that website or others if this is done too often.
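To make the "which search engines" part concrete, here is a small sketch in Python with the standard urllib.robotparser module. The robot names BadBot and GoodBot, the host www.example.com and the helper may_index() are invented for the example; the point is only that one file can carry different rules for different robots.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules: one named robot is shut out entirely,
# everyone else is only asked to skip /cgi-bin/.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def may_index(agent, path):
    """Hypothetical helper: ask the parsed rules about one path on an example host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)

print(may_index("BadBot", "/index.html"))      # False, BadBot is excluded everywhere
print(may_index("GoodBot", "/index.html"))     # True, falls under the * rules
print(may_index("GoodBot", "/cgi-bin/search")) # False, /cgi-bin/ is off limits to all
```

Nothing here stops BadBot from fetching the pages anyway, which is the advisory part.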
After all, if there was no "crime" to complain about, and any "damage" was done by themselves to themselves, this never merited one story let alone two.
Since no lawyers were involved, it's not a case where "the lawyers won" (as is often seen in big, bloody trials); instead, it could be said that "the journalists won," as they got a bunch of blather out of no real story.
If you're not part of the solution, you're part of the precipitate.
Considering that Yahoo! is compiled by humans, not robots, it would be kind of insulting to expect them all to "parse" robots.txt.
MSK
See this link for more information.
--
Wooden armaments to battle your imaginary foes!
Even with a robots.txt containing:
User-agent: *
Disallow: /
I continue to receive spidering from companies such as NetCurrents and Cyveillance because it is easy to ignore robots.txt. Rude, yes -- but easy. It is also easy for me to deny access via Apache, but bots from companies such as the above mentioned continue aggressive spidering.
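A side note for anyone writing a blanket rule: "Disallow: /" and a bare "Disallow:" mean opposite things, so it is worth checking which one the file actually says. A quick sketch with Python's standard urllib.robotparser (the helper can_crawl(), the robot name ExampleBot and the host are placeholders of mine, not anything from the spiders named above):

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt, url, agent="ExampleBot"):
    """Hypothetical helper: what a well-behaved robot would conclude from this text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # asks every robot to stay out
ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is disallowed

print(can_crawl(BLOCK_ALL, "http://www.example.com/page.html"))  # False
print(can_crawl(ALLOW_ALL, "http://www.example.com/page.html"))  # True
```

Either way, the answer only matters to a robot that asks the question. A spider that never calls anything like can_crawl() will walk straight past the file, which is why a server-side block is the only thing that actually enforces it.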
It seems that standards (such as those for robots.txt) are useless, particularly for companies who spider the Net in search of copyright/trademark violations.
Granted, some companies have an interest in policing their products, but when do they go too far? Wouldn't deliberate, aggressive spidering into areas of my site where I have instituted restrictions or blocking constitute some sort of invasion of privacy? If a government entity is doing the spidering, wouldn't a search warrant be required?