Follow Up on Google Favoring Yahoo
After yesterday's story about Google favoring Yahoo links, I got word from Sergey Brin of Google. He says the reason the tested site showed so poorly is that a robots.txt file prevented Google's crawler from fully indexing the site. The robots.txt file has since disappeared, and the next index should show a change in the rankings.
A robot not respecting robots.txt is certainly in the class of unauthorized use. So if Michigan's law catches on across the U.S., maybe there will be some real protection for web admins to protect sites or directories from being indexed. Slap the bot company with a felony! Maybe the law isn't so odious after all.
No, Thursday's out. How about never - is never good for you?
If their robots will not honor your robots.txt then you do not have to honor their robots nor give them useful information. You could detect them and feed them random responses -- either the types of responses which they do like or the types which they don't like. 43,000 links to metallica -- which when an expensive human looks at them will be found to be artwork made with glitter-covered glue...
And this may be where the cause and effect of the Yahoo/Google agreement comes into play. Before there was an agreement between Yahoo and Google, Yahoo would have some reason not to want Google to be spidering their site. After all, you don't want your competitor to take advantage of your hard work. After the agreement, though, they would certainly want Google to spider their site, since they now want to show up as well as possible on Google. The result is that Google is taken off their robots.txt (and we now know that Google is polite and obeys robots.txt) and their rankings start shooting up.
There's no point in questioning authority if you aren't going to listen to the answers.
No, it means that Yahoo!'s robots.txt doesn't block crawlers from 100% of their site like MedWeb was doing:
http://www.yahoo.com/robots.txt
User-agent: *
Disallow: /gnn
Disallow: /msn
Disallow: /pacbell
Disallow: /pb
# Rover is a bad dog <http://www.roverbot.com>
User-agent: Roverbot
Disallow: /
So they let just about anybody index most of their site, except for the listed exceptions (except roverbot, he is a bad dog :-) ). Google apparently wasn't indexing their whole site for some other reason; now, as a result of the new agreement, they are indexing 100%.
The presence of a robots.txt file doesn't block crawlers by default. The bots are supposed to look at the contents of robots.txt and follow the rules.
There is much cruelty in the universe, John.
Yeah, we seem to have the tour map.
I know not everyone knows how Search Engines work, and mostly you don't need to know. Everyone who has a page on the web should read "A Standard for Robot Exclusion", though. It's been a standard since 30 June 1994, and that's not bad for an Internet standard.
I assure you that Google.com follows it to the letter. All the main SEs do. If they didn't, they might even be leaving themselves open to legal challenges. Read the old mailing lists at Webcrawler (search for "robots.txt" on Google) and you'll see that people used to get quite wound up by rude SEs back in '94. A Web server's CPU time was worth something then.
As for all the lone gunmen out there cooking up theories... read this. Google has ALREADY sold the top links for some keywords. They don't hide it; read the FAQ on their site and you'll find the address to write to to buy listings. Maybe you should read the Demographics. You're the market being sold. Seems fair to me.
The actual search results (not the adverts) are genuine and not sold. Makes sense... consider the whole Google model (who links to you affects your ranking) and it's clear Yahoo, Disney etc. will all rank very highly. Lots of links into them because they are quality sites.
I've done a lot of work with SEs over the years and Google is far more genuine than anyone else in the market, but they have to make ends meet.
Take a look at this also. Can we spot the paid for listings yet?
0daymeme.com: Great stuff.
If robots.txt was there, how did Google index the site at all (instead of just poorly)?
<O
( \
XGNOME vs. KDE: the game!
Will I retire or break 10K?
Some of these poorly written programs check the robots.txt file every 5 minutes when they're in a spidering mood. Nice. You've got to wonder how much bandwidth is wasted due in part to moronic programming practice.
Many spiders (e.g. Googlebot) are distributed among many colocated boxen so they can get better network performance. Each box needs its own copy of robots.txt so it can choose whether or not to index pages and follow links. Read your server logs again; are all the robots.txt hits from the same IP address, or are they from different machines?
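For anyone who wants to check their own logs as suggested, here is a rough sketch that counts robots.txt requests per client address. It assumes the server writes Common Log Format; the function name and the sample lines (using documentation-only IP addresses) are made up for illustration.

```python
import re
from collections import Counter

# Start of a Common Log Format line: client, ident, user, [timestamp], "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)')


def robots_hits_by_client(lines):
    """Count requests for /robots.txt per client address (hypothetical helper)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        path = match.group(3).split("?", 1)[0]  # ignore any query string
        if path == "/robots.txt":
            hits[match.group(1)] += 1
    return hits


# Made-up sample log lines, not real traffic.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Jul/2000:10:00:00 -0400] "GET /robots.txt HTTP/1.0" 200 120',
    '192.0.2.11 - - [10/Jul/2000:10:00:05 -0400] "GET /robots.txt HTTP/1.0" 200 120',
    '192.0.2.10 - - [10/Jul/2000:10:05:00 -0400] "GET /robots.txt HTTP/1.0" 200 120',
    '192.0.2.10 - - [10/Jul/2000:10:05:01 -0400] "GET /index.html HTTP/1.0" 200 5120',
]

for client, count in robots_hits_by_client(SAMPLE_LOG).most_common():
    print(client, count)
```

If the hits come from many different addresses, that points to a distributed crawler rather than one box re-fetching the file over and over.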
<O
( \
XGNOME vs. KDE: the game!
Will I retire or break 10K?
Uhm, because it was a troll?
Neopets - the best free game on the Int
If you don't know what robots.txt is, look at A Method for Web Robots Control Internet RFC...
-- A hundred thousand lemmings can't be wrong!
Gee, I wonder how many problems in the world could be solved if people put a little bit of effort into communicating with each other. Rather than asking Google what's up, the guys in the story yesterday put MONTHS of effort into proving how they're getting shafted by Google's search engine. They make accusations.
Google hears about it via Slashdot, and in less than 24 hours, the real reason is revealed.
Kinda makes me wonder at humanity, when we're all so locked into our own little shells that we occupy ourselves trying to prove something that five minutes of talking could solve. Sort of like how most Americans never say hello to their neighbor, and can live next to them for years without ever exchanging niceties.
The robot exclusion protocol (http://info.webcrawler.com/mak/projects/robots/norobots.html) is a way for websites to tell robots what they shouldn't be crawling. When a robot wants to crawl http://foo.bar.com/ it will first fetch http://foo.bar.com/robots.txt. If that file does NOT exist, that is taken to mean implicit permission to crawl anything it can find on that site. If it does exist, then the patterns contained in it are used to restrict what portions of that site are crawled. Every site has its own robots.txt (or lack thereof). To look at Yahoo's robots.txt, just point your browser to http://www.yahoo.com/robots.txt.
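To make the check concrete, here is a minimal sketch using Python's standard urllib.robotparser. The rules are the ones quoted in this thread for Yahoo's file (the live file may differ), "ExampleBot" is a made-up user agent, and the helper names are mine. It parses the text directly instead of fetching anything.

```python
from urllib.robotparser import RobotFileParser

# Rules as quoted in this thread; treat them as sample data, not the live file.
YAHOO_RULES = """\
User-agent: *
Disallow: /gnn
Disallow: /msn
Disallow: /pacbell
Disallow: /pb

# Rover is a bad dog <http://www.roverbot.com>
User-agent: Roverbot
Disallow: /
"""


def build_parser(rules_text):
    """Parse robots.txt text without touching the network (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


def may_crawl(parser, agent, url):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


parser = build_parser(YAHOO_RULES)
for agent, url in [
    ("ExampleBot", "http://www.yahoo.com/Science/"),
    ("ExampleBot", "http://www.yahoo.com/pb/index.html"),
    ("Roverbot", "http://www.yahoo.com/Science/"),
]:
    print(agent, url, may_crawl(parser, agent, url))
```

A polite robot would run a check like this before every fetch; nothing on the server side enforces it.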
If a site has a robots.txt that is telling the robots not to crawl, they have no business yelling at search engines when their pages don't show up.
If any of you web admin gurus (I know you're reading) have any ideas of how to deal with these programs, I could really use some help. I'd like to detect them and feed them the files at a controlled bandwidth.
I find these archiver programs usually (but not always) behave much worse than any robot... often times they completely saturate my bandwidth for many minutes. Not nice.
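One common way to serve files at a controlled bandwidth is a token bucket. The sketch below is only the bookkeeping, with made-up names and numbers; hooking it into a real web server (and deciding which clients count as archivers) is left out and would depend on your setup.

```python
class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `capacity` bytes."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def delay_for(self, nbytes, now):
        """Charge for `nbytes` and return how many seconds the sender should wait."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, but never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # A negative balance is a debt that is paid off at `rate` bytes per second.
        return -self.tokens / self.rate


# Hypothetical limit: 1000 bytes/s per client, bursts of up to 2000 bytes.
bucket = TokenBucket(rate=1000, capacity=2000)
print(bucket.delay_for(2000, now=0.0))
print(bucket.delay_for(1000, now=0.0))
```

The caller would keep one bucket per client address and sleep for the returned delay before writing the next chunk.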
PJRC: Electronic Projects, 8051 Microcontroller Tools
You could build a trap for such crawlers in the form of randomly generated HTML documents which each reference a few more fake URLs which generate more random HTML documents... Disallow that tree in your robots.txt and let the robots who disregard it suffer.
:-)
The best random document generator would be a Markov chainer which had been fed all of the top level category pages from Yahoo! to make sure you have lots of juicy keywords to index.
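A rough sketch of that idea: a first-order Markov chain over words, plus a page generator that links to more fake pages. The corpus string, the "/trap/" prefix and all function names are invented for illustration; a real trap would feed in much more text and serve the pages from a path disallowed in robots.txt.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in `text`."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def babble(chain, length, rng):
    """Generate `length` words by walking the chain (needs a non-empty chain)."""
    starts = sorted(chain)
    word = rng.choice(starts)
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end (word never had a successor): restart from a random word.
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)


def trap_page(chain, rng, prefix="/trap/", links=3, words=40):
    """Build one fake HTML page that links to `links` further fake pages."""
    body = babble(chain, words, rng)
    anchors = "".join(
        '<a href="%s%08x.html">more</a>\n' % (prefix, rng.getrandbits(32))
        for _ in range(links)
    )
    return "<html><body><p>%s</p>\n%s</body></html>" % (body, anchors)


# Made-up sample corpus standing in for real category pages.
CORPUS = "arts and humanities business and economy computers and internet news and media"
print(trap_page(build_chain(CORPUS), random.Random()))
```

Every generated link points back into the same prefix, so a robot that ignores the Disallow line never runs out of pages.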
This may be a trivial question, but I'd really like an answer:
Can search engines find and index pages (html, php, etc.) that are not explicitly linked from the starting index.*htm* page in a given directory?
Put another way, can a search engine find my directory /web/foo/bar and then index the page opus.html, even though neither the directory nor the file are referenced or mentioned in any of the "public" files?
I ask because I was using non-referenced pages (can only be found by knowing the address) as part of a way to limit access to certain files to specific people.
I hope someone can provide some insight into this issue.
Thanks
-----
D. Fischer
ShoutingMan.com
From the original article's author's "partial retraction":
"...Google reportedly says that they are now crawling *all* of Yahoo! as part of their agreement..."
http://www.lib.uiowa.edu/hardin/md/notes7a.html
No real big mystery: Google wasn't indexing all of Yahoo's content before for some reason, now they are. If Yahoo went to all the trouble of pushing a pile of money at Google to be their search engine, why wouldn't they expect them to index all of their content?
There is much cruelty in the universe, John.
Yeah, we seem to have the tour map.
"Do you really believe you have a reasonable expectation of privacy? You put it online for the world to see." That's an interesting point, which I would have agreed with a few months back. Now that I have my own website, though, my attitude has changed. In my mind, I have leased a service by which I can make materials available to various people via a global computer network. That means that I have the right to restrict who sees what. The majority of my online info is freely available for the world to see. But there is information that is meant for a specific group of people. Thus, I've given the URL to only those people who should have access. Some of it password protected as well. Could certain unsavory types get to that info, despite my precautions? Probably, but I don't that think that merely putting it on an online computer automatically gives them that right. <Bad Analogy>I lease an apartment which is visible to the world, and anyone can access the foyer. But that does not implicitly confer the right for anyone to enter my apartment and go through my belongings. And just because anyone can get into the foyer doesn't mean that have the right to read my magazines that are there because they don't fit in the mailboxes. If they want access to that material and my belongings, they can call me or 'buzz' me and ask to be let in.</Bad Analogy> Put another way, eavesdropping is bad form even in the online world.
-----
D. Fischer
ShoutingMan.com
Seriously, though, I have a question: If robots.txt was there, how did Google index the site at all (instead of just poorly)?
Got Rhinos?
After reading all those great flames from yesterday I think this is a good time to apologize. Simple mistake, no conspiracy. Now show them that you are human and admit you were in error!
----------
Geeks make mistakes to!
Search engines (and any webcrawling 'bots') don't index sites where they find a 'robots.txt' file. This is called the Robot Exclusion Principle.
If you run a web site, check your error log for notes to that effect. (you'll get a random bot from, say, 'inktomi' or something, and they'll check for a robots.txt file, they don't find it, you get a message in your error log, and then your site gets crawled...)
---
pb Reply or e-mail; don't vaguely moderate.
What's interesting is that sometimes people look for the robots.txt file to find hidden directories on a server. Hmmm... /journal, /naked_school_girls, /personal_finances...
--
Wooden armaments to battle your imaginary foes!
I doubt it. You have to take the attitude that if you have something on an open webserver, people can see it. If you don't want a spider hitting your site, ban the subnet that it comes from. If the data is something you don't want the government or anyone else to see, don't place it in plain view.
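For what it's worth, the subnet check itself is simple. Here is a small sketch with Python's standard ipaddress module; the banned ranges are documentation-only example networks, not real spider addresses, and in practice you would more likely put the equivalent rule in your web server or firewall config.

```python
import ipaddress

# Hypothetical ban list; these are reserved documentation ranges.
BANNED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_banned(client_ip, banned=BANNED_NETWORKS):
    """Return True if `client_ip` falls inside any banned network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in banned)


print(is_banned("192.0.2.77"))
print(is_banned("203.0.113.5"))
```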
So fine, they didn't do that - now explain why Yahoo's rankings shot UP? I heard a few plausible and non-evil theories on how this happened, but I want to hear it from Yahoo.
It's rare that you're presented with a knob whose only two positions are Make History and Flee Your Glorious Destiny.
Do you really believe you have a reasonable expectation of privacy? You put it online for the world to see. Just because some parties are a little more interested than others doesn't mean they're violating your privacy.
As for searching beyond the request of robots.txt's and _really aggressively_ searching, that strikes me as being something of a different issue. It seems to me that robots.txt is more of a practical and protective issue than it is one of privacy. It's more of a request not to bother you than it is a request for privacy, at least in my opinion. Also, failure to adequately process and obey robots.txt can easily be the fault of programming error or ignorance, not necessarily a willful or particularly unreasonable act--one need not necessarily take special measures to circumvent its intention.
This is not to say that I can't sympathize with parties that get hammered by such spiders, but I don't believe the privacy argument per se holds any water. I see legitimate complaints on both sides of the issue. For instance, let's say you're a software company and you find a LINKED and self-proclaimed warez page, but the hosting site doesn't allow spidering. Is that still so criminal? Even if the desire is to simply catalogue and document all of it?
http://www.lib.uiowa.edu/hardin/md/notes7a.html
It's actually pretty simple, really. The reason the site in question would have plummeted is that as Google is updating its stats, it probably makes some allowances for screwups and inability to reach a given site. However, after a time, the fact that Google was not allowed to search the page must have some sort of impact, and probably an exponential one. "OK, not here, probably a screw up, but we can't verify the search terms will be there" happened at the beginning and eventually, as it aged out of relevance, it became "Well, lots of people think this page is good, but it's just not there!" from Google's perspective.
That makes sense.
Now, we know Google weights other sites by the weight of the site that links them. As the original directory started sliding, anything it linked to starts sliding as well. Which means Yahoo! fills the void. Particularly in such a specialized example where your likelihood of getting a good match is based on a few key sites.
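To illustrate the "weight flows through links" idea, here is a simplified PageRank-style iteration. It is only a sketch of the published idea, not Google's actual ranking code, which is not public; the page names and link graph are invented, and the damping value of 0.85 is the commonly cited one.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Iteratively spread each page's weight over the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its weight evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Invented example graph: two sources link to site_b, only one links to site_a.
LINKS = {
    "directory": ["site_a", "site_b"],
    "portal": ["site_b"],
    "site_a": [],
    "site_b": [],
}
for page, score in sorted(link_rank(LINKS).items()):
    print(page, round(score, 4))
```

In a model like this, a page's score depends on the scores of the pages linking to it, so when a heavily weighted directory loses weight, the sites it points to lose some too.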
--
Ben Kosse
Remember Ed Curry!
Robots can't find things not linked to.
Good robots obey /robots.txt.
Bad robots use /robots.txt to find juicy things.
So...
Create /youfucker, don't link to it anywhere,
deny access to it explicitly in /robots.txt, and
install Wpoison, freely available at
http://www.e-scrub.com/wpoison/
Fix your web server to take requests into /youfucker and feed them to Wpoison.
Too bad for Mister Bad Robot.
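The same trap doubles as a detector: since nothing links to the directory and robots.txt disallows it, any client that requests something under it has either ignored robots.txt or mined it for paths. A tiny sketch, with a placeholder "/trap/" prefix and invented (client, path) pairs standing in for parsed log entries:

```python
def find_bad_robots(requests, trap_prefix="/trap/"):
    """Return the set of clients that requested anything under `trap_prefix`."""
    return {client for client, path in requests if path.startswith(trap_prefix)}


# Made-up request records using documentation-only addresses.
SAMPLE_REQUESTS = [
    ("192.0.2.10", "/robots.txt"),
    ("192.0.2.10", "/index.html"),
    ("198.51.100.7", "/robots.txt"),
    ("198.51.100.7", "/trap/"),
    ("198.51.100.7", "/trap/0001.html"),
]

print(sorted(find_bad_robots(SAMPLE_REQUESTS)))
```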
The default (no robots.txt) is to crawl your site. If you have a robots.txt, it follows the rules therein.
http://www.searchtools.com/robots/robots-txt.html :>
List of rules - found with google.
-- perl -e'print pack"H*","6e656d6f406d38792e6f7267"'
After all, if there was no "crime" to complain about, and any "damage" was done by themselves to themselves, this never merited one story let alone two.
Since no lawyers were involved, it's not a case where "the lawyers won" (as is often seen in big, bloody trials); instead, it could be said that "the journalists won," as they got a bunch of blather out of no real story.
If you're not part of the solution, you're part of the precipitate.
yes they do. And Rover is a bad dog.
No, Thursday's out. How about never - is never good for you?
Considering that Yahoo! is compiled by humans, not robots, it would be kind of insulting to expect them all to "parse" robots.txt.
MSK
No. It means that yahoo doesn't have a robots.txt file. Think about it.
Even with robots.txt utilizing:
User-agent: *
Disallow: /
I continue to receive spidering from companies such as NetCurrents and Cyveillance because it is easy to ignore robots.txt. Rude, yes -- but easy. It is also easy for me to deny access via Apache, but bots from companies such as the above-mentioned continue aggressive spidering.
It seems that standards (such as those for robots.txt) are useless, particularly for companies who spider the Net in search of copyright/trademark violations.
Granted, some companies have an interest in policing their products, but when do they go too far? Wouldn't deliberate/aggressive spidering into areas of my site where I have instituted restrictions/blocking constitute some sort of invasion of privacy? If a government entity is doing the spidering, wouldn't a search warrant be required?