Matt Cutts has debunked this story, and Google's AdWords team has also posted to their blog to debunk this.
I think it's funny that people beat up on Google for buying ads, when Yahoo just takes the screen real estate for free. Try a search for [online advertising] on Yahoo. They hard-code a shortcut to their own products.
If you dig deeper, it turns out that Google emailed talkorigins.org to alert the site that it had been hacked and was stuffed with rape and animal porn spam. Google's head of webspam has posted a full write-up.
Fair points. In my experience at Google, we try to crawl at least a little bit from any site that might prove useful, but PageRank is also a large factor in crawling, so that helps to avoid infinite spaces.
I do think you make an important distinction between crawling and indexing though, because they don't have to be identical. Anyway, if you did the study--nice job. We all enjoyed reading it at Google. :)
I really enjoyed the crawl analysis on drunkmenworkhere.org (and lots of Googlers enjoyed reading it), but I wouldn't necessarily agree that Slurp is the deepest crawling in the general case. In the case of a (mostly) empty website with a huge binary tree under it, quite a few of my colleagues would argue that it's good to have some pages from the tree, but that you don't want to have too many. Most of the pages were pretty empty other than the spambot comments, so you could argue that it might be better to crawl fewer of those low-content pages.
Look at it from another perspective: here's another infinite spider trap: http://infinitetree.com/big-tree/node/01011000110/
A good index selection algorithm would probably notice the near-duplicate nature of many of these pages, and select only a sampling for indexing. Crawling too many of those low-content pages would be a bad idea.
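To make the "select only a sampling" idea concrete, here is a minimal sketch of one way near-duplicates could be spotted: word shingles compared with Jaccard similarity, keeping a limited number of pages per cluster. This is an illustration only, not a description of how Google selects pages for its index. The function names, the 0.8 threshold, and the example.com urls are all assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_for_index(pages, threshold=0.8, max_per_cluster=1):
    """Greedily pick pages, keeping at most max_per_cluster near-duplicates.

    pages maps url -> page text (a hypothetical input format).
    """
    clusters = []  # each entry: [representative shingle set, pages kept so far]
    selected = []
    for url, text in pages.items():
        sig = shingles(text)
        for cluster in clusters:
            if jaccard(sig, cluster[0]) >= threshold:
                # Near-duplicate of an existing cluster: keep only a sample.
                if cluster[1] < max_per_cluster:
                    cluster[1] += 1
                    selected.append(url)
                break
        else:
            # Nothing similar seen so far: start a new cluster with this page.
            clusters.append([sig, 1])
            selected.append(url)
    return selected


pages = {
    "http://example.com/node/0": "this node has no content yet please leave a comment",
    "http://example.com/node/1": "this node has no content yet please leave a comment",
    "http://example.com/about": "a completely different article about crawling and indexing the web",
}
print(select_for_index(pages))
```

The greedy pass compares each page only against one representative per cluster, which keeps the sketch short; a real system would need something far cheaper than pairwise comparison at web scale.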
The original story isn't true. A url went live and Google indexed it 3-4 days later. The people who wrote about Charlie Sheen and 9/11 assumed that because it didn't appear faster, Google was somehow making a statement about Charlie Sheen and the 9/11 story. Nope. Google didn't do anything special (good or bad) for that url. Sometimes it takes time for a search engine to crawl/index/return pages.
I haven't checked out the newest claim, but given the typos on the page you point to, such as "apparant" and "its" (needs an apostrophe) and "supressing", I'm inclined to be skeptical (again). For example, news stories exit the Google News index after a few weeks as part of normal operations. Also, just because prisonplanet.com is a news source doesn't mean that infowars.com is a news source. And even if prisonplanet.com is a news source, that doesn't mean that Google knows how to index articles on every section of prisonplanet.com. It's unclear exactly what that url is claiming, but I would first look for non-tinfoil-hat explanations, such as the three that I just mentioned.
Shane, you need to put this up on a blog somewhere; that's pretty funny. The whole notion of sending a noname proposal and a bigname proposal is really wild.
If you want to get to the main version of Google and skip the localized version, here's a page that describes how to do it without allowing cookies.
http://www.tech-recipes.com/google_tips769.html
Hope that helps; if you want to search without a cookie, that's fine with us. :)
GoogleGuy
"Nobody has proved what little factual information they [CNET] conveyed was false."
Not so--CNET admits it themselves. If you read the article in question, CNET did correct false information from their original article. CNET added this disclaimer after they wrote the article:
"Correction: The original article incorrectly implied that Google Desktop Search can track what's stored on a user's PC. The service does not expose a user's content to Google or anyone else without the user's explicit permission."
I can see both sides of this issue; I just wanted to point out that CNET did imply incorrect things in their privacy article; points to them for adding the correction afterwards.
So if I understand the Yahoo News story, you can go to http://code.google.com/vlc-diff.txt, around line 389, and comment that out. The patch presumably just makes the change in the executable to comment this out, yah?
I checked into this. It turns out that the product name is 5062AF, and the page you wanted only has "5062AF" on the page--not 5062 by itself surrounded by whitespace. If you do the query [concord 5062af] then the page that you wanted shows up at #2, after a PC Magazine review, which would be a pretty solid result too.
It's interesting to think about indexing 5062af as 5062 as well, but some searches would probably become less precise because we added in more general matches.
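The tradeoff can be sketched with a toy matcher. This is an illustration under simple assumptions (whitespace tokenization, every query term must appear on the page), not how Google tokenizes or ranks. The function names and the two sample pages are made up.

```python
import re


def tokenize(text):
    """Lowercase and split on whitespace, so "5062AF" stays a single token."""
    return text.lower().split()


def tokenize_loose(text):
    """Like tokenize, but also index the leading digits of tokens such as 5062af."""
    tokens = tokenize(text)
    extra = []
    for tok in tokens:
        m = re.match(r"(\d+)[a-z]+$", tok)
        if m:
            extra.append(m.group(1))
    return tokens + extra


def matches(query, page_text, page_tokenizer=tokenize):
    """True if every query term appears among the page's indexed tokens."""
    page_tokens = set(page_tokenizer(page_text))
    return all(term in page_tokens for term in tokenize(query))


camera = "Concord 5062AF digital camera review"
other = "Concord 5062XY tripod mount"

print(matches("concord 5062", camera))                    # strict: no match
print(matches("concord 5062af", camera))                  # strict: match
print(matches("concord 5062", camera, tokenize_loose))    # loose: match
print(matches("concord 5062", other, tokenize_loose))     # loose: also matches an unrelated product
```

The last call shows the loss of precision: once 5062 is indexed on its own, any page with a token that starts with those digits becomes a candidate.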
We're also offering a python2.2 program that will run on your computer and generate a Sitemap for you. I think that's what has the Creative Commons license. Google wants Sitemaps to be open/available to anyone that's interested in creating or using them (including other search engines, if they're interested).
I think the Sitemaps link points to a "normal" webserver, as opposed to our custom setup. Plus the Sitemaps stuff is using https. Looks like a higher amount of interest than a typical Slashdotting too. I alerted the Sitemaps team, but you may have to wait for the techie stampede to subside. :)
There's python2.2 code to generate Sitemaps for people. I believe that's what was released under Creative Commons. The intent is to make this open and widely available to anyone that wants to use it.
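For a rough idea of what generating a Sitemap involves, here is a minimal sketch in modern Python. It is not the python2.2 program mentioned above. It assumes the XML layout and namespace of the sitemaps.org protocol (a urlset element containing url/loc entries), which may differ from what the original tool emitted. The function name and the example.com urls are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol (an assumption about the target format).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a Sitemap XML string listing each url in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        url_el = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the text for us.
        ET.SubElement(url_el, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    "http://example.com/",
    "http://example.com/search?a=1&b=2",
])
print(sitemap_xml)
```

A real generator would also walk the site or its logs to discover urls and would write the result to a file; this sketch covers only the serialization step.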
Thanks for mentioning this. I forwarded your url within Google, and someone is investigating. They're checking for agencies or spots that we're using for the Google Desktop Search.
Thanks again--I appreciate you noticing this.
GoogleGuy
Huh. I guess the sharks with frickin' lasers on their heads must have let their guards down.
Seriously, every search engine does evaluation on their results. It's a good way to test that relevance is high, especially in different languages and locations. The fact that Google does lots of testing and evaluation of our results in tons of different ways shouldn't be a surprise. That's part of the 70-30 breakdown where ~70% of our effort is on the core areas of search and advertising, but we usually don't talk about that 70% work to improve our results or validate their quality. So keep it quiet; I hear some other search engines read Slashdot too. ;)
The 20% projects work well in my experience. Sometimes you have to take the initiative to make sure you take that time, but you usually end up doing fun, search-y type stuff. And you end up meeting other people from different parts of Google, and getting familiar with new/different bits of the Google code base. It's also a good way to break out of a rut and make sure that you think about "bigger picture" issues. If you end up crunching on an important project, you can also bank that 20% time and use it up later.
claus, I'm glad that you mentioned this search. I looked through those 100 results. Every example that I saw in those results was from a while ago--they were all listed with the Supplemental Result tag. So this is already handled correctly in our main index, and as urls are updated in the supplemental index, those examples should be handled correctly as well.
Thanks for mentioning this search; it's a good point. We've already made some changes to improve our heuristics, and you can see that improvement in the fact that current urls look better than the supplemental urls.
allinurl:foo.com says "show me all the results you know of that have foo.com in the url." And since com is a stopword, I wouldn't be surprised if this really just said "show me all the results with foo in the url"--that is, without the .com. You could force the .com to also match by using allinurl:foo-com to make it a phrase match, I believe.
So bar.com/dir1/dir2/foo.com would be a valid result for that search, for example. But that doesn't mean that we've confounded bar.com with foo.com. bar.com may do a 302 to foo.com or it may not, but it's not a hijacking. We're just showing all the results we know of with "foo.com" in the url; the fact that some of those results are not on foo.com isn't really a problem. Now if you did site:foo.com and saw results from bar.com, that's something that I would email to us.
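The difference between the two operators can be sketched with a toy model. This is an illustration of the behavior described above, not Google's implementation: it assumes urls are split into terms on non-alphanumeric characters, that "com" is dropped as a stopword, and that site: compares only the hostname. The function names and sample urls are made up.

```python
import re
from urllib.parse import urlparse

STOPWORDS = {"com"}  # assumption for illustration


def url_terms(url):
    """Split a url (or query) into lowercase alphanumeric terms."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def allinurl(query, urls, stopwords=STOPWORDS):
    """Return urls containing every non-stopword query term anywhere in the url."""
    wanted = [t for t in url_terms(query) if t not in stopwords]
    return [u for u in urls if all(t in url_terms(u) for t in wanted)]


def site(domain, urls):
    """Return urls whose hostname is the domain or one of its subdomains."""
    domain = domain.lower()
    results = []
    for u in urls:
        host = urlparse(u).hostname or ""
        if host == domain or host.endswith("." + domain):
            results.append(u)
    return results


urls = [
    "http://foo.com/",
    "http://bar.com/dir1/dir2/foo.com",
    "http://bar.com/foo/page.html",
    "http://baz.org/",
]
print(allinurl("foo.com", urls))  # includes the bar.com urls that mention foo
print(site("foo.com", urls))      # only urls hosted on foo.com
```

In this model, bar.com urls showing up for the allinurl: query is expected, while the same urls showing up for the site: query would indicate a real problem.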
Different folks often hit different data centers because of load balancing and stuff like that. I'll certainly keep an eye on this search myself too though.
You bet. If you want to make sure that we have the info to check it out, you can go to google.com/support and when you get to a form where you can enter info, just use canonicalpage as the subject line. We are collecting data through user support to build up a test set for checking any changes we want to try.
It's me. I've had the GoogleGuy handle since Jan 19th, 2005. From the K5 article, the allinurl: stuff isn't true though; allinurl: just looks for the term in the url. So [allinurl:imatix.com] can show results from any site that has imatix in the url.
I posted elsewhere on this thread, but if you want to give a couple examples of searches that didn't work well, I'll ask someone to check them out.
If you want to post some specific examples of poor search results, I'd be happy to pass them on for someone to check out.
Nice. Very nice. :)
We definitely do read Slashdot. I'm sure the Google Scholar folks will read the original article, plus the comments here.
I'll pass on the feedback--thanks for mentioning it.