NCSA Issues Disclaimer on Google/Yahoo Study

Posted by Hemos on Monday August 22, 2005 @04:15AM from the point-counterpoint dept.

Jean Veronis writes "NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradict the claim that Yahoo's index is bigger than Google's: 'Staff at the NCSA noted several issues with the study'. This study, conducted by students, is 'not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA'."

Disclaimer Text by Stanistani · 2005-08-22 04:24 · Score: 5, Interesting

From http://vburton.ncsa.uiuc.edu/indexsize.html:
"The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of the following page.

Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.

A Comparison of the Size of the Yahoo and Google Indices "

--
You can't talk about Wikipedia's flaws on Wikipedia

Maybe those pages never were crawled by yahoo. by MushMouth · 2005-08-22 04:51 · Score: 4, Interesting

It is entirely possible that Yahoo not only crawls more content on a whole, but the percentage of the content that they crawl that is spam is smaller. Crawlers get to spam/SEO pages via other spam/SEO pages, thus if they have better filtering mechanisms they may simply stop earlier and put their (not completely unlimited) resources into crawling more useful data.
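The effect described above can be illustrated with a toy crawl. This is only a sketch under stated assumptions: the link graph, the page names, the fixed fetch budget, and the `looks_like_spam` check are all hypothetical stand-ins, not anything Yahoo or Google is known to do.

```python
from collections import deque

# Toy link graph (entirely hypothetical): page -> pages it links to.
# Names starting with "spam" stand in for spam/SEO pages.
LINKS = {
    "home": ["news", "spam1", "docs"],
    "news": ["blog"],
    "docs": ["faq"],
    "blog": [],
    "faq": [],
    "spam1": ["spam2", "spam3"],
    "spam2": ["spam4"],
    "spam3": [],
    "spam4": [],
}


def crawl(start, budget, is_spam=None):
    """Breadth-first crawl that fetches at most `budget` pages.

    If `is_spam` is given, pages it flags are skipped: they are not
    fetched and their outgoing links are never followed.
    """
    fetched = []
    seen = {start}
    queue = deque([start])
    while queue and len(fetched) < budget:
        page = queue.popleft()
        if is_spam is not None and is_spam(page):
            continue  # stop early: spend no budget on this branch
        fetched.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def looks_like_spam(page):
    # Stand-in for a real spam classifier (assumption for illustration).
    return page.startswith("spam")


unfiltered = crawl("home", budget=5)
filtered = crawl("home", budget=5, is_spam=looks_like_spam)
print("without filter:", unfiltered)
print("with filter:   ", filtered)
```

With the same budget of five fetches, the unfiltered crawl spends one of them on a spam page, while the filtered crawl never enters the spam branch and reaches one more useful page instead.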
Accuracy of Google counts? by xiaomonkey · 2005-08-22 05:58 · Score: 5, Interesting

Try the following sets of key words on Google:
- lawyer - results 29,300,000
- lawyer lawyer - results 29,300,000
- lawyer lawyer lawyer - results 62,000,000
- lawyer lawyer lawyer lawyer - results 78,600,000
This trend appears to continue: repeating the "lawyer" keyword 10 times results in Google estimating that there are 389,000,000 hits in its index.

On Yahoo, this sort of thing doesn't seem to happen as much, but it still does happen. For example, searching for "lawyer" returns 124,000,000 results, and searching for "lawyer lawyer" or "lawyer lawyer" returns 125,000,000 results.

So, it probably doesn't really make sense to judge the relative size of either index based on the estimated number of hits for any given set of keywords. Right now, Google's numbers look a little more suspect since they seem to vary so greatly just based on the repetition of a keyword. However, the stability of Yahoo's numbers doesn't necessarily mean that they're correct either.
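To put a number on how much the Google estimates move, here is a minimal Python sketch that divides each count quoted above by the single-keyword count. The figures are the ones from this comment; the name `inflation_factors` is just an illustrative choice.

```python
# Estimated Google hit counts quoted in this comment, keyed by how many
# times the keyword "lawyer" was repeated in the query.
google_counts = {
    1: 29_300_000,
    2: 29_300_000,
    3: 62_000_000,
    4: 78_600_000,
    10: 389_000_000,
}


def inflation_factors(counts):
    """Return each count divided by the single-keyword count."""
    baseline = counts[1]
    return {reps: hits / baseline for reps, hits in counts.items()}


factors = inflation_factors(google_counts)
for reps, factor in sorted(factors.items()):
    print(f"{reps:>2} x 'lawyer': {factor:.2f}x the single-keyword estimate")
```

Repeating a word cannot add matching documents, so a true match count should stay flat. By this measure the 10-repetition estimate is roughly 13 times the single-keyword one, which suggests the figures are loose estimates rather than counts.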
More thoughts on a better test by freality · 2005-08-22 06:23 · Score: 4, Interesting

After criticising the study when I first saw it, I now have some constructive ideas on how to perform a better test of the relative search performance of the two engines.

- Crawler Test

Do a search of "microsoft site:microsoft.com" in Google, Yahoo and MSN search. Assume that Microsoft knows how to crawl its own site completely and judge the relative strength of Google's and Yahoo's crawlers based on how many of those pages they find. Google easily wins.

Unfortunately, this test doesn't work with Yahoo as the reference site since Google returns no hits for "yahoo site:yahoo.com". This is very disappointing, as Yahoo is one of the largest sites on the Web. Neither Yahoo nor MSN shares this self-censorship policy.

Another test of the same kind is "amazon site:amazon.com", since Amazon is a very big site which everyone is presumably very interested in crawling well. Unfortunately, Amazon doesn't allow this kind of search on itself via its A9 engine.

This is an interesting test because it compares the very likely actual size of a site (i.e. Microsoft's reported size for microsoft.com) with the size reported by the third-party search engine. This may be the best third-party test of a crawler's Recall on a per-site basis.
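The per-site Recall described here is just a ratio: the page count an outside engine reports for the site, divided by the count the site's own engine reports. A minimal Python sketch follows. The page counts are made-up placeholders, not measurements, and `per_site_recall` is a hypothetical helper name.

```python
def per_site_recall(engine_count, reference_count):
    """Estimate a crawler's recall for one site.

    engine_count: pages the outside engine reports for the site query.
    reference_count: pages the site's own engine reports (assumed complete).
    """
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    # Reported counts are estimates, so cap the ratio at 1.0.
    return min(engine_count / reference_count, 1.0)


# Placeholder numbers for illustration only; they are NOT real results.
reference = 1_000_000
reported = {"engine_a": 800_000, "engine_b": 250_000}

recalls = {name: per_site_recall(n, reference) for name, n in reported.items()}
for name, recall in sorted(recalls.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {recall:.0%} of the reference count")
```

The cap at 1.0 is a design choice for the sketch: since every figure involved is an engine's own estimate, an outside engine can report more pages than the reference, and that should not read as better-than-perfect recall.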

- Common Word Test

Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.

This is an interesting test because the word "the" is probably as uniformly distributed in the source webpage population as the random phrases used in the NCSA test (or possibly more so), plus this word is more likely than any other to occur in every web document (at least in English). These two characteristics mean that finding the most pages with "the" may be the closest approximation of an actual Recall measurement for all sites on the web that can be done without a prior-knowledge testing set.

- Conclusion

Google and MSN seem to have high-quality crawlers, whereas Yahoo makes up for it with a much larger current index size. However, take this with a grain of salt, as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with.

Even assuming these results, Yahoo's bigger index isn't necessarily more desirable. A better crawler can yield a bigger and better index with relatively little work (just add more machines to store it on), while there is no easy fix for the meticulous work needed to ensure a crawler is getting to all the nooks and crannies of the Web.

Longer-term, it is these caveats that make an Open-Source approach to crawling shine by comparison. Take for example the Nutch search engine. Though it is in its early days, there need be no doubt as to its crawling and ranking algorithms or its Precision/Recall tradeoff.