NCSA Issues Disclaimer on Google/Yahoo Study

Posted by Hemos on Monday August 22, 2005 @04:15AM from the point-counterpoint dept.

Jean Veronis writes "NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradict the claim that Yahoo's index is bigger than Google's: 'Staff at the NCSA noted several issues with the study'. This study, conducted by students, is 'not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA'."

11 of 118 comments

Disclaimer Text by Stanistani · 2005-08-22 04:24 · Score: 5, Interesting

From http://vburton.ncsa.uiuc.edu/indexsize.html:
"The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of the following page.

Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.

A Comparison of the Size of the Yahoo and Google Indices "

--
You can't talk about Wikipedia's flaws on Wikipedia
... so? by DrEldarion · 2005-08-22 04:24 · Score: 2, Interesting

Okay, changes have been made to it, but the outcome is still the same. Why does this matter, then?
1. Re:... so? by Anonymous Coward · 2005-08-22 04:48 · Score: 2, Interesting
  
  Preliminary results (from 7000 test queries) indicates that the results of this verification study confirms the conclusions of this study, but final results are still forthcoming.
  
  Looks like they're still doing some looking to make sure their results are rock solid, but so far they seem to be. As such, the current state of reality is that Google has a much bigger index of the world wide web (or Internet, or whatever you want to call it) than Yahoo. Yahoo may have a bigger index squirreled away somewhere, but they are not making it available to the public via their search. As such, their CEO's (mis)statements on the subject are very much misleading and in dispute. Anyways, as anyone who searches knows, Google is better.*
  
  --
  * The preceding was my opinion of the facts. YMMV. Please search responsibly.
2. Re:... so? by mi · 2005-08-22 05:16 · Score: 1, Interesting
  
  The whole method seems flawed. Trying to compare the sizes of two sets by the sizes of various subsets makes sense only if the method of selecting the subsets is the same.
  This is not the case. The methods depend on each search engine's algorithms and are very likely to differ greatly.
  In any case, whether a particular query returns 40 results or 40000 does not matter -- only the first 20 are ever of any use...
  
  --
  In Soviet Washington the swamp drains you.
/. 503 error by dhasenan · 2005-08-22 04:25 · Score: 2, Interesting

Off topic...

Anyone else get 503 errors when trying to reach Slashdot?

Where do you go to talk about Slashdot being Slashdotted?
Why is the disclaimer needed? by frdmfghtr · 2005-08-22 04:48 · Score: 2, Interesting

I didn't read the article the first time around, so maybe something was changed/removed that prompted the disclaimer. I read the report and couldn't find a single reference to NCSA, except in the URL and in the disclaimer itself.

Aside from the URL, was there some sort of NCSA association implied or claimed in the original post then removed?

--
Government's idea of a balanced budget: take money from the right pocket to balance...oh who am I kidding?
Maybe those pages never were crawled by yahoo. by MushMouth · 2005-08-22 04:51 · Score: 4, Interesting

It is entirely possible that Yahoo not only crawls more content on the whole, but that a smaller percentage of the content they crawl is spam. Crawlers get to spam/SEO pages via other spam/SEO pages, so if they have better filtering mechanisms they may simply stop earlier and put their (not completely unlimited) resources into crawling more useful data.
Re:The dark web by markov_chain · 2005-08-22 05:19 · Score: 2, Interesting

Interesting.

The original search ranking papers (Kleinberg's algorithm, PageRank) made the ground-breaking observation that links between pages contain lots of useful information that can be used for ranking in addition to the keywords contained in the pages themselves. This was at a time when websites were mostly personal, and there was an atmosphere of friendly sharing of information where people would link to other sites that they found interesting. However, how much have things changed today? Who still has their own web page? Aren't links mostly commercial? Like you said, there may be valuable sites out there that are getting ignored because of over-reliance on links as a source of reputation.

Maybe it's time for a "people who went to site A also went to site B" technology. It would require running a client-side traffic monitor that would build these adjacency lists and send them back. If it was open sourced and anonymous, the privacy concerns would be minimal, and it would provide a usage-based source of reputation.
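For what it's worth, the adjacency-list idea could be sketched roughly like this. This is only a minimal illustration: it assumes each client reports nothing more than an anonymous list of sites visited in one session, and the function names (`build_covisit_lists`, `also_visited`) and the sample sessions are made up for the example, not taken from any existing tool.

```python
from collections import defaultdict
from itertools import combinations


def build_covisit_lists(sessions):
    """Count how often two sites appear in the same anonymous session.

    `sessions` is an iterable of lists of site names (a hypothetical
    input format). Returns {site: {other_site: shared_session_count}}.
    """
    covisits = defaultdict(lambda: defaultdict(int))
    for visited in sessions:
        # Deduplicate so repeated visits within one session count once.
        for a, b in combinations(sorted(set(visited)), 2):
            covisits[a][b] += 1
            covisits[b][a] += 1
    return covisits


def also_visited(covisits, site, top_n=3):
    """Return up to `top_n` sites most often co-visited with `site`."""
    neighbours = covisits.get(site, {})
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Made-up sessions, for illustration only.
sessions = [
    ["a.example", "b.example", "c.example"],
    ["a.example", "b.example"],
    ["b.example", "d.example"],
]
covisits = build_covisit_lists(sessions)
print(also_visited(covisits, "a.example"))
```

The counts themselves would be the usage-based reputation signal; how to weight them against link-based scores is left open here.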

--
Tsunami -- You can't bring a good wave down!
Accuracy of Google counts? by xiaomonkey · 2005-08-22 05:58 · Score: 5, Interesting
Try the following sets of key words on Google:
- lawyer - results 29,300,000
- lawyer lawyer - results 29,300,000
- lawyer lawyer lawyer - results 62,000,000
- lawyer lawyer lawyer lawyer - results 78,600,000
This trend appears to continue: repeating the "lawyer" keyword 10 times results in Google estimating that there are 389,000,000 hits in its index.

On Yahoo, this sort of thing doesn't seem to happen as much, but it still does happen. For example, searching for "lawyer" returns 124,000,000 results, and searching for "lawyer lawyer" or "lawyer lawyer" returns 125,000,000 results.

So, it probably doesn't really make sense to judge the relative size of either index based on the estimated number of hits for any given set of keywords. Right now, Google's numbers look a little more suspect since they seem to vary so greatly just based on the repetition of a keyword. However, the stability of Yahoo's numbers doesn't necessarily mean that they're correct either.
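To put a number on how much the estimates drift, here is a small Python sketch using only the counts quoted above. It rests on one assumption that the engines do not document: that repeating a keyword should not enlarge the set of matching pages, so any ratio well above 1 points at the estimate rather than the index. The function names are just for this example.

```python
# Estimated hit counts quoted above, keyed by how many times
# the keyword "lawyer" was repeated in the query.
google_counts = {
    1: 29_300_000,
    2: 29_300_000,
    3: 62_000_000,
    4: 78_600_000,
    10: 389_000_000,
}
yahoo_counts = {1: 124_000_000, 2: 125_000_000}


def inflation_ratios(counts):
    """Ratio of each estimate to the single-keyword estimate."""
    base = counts[1]
    return {n: c / base for n, c in sorted(counts.items())}


def max_inflation(counts):
    """Largest ratio seen; 1.0 would mean perfectly stable estimates."""
    return max(inflation_ratios(counts).values())


for name, counts in (("Google", google_counts), ("Yahoo", yahoo_counts)):
    print(f"{name}: worst-case inflation x{max_inflation(counts):.2f}")
```

With these figures the Google estimate grows by more than a factor of ten, while the Yahoo estimate moves by under one percent.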
More thoughts on a better test by freality · 2005-08-22 06:23 · Score: 4, Interesting

After criticising the study when I first saw it, I now have some constructive ideas on how to perform a better test of the relative search performance of the two engines.

- Crawler Test

Do a search of "microsoft site:microsoft.com" in Google, Yahoo and MSN search. Assume that Microsoft knows how to crawl its own site completely and judge the relative strength of Google's and Yahoo's crawlers based on how many of those pages they find. Google easily wins.

Unfortunately, this test doesn't work with Yahoo as the reference site since Google returns no hits for "yahoo site:yahoo.com". This is very disappointing, as Yahoo is one of the largest sites on the Web. Neither Yahoo nor MSN shares this self-censorship policy.

Another test of the same kind is "amazon site:amazon.com", since amazon is a very big site which everyone is presumably very interested in crawling well. Unfortunately, amazon doesn't allow this kind of search on themselves via their A9 engine.

This is an interesting test because it compares the very likely actual size of a site (i.e. Microsoft's reported size for microsoft.com) with the size reported by the second-party search engine. This may be the best 3rd party test of a crawler's Recall on a per-site basis.
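The per-site Recall figure described here boils down to a simple ratio. A minimal Python sketch follows, assuming the site owner's own count is treated as ground truth; the engine names and page counts are placeholders invented for the example, not measurements of any real engine.

```python
def per_site_recall(engine_count, reference_count):
    """Fraction of the reference page count that an engine reports.

    `reference_count` is the count from the site's own search, treated
    here (as an assumption) as the true size of the site.
    """
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return engine_count / reference_count


# Placeholder numbers only, NOT real measurements.
reference = 1_000_000
reported = {"engine_a": 800_000, "engine_b": 250_000}

for name, count in sorted(reported.items()):
    print(f"{name}: recall ~ {per_site_recall(count, reference):.2f}")
```

Since the reported counts are themselves estimates, a ratio above 1 is possible and would be a sign that the numbers are not directly comparable.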

- Common Word Test

Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.

This is an interesting test because the word "the" is probably as uniformly distributed in the source webpage population as the random phrases used in the NCSA test (or possibly more so), plus this word is more likely than any other to occur in every web document (at least in English). These two characteristics mean that finding the most pages with "the" may be the closest approximation of an actual Recall measurement for all sites on the web that can be done without a prior-knowledge testing set.

- Conclusion

Google and MSN seem to have high-quality crawlers, whereas Yahoo makes up for it with a much larger current index size. However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with.

Even assuming these results, Yahoo's bigger index isn't necessarily more desirable. A better crawler can yield a bigger and better index with relatively little work (just add more machines to store it on), while there is no easy fix for the meticulous work needed to ensure a crawler is getting to all the nooks and crannies of the Web.

Longer-term, it is these caveats that make an Open-Source approach to crawling shine by comparison. Take for example the Nutch search engine. Though it is in its early days, there need be no doubt about its crawling and ranking algorithms or its Precision/Recall tradeoff.
Study: Red Delicious apples != Fuji apples by RunzWithScissors · 2005-08-22 06:32 · Score: 3, Interesting

I got flamed for proposing this theory when the article was first posted on /.

One major problem with the study, not really addressed by this article on its problems, is one of comparison. Yes, Yahoo! and Google are two search engines, but they perform their searches differently, and more importantly, use different criteria when returning matches! It is quite possible that doing the exact same search in both yields different results. Why? Because the two search engines have different criteria for providing output, or matches if you will, from said search. Perhaps Yahoo! does indeed have more pages indexed, but because of their search algorithm, or their program which displays results to the user, fewer matches are provided, even though more pages were looked at.

I'm not saying that Yahoo! does have more or fewer pages than Google. But the study that was executed and published did not account for many of the differences between the products they were comparing. The above is merely another interpretation of said results. I'm glad that some folks at NCSA agree and provided some clarification, lest we get another urban myth like storks bringing babies.

-Runz