NCSA Compares Google and Yahoo Index Numbers
chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to 'over 20 billion items', compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own independent comparison of the two engines."
Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.
75% less truth than other leading brand
A feeling of having made the same mistake before: Deja Foobar
Try searching for the word "failure" in Google and check the results.
This calls into question what counts as *accurate* results. In this case, it appears that's left to interpretation.
"Simplify, simplify, simplify!" Thoreau
In other words, they believe Google indexes more items based on their own tests of searching.
Tech, life, family, faith: Give me a visit
TFA notes that queries with greater than 1,000 results were dropped from the survey, because Google and Yahoo both truncate their results to 1,000.
That makes sense, but it does stand to reason (or, at least, to my reason) that the queries garnering large numbers of results could have had a significant impact on the bottom line of the survey.
Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.
This is a great article! I wish there were more like it on Slashdot. It's scientific instead of an opinion piece, it has references, and it's repeatable. It's also short and very readable, unlike a lot of science papers.
OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".
I quit!
Google started treating plurals as the same search about a year ago. Yahoo doesn't. So, if you google for "inkjet printers" and "inkjet printer" you will get the same result set; however, on Yahoo, you will get different results.
The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
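To make the plural point concrete, here is a minimal sketch of a toy inverted index with and without plural folding. The three documents, the function names, and the "strip a trailing s" rule are all made up for illustration; neither engine's actual stemming is known from this thread.

```python
from collections import defaultdict

def normalize(term, fold_plurals):
    """Lowercase a term and optionally strip a trailing 's' (a deliberately naive plural rule)."""
    term = term.lower()
    if fold_plurals and term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term

def build_index(docs, fold_plurals):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word, fold_plurals)].add(doc_id)
    return index

def search(index, query, fold_plurals):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = [normalize(word, fold_plurals) for word in query.split()]
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical three-page "web"; the index size is identical in both setups.
DOCS = {
    1: "cheap inkjet printer",
    2: "inkjet printers on sale",
    3: "laser printers",
}

folded = build_index(DOCS, fold_plurals=True)
exact = build_index(DOCS, fold_plurals=False)

for query in ("inkjet printer", "inkjet printers"):
    print(query, "| folded:", sorted(search(folded, query, True)),
          "| exact:", sorted(search(exact, query, False)))
```

With folding, both queries match the same two documents; without it, each matches only one, even though the same three documents are indexed either way.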
Search: Valerie Plame
Google: 908,000
Yahoo: 2,580,000
Search: "Boulder, Colorado"
Google: 1,600,000
Yahoo: 5,880,000
Search: "Linus Torvalds"
Google: 2,560,000
Yahoo: 5,870,000
I assume it goes on like this. Of course, these all exceed the 1,000-result limit used in the study.
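For anyone who wants the ratios rather than the raw counts, a quick sketch using only the three queries above. The names are made up, and the rule "a query is usable only if both engines report 1,000 or fewer results" is my reading of the cutoff described in TFA, not something taken from the study's code.

```python
# Reported hit counts from the three searches above: (google, yahoo).
COUNTS = {
    "Valerie Plame": (908_000, 2_580_000),
    '"Boulder, Colorado"': (1_600_000, 5_880_000),
    '"Linus Torvalds"': (2_560_000, 5_870_000),
}

RESULT_CAP = 1_000  # both engines truncate result lists at 1,000

def yahoo_to_google_ratio(google, yahoo):
    """How many times more hits Yahoo reports than Google for one query."""
    return yahoo / google

def usable_in_study(google, yahoo, cap=RESULT_CAP):
    """Assumed filter: keep a query only if both counts fit under the cap."""
    return google <= cap and yahoo <= cap

for query, (google, yahoo) in COUNTS.items():
    ratio = yahoo_to_google_ratio(google, yahoo)
    print(f"{query}: Yahoo/Google = {ratio:.2f}, usable in study: {usable_in_study(google, yahoo)}")
```

The ratios come out between roughly 2.3 and 3.7 in Yahoo's favor, and none of the three queries would survive the filter. These are the engines' own reported estimates, so they say nothing certain about what is actually retrievable.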
The most basic measure of performance in Information Retrieval is precision vs. recall.
Precision is the fraction of the results you return that are correct. For example, if Google returns 100 results and 10 of them are correct, then the precision on that query is 10%.
Recall is the fraction of all correct results that you return. For example, if Yahoo returns 100 correct results out of a total of 1,000 correct matches, then the recall on that query is 10%.
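The two definitions can be written as a short sketch over sets of document ids. The function names and the integer ids are illustrative only, and the recall example assumes every one of the 100 returned results is correct.

```python
def precision(returned, relevant):
    """Fraction of returned documents that are relevant."""
    if not returned:
        return 0.0
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Fraction of all relevant documents that were returned."""
    if not relevant:
        return 0.0
    return len(returned & relevant) / len(relevant)

# Precision example: 100 results returned, 10 of them correct.
returned_a = set(range(100))
relevant_a = set(range(90, 100))
print("precision:", precision(returned_a, relevant_a))

# Recall example: 100 correct results returned out of 1,000 correct matches.
returned_b = set(range(100))
relevant_b = set(range(1000))
print("recall:", recall(returned_b, relevant_b))
```

Note that in the second example precision is 100% while recall is 10%, which is exactly the kind of trade the next paragraph is about.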
Information retrieval systems such as search engines balance these two metrics -- which are fundamentally at odds with each other -- to give the "best balance" in the eyes of the system's designers.
The NCSA study overlooks the effect this tuning decision would have on the perceived size of the index.
A simple demonstration shows how it works.
First, let's say both search engines have the same index size: 10B pages. Second, let's say both search engines have exactly the same a priori capability for precision and recall, but can tune for a preferred performance. Yahoo decides it wants to favor more precise results over more results recalled, at a 2:1 relative ratio compared to Google.
In that case, any given query will show half as many hits from Yahoo as from Google. Concluding from this result that Yahoo's index is half the size of Google's would be incorrect.
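The thought experiment can be sketched in a few lines. Everything here is hypothetical: the 1,000 matching pages, their integer relevance scores, and the two cutoffs were picked only so that the 2:1 ratio falls out; they are not measurements of either engine.

```python
INDEX_SIZE = 10_000_000_000  # both engines index 10B pages in the thought experiment

# Hypothetical: the same 1,000 indexed pages match a query in both engines,
# with integer relevance scores 0..999 from an identical scoring function.
MATCH_SCORES = list(range(1000))

def reported_hits(scores, cutoff):
    """Count the matches an engine chooses to report: those scoring at or above its cutoff."""
    return sum(1 for score in scores if score >= cutoff)

GOOGLE_CUTOFF = 200  # favors recall: reports more, lower-scoring matches
YAHOO_CUTOFF = 600   # favors precision: reports only higher-scoring matches

google_hits = reported_hits(MATCH_SCORES, GOOGLE_CUTOFF)
yahoo_hits = reported_hits(MATCH_SCORES, YAHOO_CUTOFF)

# The flawed inference: scale the index size by the ratio of reported hits.
naive_yahoo_index_estimate = INDEX_SIZE * yahoo_hits // google_hits

print("Google hits:", google_hits)
print("Yahoo hits:", yahoo_hits)
print("Naive estimate of Yahoo's index:", naive_yahoo_index_estimate)
print("Actual Yahoo index:", INDEX_SIZE)
```

The hit counts differ by a factor of two, and the naive estimate lands on 5B pages, yet both indexes are 10B pages by construction.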
Furthermore, without knowing the precision/recall performance of either system, they can only demonstrate a lower bound on index size, and that certainly doesn't predict the average or maximum index size.