NCSA Compares Google and Yahoo Index Numbers
chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own, independent, comparison of the two engines. "
Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.
75% less truth than other leading brand
A feeling of having made the same mistake before: Deja Foobar
Try searching for the word, "failure" in Google and check the results.
This brings into question *accurate* results. In this case it appears that's left to interpretation.
"Simplify, simplify, simplify!" Thoreau
"Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results. It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Googles index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google. "
but they can't sift through it nearly as well as Google, so what does it matter? Even if you have a bigger dictionary, if you can't speak English at all it won't do you much good.
I never spellcheck and I freely admit it. Save your karma for more worthwhile "lol erorrs" replies
Sorry, but if Google consistently returns more results, it could just as easily mean that the filtering isn't as good.
I still prefer Google though.
In other words, they believe Google indexes more items based on their own tests of searching.
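The percentages in the study summary quoted earlier in the thread can be sanity-checked from the raw counts it gives. This is only a quick arithmetic sketch; the variable names are made up here, and the numbers are copied from the quoted passage.

```python
# Counts copied from the study summary quoted in the thread.
total = 10_012
yahoo_more = 307
google_more = 9_676
tied = 29

# The three outcomes should account for every test case.
assert yahoo_more + google_more + tied == total

def pct(n):
    """Share of all test cases, as a percentage rounded to one decimal."""
    return round(100 * n / total, 1)

shares = {
    "yahoo_more": pct(yahoo_more),
    "google_more": pct(google_more),
    "tied": pct(tied),
}
print(shares)

# "166.9% more results" means roughly this multiple of Yahoo's count.
google_multiple = 1 + 166.9 / 100
print(google_multiple)
```

The counts are consistent with the quoted "3%", "96.6%" and "less than 1%" figures, and "166.9% more" works out to a bit under 2.7 times as many results.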
Tech, life, family, faith: Give me a visit
They only used words from the English Ispell word list. Besides the English-language bias, this is probably limited in other ways. News websites use a limited vocabulary, but a lot of proper names -- so if one engine indexed these better, they wouldn't necessarily get a better rating. News sites are also very dynamic and have a large number of webpages, so they would be influential in the count.
HIV Crosses Species Barrier... into Muppets
Yahoo returns a lot of dupes.
They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.
All they can really show is that google returns more unique results per 1000 (which usually means that more items are indexed, but could be from Google's Pagerank also)...
There is always a frontier where there is an open and willing mind
Why wget instead of LWP?
(B) + (D) + (B) + (D) = (K) + (&)
TFA notes that queries with greater than 1,000 results were dropped from the survey, because Google and Yahoo both truncate their results to 1,000.
That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.
Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.
I don't know about the study but that is the most readable perl code I have seen in a long time.
This is a great article! I wish there were more like it on slashdot. It's scientific instead of an opinion piece, it has references, it's repeatable. It's also short and very readable, unlike a lot of science papers.
OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".
I quit!
The study noted that although Yahoo says they have ~twice as many pages indexed as google, when they queried each engine with two arbitrary words from the dictionary, they got fewer responses from Yahoo.
From this they concluded yahoo's claim of twice as many pages is suspicious.
What's suspicious is that these people consider themselves scientific. What if, for example, Yahoo just returns meaningful results, whereas google returns anything with those words in? For example, what if you search for "faience" and "urbanity" -- maybe google has more results, but maybe they are less pertinent - in other words maybe not only Yahoo has more pages indexed, but they have an algorithm that returns only the most relevant stuff.
Not saying that's the case necessarily, but not mentioning that assumption makes for a worthless study/conclusion. (also if google says they return x results, often when you go to the last page of their results listing you'll notice their total went down, and it's more like x - 10%)
-Josh
While it is true that more results could mean worse filtering, that is a separate test entirely.
I tend to think that ordering is more important than filtering down to a small number of results, since having lots of results returned doesn't hurt if the search engine can order well so that what you want is most likely to be in the top 10-25. This is especially true when there will be at most a couple of results where I'd rather have the search engine try at the ordering and have me do most of the filtering because no search engine is as good as a person at really figuring out what people want, yet.
The very methodology used in this case seems rather incorrect to me.
The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as google, searches should return twice as many entries.
That assumption is flat out incorrect. There are actually multiple problems.
First, the scope of the search (based on index terms) is really up to the search engine itself. Since each search engine does not return the entire database as search results, it is very much up to the individual search algorithm to determine the depth of entries considered to 'match' a set of terms. That's what is really being reflected in these results.. it is not the overall size of the index, but simply how aggressive the search algorithm is in matching terms to entries.
Even if the algorithms were identical (same algorithm being run across both indexes), the nature of search does not scale in that way. If Yahoo has, for instance, become more aggressive in indexing message board and forum content, then only searches that play to those subjects should return more results than Google. Since searches are by definition narrowing on a data set, a methodology needs to be developed that more effectively tests the BREADTH of the results more than simply testing the depth.
Turn s60 photos into awesome videos with mScrapbook for all S60 3rd edition phones!
The study only checked English words. Is it possible that the increase came from Yahoo expanding into more international website markets?
Just a thought
Seriously. 'We wrote a script and here are the results'? This would take an average Perl programmer what -- 30 minutes of work? Has academic research in computing really sunk to this level?
Maybe it's not even worth pointing out how badly flawed (and lazy) the underlying assumption of 'twice the results = twice the index size' probably is, as I'm sure we're going to see a few dozen posts to that effect (unless PageRank really means nothing), but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.
The top of the page return for Yahoo is
"Failure on eBay Find failure items at low prices. "
which illustrates the most important difference between Yahoo and Google.
Intron: the portion of DNA which expresses nothing useful.
Google started treating plurals as the same search about a year ago. Yahoo doesn't. So, if you google for "inkjet printers" and "inkjet printer" you will get the same result set; however, on Yahoo, you will get different results.
The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
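To illustrate the argument, here is a toy sketch of how folding plurals can raise the hit count over an identical index. It assumes the behavior described above; the tiny index, the crude "strip a trailing s" rule, and the function names are all hypothetical and are not how either engine actually works.

```python
def normalize(term, fold_plurals):
    """Crude, hypothetical plural folding: strip one trailing 's'."""
    if fold_plurals and term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term

def search(index, query, fold_plurals):
    """Return ids of documents containing every query term."""
    wanted = {normalize(t, fold_plurals) for t in query.lower().split()}
    return [
        doc_id
        for doc_id, text in index.items()
        if wanted <= {normalize(t, fold_plurals) for t in text.lower().split()}
    ]

# The same three-document index is used for both searches.
index = {
    1: "cheap inkjet printer reviews",
    2: "inkjet printers on sale",
    3: "laser printers",
}

folded = search(index, "inkjet printers", fold_plurals=True)
exact = search(index, "inkjet printers", fold_plurals=False)
print(len(folded), len(exact))
```

With folding on, both inkjet documents match; with exact matching, only the one containing the literal word "printers" does, even though the index is the same size.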
the number of results anyways? Who makes it to page 5000 when doing a search?
So in the conclusion, the author writes that since Google displayed more results, based on their random test data, it was the superior search engine? That seems so wrong somehow...
Wouldn't a better search engine return fewer, but more appropriate, results? I mean, how many of us have found the information we were actually looking for on page ten or twelve of a search. And, isn't less more, but better? %insert Linux geek laughs here%
One would think that volume of results would not a better search engine make, although it may indicate a larger engine index size; an explicit statement to that effect seems to be missing from the NCSA report.
-Runz
This is just another example in the age old argument of which is better. IMO, the quality of the search results is what matters more than the sheer quantity of information. One relevant find is more valuable than 100 inaccurate results. A test of accuracy might be more valuable and one that would be difficult to engineer. For instance, if I type in a word that has a direct correlating .com domain, that should be the first result (assuming no other words in the title - i.e. "hagrin" brings me my home page as the first result). I am sure a test of accuracy could be further derived from such logic.
The other side of the argument probably relates back to something my fiancee once told me - "Size doesn't matter, but it's the great equalizer when it comes to two guys not knowing what they are doing". Yahoo!, especially since the researchers couldn't perform queries on topics returning more than 1,000 results, may be indexing and crawling deeper into sites or it has a "double dipping" problem.
Either way, I don't see Yahoo! falsely reporting their numbers - I would tend to think that this "study" is highly flawed due to its exclusion of larger result topics, etc.
Hagrin.com
Google not only gave MORE results, it gave BETTER results. The only bad results were some hairsplitting (if largely well meant) from fellow /.ers... (I mentioned Tuva as a suburb of Mongolia, and while it IS a part of the Russian Federation, it is Much More Mongolian than Russian. And if the rising tide of neoNazi scum in Russia get their way, Tuva could easily be cut adrift into the Mongolian/Chinese orbit...but I digress...)
The essential point is: Which Does the Job Better For Me? Google. Therefore, I use Google. Assuming the Copernican position that I am not atypical, I would therefore extrapolate that this is very true for most other people as well. Which means that Yahoo has a LONG way to go and A LOT more work to do.
RS
Shoes for Industry. Shoes for the Dead.
Google only reports "about 4,820,000" entries for Britney Spears, while Yahoo reports "about 67,100,000" entries! This makes Yahoo more than 12 times better than google! Yeah, my methodology is completely fucked up... but then, so is the NCSA's!
I've abandoned my search for truth; now I'm just looking for some useful delusions.
Search: Valerie Plame
Google: 908,000
Yahoo: 2,580,000
Search: "Boulder, Colorado"
Google: 1,600,000
Yahoo: 5,880,000
Search: "Linus Torvalds"
Google: 2,560,000
Yahoo: 5,870,000
I assume it goes on like this. Of course these exceed the 1000 maximum hit limit given in the study.
The interesting thing is that the top three results make no reference to the word failure. Of course it is probably based on pages linking to these three, but I wonder if they should even be included for the lack of the search term?
Jumpstart the tartan drive.
Looking at the first item in their result log, I'm unimpressed.
Yahoo returns 0 results, and Google returns... 4 different links to the ispell dictionary (or variants thereof).
('carbolization clambers')
That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.
Well, there's a worse bias. They're grabbing words from an Ispell word list.
There are websites which contain the Ispell word list. There appear to be more of those returned in Google as results than in Yahoo. (here is one returned in Google for "apprizers expense", but which is not returned in Yahoo.)
This basically contributes a pedestal to their result - they'll never get zero results, because they'll always get the Ispell lists back, and because those results always return the same number (about 8 Google to 1 or 2 Yahoo), you'll bias the results of the entire set to that result.
They needed to remove results which are returned in common to multiple searches, as that's essentially double counting.
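One possible reading of that suggestion, as a rough sketch: for each query, count only the URLs that show up for no other query, so pages like the word lists stop padding every result. The sample data and the function name are invented for illustration and do not come from the study.

```python
from collections import Counter

def unique_result_counts(results_by_query):
    """Map each query to the number of its URLs returned for no other query."""
    seen_in = Counter()
    for urls in results_by_query.values():
        seen_in.update(set(urls))  # count each URL once per query
    return {
        query: sum(1 for url in set(urls) if seen_in[url] == 1)
        for query, urls in results_by_query.items()
    }

# Hypothetical results: two word-list pages match both queries.
sample = {
    "apprizers expense": [
        "wordlist.example/a",
        "wordlist.example/b",
        "site.example/1",
    ],
    "carbolization clambers": [
        "wordlist.example/a",
        "wordlist.example/b",
    ],
}

counts = unique_result_counts(sample)
print(counts)
```

Under this rule the second query drops to zero real hits, which is the "pedestal" effect being removed. A real analysis would need to decide whether sharing across only two queries is enough to discard a URL.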
try it. for example search for "swans" : you got 1 510 000 results, the first one is the SWANS rock band site. search for "swan" then - 8 550 000 results, the first is some SWAN social network - the rockers are not on the first page at all
Deliriant isti Americani.
There's an inherent assumption in the Yahoo claim that more==better. Do I really care if a search returns 1 million results vs 6 million results?
What I care about is actually getting the information I went out to find. There's only a certain amount of hits I'm willing to explore. That's probably on the order of 100-200 or so if I _really_ need the information. The implication by Yahoo is that more hits == better top ranked hits. Is that true? Really what should be done is just compare the top few hundred hits between the two search engines and see how they differ. Those are the only ones that matter anyway.
Where more results might prove useful is obscure searches with less than 100-200 hits. But if this study is true, Yahoo does a worse job on obscure searches than Google.
The problem of course is the type of obscure searches that this study performed. Two random words out of a dictionary just isn't what your typical person conducting a search engine query is looking for.
AccountKiller
Seems to me there is something wrong when a search term lists pages that don't even have the actual word in it.
donkey rhubarb
Once this comment is spidered, it will work towards PETA coming up when people search for "Donkey" and "rhubarb". If you check the cached version of the GW biography, it will say this at the top.
-mkb
1. Assumes that Yahoo's expansion is random. If the increase in Yahoo's pages is not random, then the results may be skewed. For example, Yahoo's expansion may have been mostly, or even entirely, in pages built of common words that all receive more than 1000 hits upon searching.
2. Assumes, as many people have stated, that Yahoo's expansion has been in English, since the study uses an English dictionary for its seeds. If Yahoo has expanded its database in non-English pages with few words that overlap into English, those pages will not show up in the study.
This study essentially determines that Google has a larger database of random, obscure English language words. Consequently, they demonstrate that Google is the superior search engine for finding obscure, random English words.
One additional check that they could have thrown in would be how many of the pages in the links presently deliver 404 errors. That would have been far more interesting to me than how well the search engines do at finding obscure and random English words.
Of course the study also demonstrates that on the searched terms, Yahoo's estimate numbers vastly overestimated the number of available results they actually found. So if the pages from the study are even close to representative in that regard then this would make the numbers you quote utterly meaningless.
Which is the entire reason, of course, why they kept the limits under 1,000 in the first place-- that for any number over 1,000, if the search engine says, say, "I found 2.5 million results for 'Valerie Plame'", you have no way to tell whether it's telling the truth or not.
Irritable, left-wing and possibly humorous bumper stickers and t-shirts
With Google, a page can be returned even when the requested words appear only on other pages that link to it, and not on the returned page itself.
The most basic measure of performance in Information Retrieval is precision vs. recall.
Precision is how many of the results that you return are correct. e.g. If Google returns 100 results and 10 of them are correct, then the precision on that query is 10%.
Recall is how many of the correct results you return. e.g. If Yahoo returns 100 results out of a total 1000 correct matches, then the recall on that query is 10%.
Information retrieval systems such as search engines balance these two metrics -- which are fundamentally at odds with each other -- to give the "best balance" in the eyes of the system's designers.
The NCSA study basically misses the effect this decision would have on perceived size of index.
A simple demonstration shows how it works.
First let's say both search engines have the same index size: 10B pages. Second, let's say both search engines have exactly the same a priori capability for precision and recall, but can tune for a preferred performance. Yahoo decides it wants to favor more precise results over more results recalled, at a 2:1 relative ratio compared to Google.
In that case, any given query will show half the hits from Yahoo as compared to Google. Concluding Yahoo's index to be half the size of Google's, given this result, would be incorrect.
Furthermore, without knowing the precision/recall performance of either system, they can only demonstrate a lower-bound on index size, and that certainly doesn't predict average or max index size.
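The definitions and the demonstration above can be put into a small sketch. Everything concrete here is an assumption for illustration: the 20-document "index", the set of relevant documents, and the two cutoffs stand in for engines tuned differently over identical data.

```python
def precision(returned, relevant):
    """Fraction of returned documents that are relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0

def recall(returned, relevant):
    """Fraction of relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0

# Hypothetical shared index: 20 matching documents, ranked best-first.
ranked = list(range(20))
relevant = {0, 1, 2, 3, 4, 6, 8, 11, 14, 18}

# Two engines over the SAME index, differing only in where they cut off.
strict = ranked[:10]  # tuned toward precision, returns half as many hits
loose = ranked[:20]   # tuned toward recall

for name, hits in (("strict", strict), ("loose", loose)):
    print(name, len(hits), precision(hits, relevant), recall(hits, relevant))
```

The strict engine reports half as many hits as the loose one even though both search the same 20 documents, so the hit count alone says nothing about which index is larger.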
Actually Slashdot prevents robots from spidering its comment pages...
So your point is totally moot...
We all know what to do, but we don't know how to get re-elected once we have done it
it is the reality of the state of the Web. At least as far as Google's formula ranks/weights pages/links.
That second part is the important one. If search results can be manipulated by relatively small groups of people, this can be abused, e.g. for search engine spamming, thereby limiting the usefulness of the search engine.
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger