Slashdot Mirror


NCSA Compares Google and Yahoo Index Numbers

chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own, independent, comparison of the two engines. "

31 of 395 comments

  1. Yahoo pants down, egg on face, no WMD either. by ackthpt · · Score: 3, Interesting
    So the summary is that all but 3% of the time, Yahoo finds fewer pages than Google, and that the 18 bi1110nz Mayer claimed are a number he pulled right out of his own arse.

    Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.

    75% less truth than other leading brand

    --

    A feeling of having made the same mistake before: Deja Foobar
    1. Re:Yahoo pants down, egg on face, no WMD either. by Iriel · · Score: 3, Interesting

      I think it is possible that Yahoo! has more items indexed than Google. It may not be true after all, but one has to give thought to the fact that Yahoo can search subscription based content. That has got to boost their numbers considerably beyond the range of queries that typically return less than one thousand results. It's possible that Yahoo! could have simply been fudging the numbers to get some press now that they're actually starting to get noticed again. I can't make a certain conjecture in either direction, but don't totally discredit Yahoo! without looking into everything.

      --
      Perfecting Discordia
      www.stevenvansickle.com
    2. Re:Yahoo pants down, egg on face, no WMD either. by loose_cannon_gamer · · Score: 5, Insightful
      After reading half the comments on this page, I'm amused at how many alert readers are making the same mistake that they accuse Yahoo of -- misstating results.

      Can we conclude from this study that Google has a bigger index than Yahoo? No. Can we conclude that when you pick two English words that when entered into both Google and Yahoo, both return less than 1000 results, that Google has consistently more results? Yes.

      The real question is, what can we infer from the actual indisputable findings of this study? I find no ready method of generalization. If you are inclined to believe google is better, you feel happy inside. If you think yahoo is better, you have many options to dispute the idea that the study result generalizes to search engine index size.

      As a google fan, I enjoy the warm fuzzies, but I don't see that much to get excited about either way.

      --
      In Soviet Russia, us are belong to all your base.
  2. Accurate results? by bigwavejas · · Score: 5, Interesting
    Google sometimes returns some pretty interesting/ entertaining results.

    Try searching for the word, "failure" in Google and check the results.

    This brings into question *accurate* results. In this case it appears that's left to interpretation.

    --
    "Simplify, simplify, simplify!" Thoreau
    1. Re:Accurate results? by jrallison · · Score: 5, Insightful

      It is odd however the #1 result for failure is a webpage without the word "failure" in it.

    2. Re:Accurate results? by MindStalker · · Score: 4, Insightful

      Well, Google also indexes based on referring links and not just the content of the page itself. So if many websites refer to GW as a failure, GW's page itself will turn up as a high hit. Yahoo does this as well, but doesn't necessarily give it the same weight. This could strongly affect the number of results returned. If Google returns X pages for a search on term "y", many of those pages may not actually mention "y", which gives a larger page count for "y". Yahoo's method, by contrast, will mainly return pages that mention "y" themselves, and possibly add some pages that links describe as including "y". This can vastly alter the count.
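
A minimal sketch of the effect described above, assuming a toy index in which a page can match a term either through its own body text or through the anchor text of links pointing at it. The page URLs, the `inbound_anchor_text` table, and the `matches` helper are all hypothetical; neither engine's real matching rules are public.

```python
# Toy corpus: URL -> body text (hypothetical data).
pages = {
    "a.example": "biography of a president",
    "b.example": "a failure of policy",
    "c.example": "notes on failure analysis",
}

# Anchor text of links pointing at each page (hypothetical data).
inbound_anchor_text = {
    "a.example": ["miserable failure", "failure"],
}


def matches(term, use_anchor_text):
    """Return the set of URLs that match `term`."""
    hits = set()
    for url, body in pages.items():
        words = set(body.split())
        if use_anchor_text:
            # Treat words in inbound link text as if they were on the page.
            for anchor in inbound_anchor_text.get(url, []):
                words.update(anchor.split())
        if term in words:
            hits.add(url)
    return hits


body_only = matches("failure", use_anchor_text=False)
with_anchors = matches("failure", use_anchor_text=True)
print(len(body_only), len(with_anchors))
```

The index is the same three pages in both calls; only the matching rule changes, and the result count changes with it.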

  3. Conclusion by mboverload · · Score: 3, Informative

    "Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results. It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Googles index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google. "
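
The case counts in the quoted conclusion can be checked with a few lines of arithmetic. This sketch uses only the figures quoted above (10,012 test cases split 307 / 9,676 / 29); the variable names are illustrative.

```python
total = 10_012
cases = {"yahoo_more": 307, "google_more": 9_676, "tie": 29}

# The three outcomes should account for every test case.
assert sum(cases.values()) == total

# Share of test cases for each outcome, as a percentage rounded to 0.1.
shares = {name: round(100 * count / total, 1) for name, count in cases.items()}
print(shares)
# {'yahoo_more': 3.1, 'google_more': 96.6, 'tie': 0.3}
```

The counts add up, and the rounded shares agree with the "3%", "96.6%" and "less than 1%" in the quote.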

    1. Re:Conclusion by nutshell42 · · Score: 4, Insightful
      And Nutshell42's New Amazing Search Engine gives you even more results. Even though my index size is only 1.something million. I simply return every single wikipedia article in every language as result no matter what you search.

      Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous. Only a thorough study comparing results and how useful they were (which is hard to do, expensive and time consuming) has any meaning that goes beyond producing lots of funny numbers and percentages.

      96.34% of all percentages are completely useless.

      btw. I use google, not yahoo

      --
      Don't think of it as a flame---it's more like an argument that does 3d6 fire damage
    2. Re:Conclusion by rossifer · · Score: 3, Insightful

      Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous.

      No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.

      What you're talking about is measuring the effectiveness of page ranking, which is a completely different measure of how good a search engine is. Note: Google wins on that measure too.

      Regards,
      Ross

    3. Re:Conclusion by barawn · · Score: 5, Insightful

      No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.

      Actually, it might not be, thanks to their methodology.

      They only used searches with less than 1000 results. They therefore got a lot of searches with small results numbers (because they were searching for bizarre word combinations, like "promotion bedabble"). The total number of results was something like 500,000 or so (order of magnitude) for 10,000 searches. That's an average of 50 results/search, and I'd bet there's a large, large tail, so the most common search is probably something like 10 results.

      The problem with this is that in their word list, the same sites are being returned over and over! For instance, sites containing dictionary lists appear in both "promotion bedabble" and "foliolate defecations" because, duh, that's the only place they'll appear. Since they're just searching the same type of site over and over, they get the same result magnified a lot: Google has more "dictionary lists" in its index than Yahoo. Most of the "dictionary list" word searches returned about 10-20 for Google, and few, if any, for Yahoo.

      It's a pretty serious flaw in the methodology, as far as I can tell - they're double counting huge numbers of results, and so they're not really getting a good statistical sample of the index.
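
The double-counting concern can be sketched as the difference between summing per-query hit counts and counting distinct URLs across all queries. The result lists below are invented for illustration, and `summed_hits` and `unique_hits` are hypothetical helpers, not anything from the NCSA script.

```python
def summed_hits(results_by_query):
    """Total hits, counting a URL once per query it appears in."""
    return sum(len(urls) for urls in results_by_query.values())


def unique_hits(results_by_query):
    """Distinct URLs seen across all queries."""
    return len(set().union(*results_by_query.values()))


# Hypothetical results: engine A has indexed three word-list pages that
# match every obscure word pair; engine B has indexed two unrelated pages.
engine_a = {
    "promotion bedabble": ["dict1.example", "dict2.example", "dict3.example"],
    "foliolate defecations": ["dict1.example", "dict2.example", "dict3.example"],
}
engine_b = {
    "promotion bedabble": ["dict1.example"],
    "foliolate defecations": ["news.example"],
}

summed_ratio = summed_hits(engine_a) / summed_hits(engine_b)
unique_ratio = unique_hits(engine_a) / unique_hits(engine_b)
print(summed_ratio, unique_ratio)
```

With these made-up lists, engine A looks three times larger when hits are summed per query but only one and a half times larger when each URL is counted once, which is the kind of inflation the comment above is worried about.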

  4. They might have a larger index file by BlackCobra43 · · Score: 4, Insightful

    but they can't sift through it nearly as well as Google, so what does it matter? Even if you have a bigger dictionary, if you can't speak English at all it won't do you much good.

    --
    I never spellcheck and I freely admit it. Save your karma for more worthwhile "lol erorrs" replies
  5. Flawed conclusion? by Prong_Thunder · · Score: 5, Insightful

    Sorry, but if Google consistently returns more results, it could just as easily mean that the filtering isn't as good.

    I still prefer Google though.

    1. Re:Flawed conclusion? by Ossifer · · Score: 5, Insightful

      Exactly! I find the conclusions of the research to be quite specious. Yahoo may simply have tighter controls of what is considered a match, which, by the way, is no simple algorithm.

      In any case, I am usually not so interested in the numbers of matches, but in the quality of the list returned--hopefully one website will have exactly what I need...

  6. The results by Swamii · · Score: 4, Interesting
    For those that don't want to read the flippin' article:

    Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.


    In other words, they believe Google indexes more items based on their own tests of searching.
    --
    Tech, life, family, faith: Give me a visit
  7. English Language by morcheeba · · Score: 3, Insightful

    They only used words from the English Ispell word list. Besides the english-language bias, this is probably limited in other ways. News websites use a limited vocabulary, but a lot of proper names -- so if one engine indexed these better, they wouldn't necessarily get a better rating. News sites are also very dynamic and have a large number of webpages, so they would be influential in the count.

  8. Yahoo returns dupes... by Marnhinn · · Score: 3, Insightful

    Yahoo returns a lot of dupes.

    They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.

    All they can really show is that google returns more unique results per 1000 (which usually means that more items are indexed, but could be from Google's Pagerank also)...

    --
    There is always a frontier where there is an open and willing mind
    1. Re:Yahoo returns dupes... by Anonymous Coward · · Score: 5, Funny

      Yahoo returns a lot of dupes.

      If that's the case, then why is Google the darling of slashdot? ;)

    2. Re:Yahoo returns dupes... by NickFortune · · Score: 3, Insightful
      On the other hand, their findings do cast doubt upon Yahoo's claims regarding index size.

      These findings don't do anything of the sort. In fact, Google could have only 999 pages in index, and if it returned all 999 for every query it would have won this test. There's too many assumptions here for the results to be useful.

      'Scuse me: I said "cast doubt upon" not "conclusively disproved".

      If Yahoo's indices are, as they claim, more than twice the size of Google's, then we might reasonably expect them to return more hits for an arbitrary query. That they do not do so suggests that Yahoo may well be telling fibs.

      Yes, there are other explanations, like for example, Google deliberately falsifying all sub 1000 hit queries, as you point out. However, one likely, arguably the most likely explanation is that Yahoo is being a bit sparing with the truth in its press releases.

      Hence "cast doubt upon".

      --
      Don't let THEM immanentize the Eschaton!
  9. Queries with 1,000 results by Whafro · · Score: 3, Interesting

    TFA notes that queries with greater than 1,000 results were dropped from the survey, because Google and Yahoo both truncate their results to 1,000.

    That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.

    Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.

  10. Perl Code by hayro · · Score: 4, Funny

    I don't know about the study but that is the most readable perl code I have seen in a long time.

  11. More please! by 2008 · · Score: 5, Interesting

    This is a great article! I wish there were more like it on slashdot. It's scientific instead of an opinion piece, it has references, it's repeatable. It's also short and very readable, unlike a lot of science papers.

    OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".

    --
    I quit!
  12. Methodology by enjo13 · · Score: 5, Insightful

    The very methodology used in this case seems rather incorrect to me.

    The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as google, searches should return twice as many entries.

    That assumption is flat out incorrect. There are actually multiple problems.

    First, the scope of the search (based on index terms) is really up to the search engine itself. Since each search engine does not return the entire database as search results, it is very much up to the individual search algorithm to determine the depth of entries considered to 'match' a set of terms. That's what is really being reflected in these results.. it is not the overall size of the index, but simply how aggressive the search algorithm is in matching terms to entries.

    Even if the algorithms were identical (same algorithm being run across both indexes), the nature of search does not scale in that way. If Yahoo has, for instance, become more aggressive in indexing message board and forum content, then only searches that play to those subjects should return more results than Google. Since searches are by definition narrowing on a data set, a methodology needs to be developed that more effectively tests the BREADTH of the results more than simply testing the depth.

    --
    Turn s60 photos into awesome videos with mScrapbook for all S60 3rd edition phones!
  13. International Listings by Dominatus · · Score: 4, Insightful

    The study only checked English words. Is it possible that the increase came from Yahoo expanding into more international website markets?

    Just a thought

  14. This is what passes for CS research nowadays? by adrizk · · Score: 5, Insightful

    Seriously. 'We wrote a script and here are the results'? This would take an average Perl programmer what -- 30 minutes of work? Has academic research in computing really sunk to this level?

    Maybe it's not even worth pointing out how badly flawed (and lazy) the underlying assumption of 'twice the results = twice the index size' probably is, as I'm sure we're going to see a few dozen posts to that effect (unless PageRank really means nothing), but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.

  15. Re:What would you want them to return? by Intron · · Score: 4, Insightful

    The top of the page return for Yahoo is

    "Failure on eBay Find failure items at low prices. "

    which illustrates the most important difference between Yahoo and Google.

    --
    Intron: the portion of DNA which expresses nothing useful.
  16. Google parses plurals differently. by WoTG · · Score: 3, Interesting

    Google started treating plurals as the same search about a year ago. Yahoo doesn't. So, if you google for "inkjet printers" and "inkjet printer" you will get the same result set; however, on Yahoo, you will get different results.

    The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
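
A minimal sketch of the plural-folding point, assuming a deliberately naive stemmer that just strips a trailing "s". The three documents, `naive_stem`, and `search` are hypothetical, and real engines use far more careful normalization.

```python
# Tiny fixed index: doc id -> text (hypothetical data).
docs = {
    1: "cheap inkjet printer",
    2: "inkjet printers reviewed",
    3: "laser printers",
}


def naive_stem(word):
    """Strip a trailing 's' from longer words (an oversimplification)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def search(query, fold_plurals):
    """Return ids of docs containing every query word."""
    norm = naive_stem if fold_plurals else (lambda w: w)
    wanted = {norm(w) for w in query.lower().split()}
    return {
        doc_id
        for doc_id, text in docs.items()
        if wanted <= {norm(w) for w in text.lower().split()}
    }


exact_plural = search("inkjet printers", fold_plurals=False)
exact_singular = search("inkjet printer", fold_plurals=False)
folded_plural = search("inkjet printers", fold_plurals=True)
folded_singular = search("inkjet printer", fold_plurals=True)
print(exact_plural, exact_singular, folded_plural, folded_singular)
```

With exact matching, each query finds one document and the two result sets differ. With plurals folded, both queries return the same two documents from the same three-document index.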

  17. More results == better search engine? by RunzWithScissors · · Score: 3, Insightful

    So in the conclusion, the author writes that since Google displayed more results, based on their random test data, it was the superior search engine? That seems so wrong somehow...

    Wouldn't a better search engine return fewer, but more appropriate, results? I mean, how many of us have found the information we were actually looking for on page ten or twelve of a search. And, isn't less more, but better? %insert Linux geek laughs here%

    One would think that volume of results would not a better search engine make, although it may indicate a larger engine index size; an explicit statement to that effect seems to be missing from the NCSA report.

    -Runz

  18. Results of my own study... by Locke2005 · · Score: 4, Funny

    Google only reports "about 4,820,000" entries for Britney Spears, while Yahoo reports "about 67,100,000" entries! This makes Yahoo more than 12 times better than google! Yeah, my methodology is completely fucked up... but then, so is the NCSA's!

    --
    I've abandoned my search for truth; now I'm just looking for some useful delusions.
  19. Proper name samples by jkauzlar · · Score: 5, Interesting
    Let's try a few samples of proper names:

    Search: Valerie Plame
    Google: 908,000
    Yahoo: 2,580,000

    Search: "Boulder, Colorado"
    Google: 1,600,000
    Yahoo: 5,880,000

    Search: "Linus Torvalds"
    Google: 2,560,000
    Yahoo: 5,870,000

    I assume it goes on like this. Of course these exceed the 1000 maximum hit limit given in the study.

  20. Those are estimates by mcc · · Score: 4, Insightful

    Of course the study also demonstrates that on the searched terms, Yahoo's estimate numbers vastly overestimated the number of available results they actually found. So if the pages from the study are even close to representative in that regard then this would make the numbers you quote utterly meaningless.

    Which is the entire reason, of course, why they kept the limits under 1,000 in the first place-- that for any number over 1,000, if the search engine says, say, "I found 2.5 million results for 'Valerie Plame'", you have no way to tell whether it's telling the truth or not.

  21. Holy lack of IR statistics understanding, Batman! by freality · · Score: 4, Interesting

    The most basic measure of performance in Information Retrieval is precision vs. recall.

    Precision is how many of the results that you return are correct. e.g. If Google returns 100 results and 10 of them are correct, then the precision on that query is 10%.

    Recall is how many of the correct results you return. e.g. If Yahoo returns 100 results out of a total 1000 correct matches, then the recall on that query is 10%.

    Information retrieval systems such as search engines balance these two metrics -- which are fundamentally at odds with each other -- to give the "best balance" in the eyes of the system's designers.

    The NCSA study basically misses the effect this decision would have on perceived size of index.

    A simple demonstration shows how it works.

    First let's say both search engines have the same index size: 10B pages. Second, let's say both search engines have exactly the same a priori capability for precision and recall, but can tune for a preferred performance. Yahoo decides it wants to favor more precise results over more results recalled, at a 2:1 relative ratio compared to Google.

    In that case, any given query will show half the hits from Yahoo as compared to Google. Concluding Yahoo's index to be half the size of Google's, given this result, would be incorrect.

    Furthermore, without knowing the precision/recall performance of either system, they can only demonstrate a lower-bound on index size, and that certainly doesn't predict average or max index size.
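
The definitions above translate directly into code. This is a toy sketch, assuming one shared index in which every page has a relevance score and each engine simply picks its own cutoff; the scores, the `relevant` set, and the thresholds are invented for illustration.

```python
def precision(returned, relevant):
    """Fraction of returned results that are correct."""
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned, relevant):
    """Fraction of correct results that were returned."""
    return len(returned & relevant) / len(relevant) if relevant else 0.0


# One shared index of six pages with hypothetical match scores.
scored = {"p1": 0.9, "p2": 0.8, "p3": 0.6, "p4": 0.4, "p5": 0.3, "p6": 0.1}
relevant = {"p1", "p2", "p4"}  # the truly correct pages for this query


def results(threshold):
    """Pages whose score clears the engine's chosen cutoff."""
    return {page for page, score in scored.items() if score >= threshold}


strict = results(0.7)  # engine tuned for precision
loose = results(0.4)   # engine tuned for recall

print(len(strict), precision(strict, relevant), recall(strict, relevant))
print(len(loose), precision(loose, relevant), recall(loose, relevant))
```

Both cutoffs are applied to the same six-page index, yet the strict engine reports half as many hits as the loose one. Counting results therefore measures the tuning choice at least as much as the index size.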