Slashdot Mirror


NCSA Compares Google and Yahoo Index Numbers

chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to 'over 20 billion items', compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own independent comparison of the two engines."

11 of 395 comments

  1. Accurate results? by bigwavejas · · Score: 5, Interesting
    Google sometimes returns some pretty interesting/entertaining results.

    Try searching for the word "failure" in Google and check the results.

    This brings into question *accurate* results. In this case it appears that's left to interpretation.

    --
    "Simplify, simplify, simplify!" Thoreau
    1. Re:Accurate results? by jrallison · · Score: 5, Insightful

      It is odd, however, that the #1 result for failure is a webpage without the word "failure" in it.

  2. Flawed conclusion? by Prong_Thunder · · Score: 5, Insightful

    Sorry, but if Google consistently returns more results, it could just as easily mean that the filtering isn't as good.

    I still prefer Google though.

    1. Re:Flawed conclusion? by Ossifer · · Score: 5, Insightful

      Exactly! I find the conclusions of the research to be quite specious. Yahoo may simply have tighter controls of what is considered a match, which, by the way, is no simple algorithm.

      In any case, I am usually not so interested in the numbers of matches, but in the quality of the list returned--hopefully one website will have exactly what I need...

  3. Re:Yahoo returns dupes... by Anonymous Coward · · Score: 5, Funny

    Yahoo returns a lot of dupes.

    If that's the case, then why is Google the darling of slashdot? ;)

  4. More please! by 2008 · · Score: 5, Interesting

    This is a great article! I wish there were more like it on slashdot. It's scientific instead of an opinion piece, it has references, it's repeatable. It's also short and very readable, unlike a lot of science papers.

    OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".

    --
    I quit!
  5. Methodology by enjo13 · · Score: 5, Insightful

    The very methodology used in this case seems rather incorrect to me.

    The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as google, searches should return twice as many entries.

    That assumption is flat out incorrect. There are actually multiple problems.

    First, the scope of the search (based on index terms) is really up to the search engine itself. Since each search engine does not return the entire database as search results, it is very much up to the individual search algorithm to determine the depth of entries considered to 'match' a set of terms. That's what is really being reflected in these results: not the overall size of the index, but simply how aggressive the search algorithm is in matching terms to entries.

    Even if the algorithms were identical (same algorithm being run across both indexes), the nature of search does not scale in that way. If Yahoo has, for instance, become more aggressive in indexing message board and forum content, then only searches that play to those subjects should return more results than Google. Since searches are by definition narrowing on a data set, a methodology needs to be developed that more effectively tests the BREADTH of the results rather than simply testing the depth.
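A toy sketch of the point above, assuming a hypothetical three-document index and two made-up matching rules (`match_all` and `match_any`). Neither is claimed to be how Google or Yahoo actually match queries; the only point is that one and the same index can yield different result counts depending on how aggressive the matching is.

```python
# Hypothetical toy index (invented for illustration): doc id -> words it contains.
INDEX = {
    "doc1": {"promotion", "bedabble", "word", "list"},
    "doc2": {"promotion", "sale"},
    "doc3": {"bedabble", "dictionary"},
}

def match_all(query, index):
    """Strict rule: a document matches only if it contains every query term."""
    terms = set(query.split())
    return sorted(doc for doc, words in index.items() if terms <= words)

def match_any(query, index):
    """Aggressive rule: a document matches if it contains at least one term."""
    terms = set(query.split())
    return sorted(doc for doc, words in index.items() if terms & words)

strict = match_all("promotion bedabble", INDEX)
loose = match_any("promotion bedabble", INDEX)

# Same index size in both cases; only the matching rule differs.
print(len(INDEX), len(strict), len(loose))
```

Under these assumptions the reported number of matches tracks the matching rule, not the number of documents indexed.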

    --
    Turn s60 photos into awesome videos with mScrapbook for all S60 3rd edition phones!
  6. This is what passes for CS research nowadays? by adrizk · · Score: 5, Insightful

    Seriously. 'We wrote a script and here are the results'? This would take an average Perl programmer what -- 30 minutes of work? Has academic research in computing really sunk to this level?

    Maybe it's not even worth pointing out how badly flawed (and lazy) the underlying assumption of 'twice the results = twice the index size' probably is, as I'm sure we're going to see a few dozen posts to that effect (unless PageRank really means nothing), but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.

  7. Proper name samples by jkauzlar · · Score: 5, Interesting
    Let's try a few samples of proper names:

    Search: Valerie Plame
    Google: 908,000
    Yahoo: 2,580,000

    Search: "Boulder, Colorado"
    Google: 1,600,000
    Yahoo: 5,880,000

    Search: "Linus Torvalds"
    Google: 2,560,000
    Yahoo: 5,870,000

    I assume it goes on like this. Of course these exceed the 1000 maximum hit limit given in the study.
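For a rough sense of scale, the Yahoo-to-Google ratio for the three samples above can be computed directly. This is only a sketch over the hit counts quoted in the comment; the helper name `yahoo_to_google_ratio` is made up here, and the counts are whatever each engine reported, not verified index sizes.

```python
# Hit counts as quoted in the comment above: query -> (Google, Yahoo).
samples = {
    "Valerie Plame": (908_000, 2_580_000),
    '"Boulder, Colorado"': (1_600_000, 5_880_000),
    '"Linus Torvalds"': (2_560_000, 5_870_000),
}

def yahoo_to_google_ratio(google_hits, yahoo_hits):
    """Hypothetical helper: how many Yahoo hits were reported per Google hit."""
    return yahoo_hits / google_hits

ratios = {query: yahoo_to_google_ratio(g, y) for query, (g, y) in samples.items()}

for query, ratio in ratios.items():
    print(f"{query}: Yahoo/Google = {ratio:.2f}")
```

In all three samples Yahoo's reported count is more than twice Google's, the opposite direction from the study's small-result searches.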

  8. Re:Yahoo pants down, egg on face, no WMD either. by loose_cannon_gamer · · Score: 5, Insightful
    After reading half the comments on this page, I'm amused at how many alert readers are making the same mistake that they accuse Yahoo of -- misstating results.

    Can we conclude from this study that Google has a bigger index than Yahoo? No. Can we conclude that when you pick two English words for which both Google and Yahoo return fewer than 1000 results, Google consistently returns more results? Yes.

    The real question is, what can we infer from the actual indisputable findings of this study? I find no ready method of generalization. If you are inclined to believe google is better, you feel happy inside. If you think yahoo is better, you have many options to dispute the idea that the study result generalizes to search engine index size.

    As a google fan, I enjoy the warm fuzzies, but I don't see that much to get excited about either way.

    --
    In Soviet Russia, us are belong to all your base.
  9. Re:Conclusion by barawn · · Score: 5, Insightful

    No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.

    Actually, it might not be, thanks to their methodology.

    They only used searches with fewer than 1000 results. They therefore got a lot of searches with small result counts (because they were searching for bizarre word combinations, like "promotion bedabble"). The total number of results was something like 500,000 or so (order of magnitude) for 10,000 searches. That's an average of 50 results/search, and I'd bet there's a large, large tail, so the most common search is probably something like 10 results.
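The arithmetic above, plus a sketch of why a long tail pulls the typical search well below the average. The 500,000 and 10,000 figures are the order-of-magnitude estimates from the comment; the list `hypothetical_counts` is invented purely for illustration and is not data from the study.

```python
from statistics import mean, median

# Order-of-magnitude figures from the comment, not exact study data.
total_results = 500_000
num_searches = 10_000
average = total_results / num_searches  # results per search

# Hypothetical long-tailed sample, invented for illustration only:
# 90 searches with 10 results each and 10 searches with 410 results each.
hypothetical_counts = [10] * 90 + [410] * 10

sample_mean = mean(hypothetical_counts)      # pulled up by the few large counts
sample_median = median(hypothetical_counts)  # what a typical search looks like

print(average, sample_mean, sample_median)
```

With a distribution shaped like this, the mean matches the 50 results/search figure while the typical search returns only about 10.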

    The problem with this is that in their word list, the same sites are being returned over and over! For instance, sites containing dictionary lists appear in both "promotion bedabble" and "foliolate defecations" because, duh, that's the only place they'll appear. Since they're just searching the same type of site over and over, they get the same result magnified a lot: Google has more "dictionary lists" in its index than Yahoo. Most of the "dictionary list" word searches returned about 10-20 for Google, and few, if any, for Yahoo.

    It's a pretty serious flaw in the methodology, as far as I can tell - they're double counting huge numbers of results, and so they're not really getting a good statistical sample of the index.
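A minimal sketch of the double-counting concern, assuming made-up result lists (the `.example` hostnames and the third query are hypothetical, not taken from the study). Summing hits per query counts the same word-list page once for every query that returns it, whereas counting distinct pages does not.

```python
# Hypothetical results, invented for illustration: query -> URLs returned.
results = {
    "promotion bedabble": ["wordlist-a.example", "wordlist-b.example"],
    "foliolate defecations": ["wordlist-a.example", "wordlist-b.example"],
    "linus torvalds kernel": ["kernel-news.example"],
}

def total_hits(results_by_query):
    """Sum of per-query result counts; repeated pages are counted every time."""
    return sum(len(urls) for urls in results_by_query.values())

def unique_pages(results_by_query):
    """Number of distinct pages seen across all queries."""
    return len({url for urls in results_by_query.values() for url in urls})

total = total_hits(results)
unique = unique_pages(results)
print(total, unique)
```

The gap between the two numbers is the portion of the tally that comes from seeing the same pages again, which is the sampling bias described above.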