Slashdot Mirror


NCSA Issues Disclaimer on Google/Yahoo Study

Jean Veronis writes "NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradict the fact that Yahoo's index would be bigger than Google's: 'Staff at the NCSA noted several issues with the study'. This study, conducted by students, is 'not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA'."

24 of 118 comments

  1. Disclaimer Text by Stanistani · · Score: 5, Interesting

    From http://vburton.ncsa.uiuc.edu/indexsize.html:
    "The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of the following page.

    Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.

    A Comparison of the Size of the Yahoo and Google Indices"

  2. ... so? by DrEldarion · · Score: 2, Interesting

    Okay, changes have been made to it, but the outcome is still the same. Why does this matter, then?

    1. Re:... so? by Anonymous Coward · · Score: 2, Interesting

      Preliminary results (from 7000 test queries) indicates that the results of this verification study confirms the conclusions of this study, but final results are still forthcoming.

      Looks like they're still doing some looking to make sure their results are rock solid, but so far they seem to be. As such, the current state of reality is that Google has a much bigger index of the World Wide Web (or Internet, or whatever you want to call it) than Yahoo. Yahoo may have a bigger index squirreled away somewhere, but they are not making it available to the public via their search. As such, their CEO's (mis)statements about it are very much misleading and in dispute. Anyways, as anyone who searches knows, Google is better.*

      --
      * The preceding was my opinion of the facts. YMMV. Please search responsibly.

  3. /. 503 error by dhasenan · · Score: 2, Interesting

    Off topic...

    Anyone else get 503 errors when trying to reach Slashdot?

    Where do you go to talk about Slashdot being Slashdotted?

    1. Re:/. 503 error by paulius_g · · Score: 2, Informative

      Glad you asked...

      I've been getting 500 errors the whole morning while trying to reach /. But not 503 ones. After one or two page refreshes, it starts working!

    2. Re:/. 503 error by Anonymous Coward · · Score: 5, Funny

      I've been getting 500 errors the whole morning while trying to reach /. But not 503 ones. After one or two page refreshes, it starts working!

      The trick is to refresh as fast as you can, until the bad 500 errors go away.

  4. A crucial issue... by d3m057h3n35 · · Score: 5, Funny

    Also pertinent was the discovery that Yahoo's claims to increased index size were based on the hope that buying products from companies which advertise "longer, thicker index size in two weeks, money-back guarantee, all-natural supplements" would yield actual results.

  5. Wait... by lbmouse · · Score: 5, Funny

    I thought that size didn't matter.

  6. But why publish it? by ChrisF79 · · Score: 2, Insightful

    Although they don't say it in the disclaimer, their action of posting a disclaimer after posting the article screams that they realize the article is flawed. If that's the case, why publish it in the first place? Shouldn't they have had some foresight and left this one on the cutting room floor? Maybe Finance is different, but I remember it being very difficult to get an article published unless it was groundbreaking and free from even minor flaws.

    --
    Finance tutorials and more! Understandfinance
    1. Re:But why publish it? by 'nother+poster · · Score: 3, Insightful

      From the disclaimer I would say that the report was not a university-sanctioned project, but a funtime project for a couple of students. They then published it in a manner that implied that it was official work of the university, or at least sanctioned by the professor. Now, whether the study is right or wrong come peer review, the university wants it known that it wasn't their project. A peer-reviewed research project is much different from throwing together a bad stats class midterm and putting the results on a university server.

  7. Filtering by Spazmania · · Score: 4, Insightful

    Readers can consult the list of search terms provided by the authors, and can see for themselves that, in the vast majority of cases retained (i.e. those with fewer than 1000 results), the results in question are lists and spam.

    I don't know which disturbs me more: The possibility that this is the correct explanation for the discrepancy or the possibility that it isn't.

    It seems to me that the correct solution to filtering results would be to put the "undesirable" results at the bottom of the list, not get rid of them entirely. One man's trash is another man's treasure after all.

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
  8. Covering One's Rear by gkozlyk · · Score: 3, Insightful

    Ah, the good old disclaimer added to cover one's rear. With litigation flying free as newspaper in the wind, one can't be too careful these days.

  9. Why is the disclaimer needed? by frdmfghtr · · Score: 2, Interesting

    I didn't read the article the first time around, so maybe something was changed/removed that prompted the disclaimer. I read the report and couldn't find a single reference to NCSA, except in the URL and in the disclaimer itself.

    Aside from the URL, was there some sort of NCSA association implied or claimed in the original post then removed?

    --
    Government's idea of a balanced budget: take money from the right pocket to balance...oh who am I kidding?
  10. The dark web by SpinyNorman · · Score: 5, Insightful

    The Yahoo vs Google page count methodology of counting numbers of pages returned for various high-response queries seems to be completely ignoring the fact that Yahoo *might be* picking up some of the less highly linked-to "dark web" that Google's PageRank algorithm is going to rate low, and which their crawler may be ignoring.

    This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.

    What'd therefore be relevant and interesting to know isn't how many hundreds of pages Google vs Yahoo get for "my job sucks", but rather how many it gets for "my weevil collection".

    1. Re:The dark web by markov_chain · · Score: 2, Interesting

      Interesting.

      The original search rating papers (Kleinberg's algorithm, PageRank) made the ground-breaking observation that links between pages contained lots of useful information that could be used for ranking in addition to the keywords contained in the pages themselves. This was at a time when websites were mostly personal, and there was an atmosphere of friendly sharing of information where people would link to other sites that they found interesting. However, how much have things changed today? Who still has their own web page? Aren't links mostly commercial? Like you said, there may be valuable sites out there that are getting ignored because of over-reliance on links as a source of reputation.

      Maybe it's time for a "people who went to site A also went to site B" technology. It would require running a client-side traffic monitor that would build these adjacency lists and send them back. If it was open sourced and anonymous, the privacy concerns would be minimal, and it would provide a usage-based source of reputation.
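
To make the adjacency-list idea a bit more concrete, here is a minimal sketch of the aggregation side only. It assumes the client-side monitor has already reported anonymised sessions as plain lists of site names; the function names (`build_covisit_counts`, `also_visited`) and the `*.example` hosts are made up for illustration and do not come from any existing tool.

```python
from collections import defaultdict
from itertools import combinations


def build_covisit_counts(sessions):
    """Count how often two sites show up in the same browsing session."""
    counts = defaultdict(lambda: defaultdict(int))
    for visited in sessions:
        # Deduplicate so repeat visits within one session count once.
        for a, b in combinations(sorted(set(visited)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_visited(counts, site, top_n=3):
    """Sites most often co-visited with `site`, most frequent first."""
    neighbours = counts.get(site, {})
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(neighbours, key=lambda s: (-neighbours[s], s))[:top_n]


# Hypothetical sessions, invented for this example.
sessions = [
    ["a.example", "b.example", "c.example"],
    ["a.example", "b.example"],
    ["b.example", "d.example"],
]
counts = build_covisit_counts(sessions)
print(also_visited(counts, "a.example"))
```

The co-visit counts play the role that incoming links play in link-based ranking: a site with no inbound links can still pick up reputation if people who visit well-regarded sites also visit it.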

      --
      Tsunami -- You can't bring a good wave down!
    2. Re:The dark web by RAMMS+EIN · · Score: 2, Insightful

      ``This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.''

      I personally don't think Google is _excluding_ pages that somehow don't get enough links to them. Typically, good resources will get linked to, and thus taking into account the number of links to a page seems sensible.

      From personal experience, I can't say I have anything to complain about with Google. When I post a new page on my site that includes some word that previously had few hits on Google, it gets to the top of the results within a few days. So, even without many links, the system works. When I search for words that do return many hits, the results I get first are usually the most relevant (provided that I have entered enough words to place everything in proper context; searching for "festival" wouldn't give me the speech synthesis software unless I also included "speech").

      If you are specifically looking for pages that have few links to them, another search engine might be better for you. Or maybe not. Maybe you would be best served by using Google and looking at the last rather than first results. Perhaps it would be a good idea for Google to include an option to invert the ranking?

      --
      Please correct me if I got my facts wrong.
  11. Maybe those pages never were crawled by yahoo. by MushMouth · · Score: 4, Interesting

    It is entirely possible that Yahoo not only crawls more content on a whole, but the percentage of the content that they crawl that is spam is smaller. Crawlers get to spam/SEO pages via other spam/SEO pages, thus if they have better filtering mechanisms they may simply stop earlier and put their (not completely unlimited) resources into crawling more useful data.

  12. trust by dioscaido · · Score: 4, Funny

    If it made it through the Slashdot filters, then the study is good enough for me.

  13. It was not "published" by kaan · · Score: 4, Insightful

    why publish it in the first place?

    Dude, it was never published, it was posted on one web server that is part of the ncsa.uiuc.edu sub-domain (specifically, vburton.ncsa.uiuc.edu). There are probably hundreds of machines that are in this network, and posting something on a web server running there does not equate to NCSA formally publishing an article. What we're talking about here is a web page written by two students, they worked on a project, they wanted to post it for other people to see. So that's what they did, period.

    Stupidly, everyone is claiming that NCSA backed this whole thing, like they (NCSA) are on some crusade to compare Yahoo and Google. But this must be taken for what it is - a project by two students. NCSA's disclaimer is just trying to make this clear for the idiots out there who think that every little thing a student says or does must have been funded, supported, backed, etc. by NCSA.

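  14. Accuracy of Google counts? by xiaomonkey · · Score: 5, Interesting

    Try the following sets of key words on Google:

    lawyer - results 29,300,000
    lawyer lawyer - results 29,300,000
    lawyer lawyer lawyer - results 62,000,000
    lawyer lawyer lawyer lawyer - results 78,600,000

    This trend appears to continue, as seen in that repeating the "lawyer" keyword 10 times results in Google estimating that there are 389,000,000 hits in its index.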

    On Yahoo, this sort of thing doesn't seem to happen as much, but it still does happen. For example, searching for "lawyer" returns 124,000,000 results, and searching for "lawyer lawyer" or "lawyer lawyer" returns 125,000,000 results.

    So, it probably doesn't really make sense to judge the relative size of either index based on the estimated number of hits for any given set of keywords in their index. Right now, Google's numbers look a little more suspect since they seem to vary so greatly just based on the repetition of a keyword. However, the stability of Yahoo's numbers doesn't necessarily mean that they're correct either.
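
For comparison, here is a minimal sketch of what an exact count would do with a repeated keyword. It assumes a toy inverted index with plain AND matching, which is not how either engine is known to compute its estimates; the documents and the names `build_index` and `count_hits` are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def count_hits(index, query):
    """Exact number of documents containing every distinct query term."""
    terms = set(query.lower().split())  # a repeated keyword collapses to one term
    if not terms:
        return 0
    postings = [index.get(term, set()) for term in terms]
    return len(set.intersection(*postings))


# Toy corpus, made up for illustration.
docs = {
    1: "lawyer sues search engine",
    2: "find a lawyer near you",
    3: "weevil collection photos",
}
index = build_index(docs)
print(count_hits(index, "lawyer"), count_hits(index, "lawyer lawyer lawyer"))
```

Under exact matching, repeating a term can never raise the count, so a figure that grows with repetition points to an estimate rather than an actual count of indexed pages.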
  15. This may be true by lcsjk · · Score: 2, Funny

    I understand that Google uses a very efficient compression technology to compress documents before they are indexed, thereby making characters so small that they can only be read with a magnifying glass or microscope.

    In contrast, Yahoo, unless I misunderstand, only compresses the file after it has been indexed. Since only the file is compressed and not the individual characters, they indeed have a larger index file as the study concluded. :)

  16. More thoughts on a better test by freality · · Score: 4, Interesting

    After criticising the study when I first saw it, I now have some constructive ideas on how to perform a better test of the relative search performance of the two engines.

    - Crawler Test

    Do a search of "microsoft site:microsoft.com" in Google, Yahoo and MSN search. Assume that Microsoft knows how to crawl its own site completely and judge the relative strength of Google's and Yahoo's crawlers based on how many of those pages they find. Google easily wins.

    Unfortunately, this test doesn't work with Yahoo as the reference site since Google returns no hits for "yahoo site:yahoo.com". This is very disappointing, as Yahoo is one of the largest sites on the Web. Neither Yahoo nor MSN shares this self-censorship policy.

    Another test of the same kind is "amazon site:amazon.com", since amazon is a very big site which everyone is presumably very interested in crawling well. Unfortunately, amazon doesn't allow this kind of search on themselves via their A9 engine.

    This is an interesting test because it compares the very likely actual size of a site (i.e. Microsoft's reported size for microsoft.com) with the reported size by the second-party search engine. This may be the best 3rd party test of a crawler's Recall on a per-site basis.
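
The arithmetic behind this test is just a ratio, sketched below. It assumes the site's own search gives a trustworthy page count to use as the reference; the counts and engine names are placeholders, not measurements, and `estimated_recall` is a hypothetical helper.

```python
def estimated_recall(engine_count, reference_count):
    """Fraction of the reference site's pages that an engine reports finding."""
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return engine_count / reference_count


# Placeholder numbers for illustration only, not real search results.
reference = 1_000_000  # pages the site's own search reports for itself
reported = {"engine_a": 800_000, "engine_b": 250_000}

recall = {name: estimated_recall(n, reference) for name, n in reported.items()}
ranking = sorted(recall, key=recall.get, reverse=True)
print(ranking, recall)
```

A ratio above 1.0 would mean the engine reports more pages than the site does, which is a hint that at least one of the two counts is an estimate or includes duplicates.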

    - Common Word Test

    Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.

    This is an interesting test because the word "the" is probably at least as uniformly distributed in the source webpage population as the random phrases used in the NCSA test (or possibly more so), plus this word is more likely than any other to occur in every web document (at least in English). These two characteristics mean that finding the most pages with "the" may be the closest approximation of an actual Recall measurement for all sites on the web that can be done without a prior-knowledge testing set.

    - Conclusion

    Google and MSN seem to have high-quality crawlers, whereas Yahoo makes up for it with a much larger current index size. However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with.

    Even assuming these results, Yahoo's bigger index isn't necessarily more desirable. A better crawler can yield a bigger and better index with relatively little work (just add more machines to store it on), while there is no easy fix for the meticulous work needed to ensure a crawler is getting to all the nooks and crannies of the Web.

    Longer-term, it is these caveats that make an Open-Source approach to crawling shine by comparison. Take for example the Nutch search engine. Though it is in its early days, there need be no doubt about its crawling and ranking algorithms or its Precision/Recall tradeoff.

  17. Study: Red Delicious apples != Fuji apples by RunzWithScissors · · Score: 3, Interesting

    I got flamed for proposing this theory when the article was first posted on /.

    One major problem with the study, not really addressed by this article on its problems, is one of comparison. Yes, Yahoo! and Google are both search engines, but they perform their searches differently and, more importantly, use different criteria when returning matches! It is quite possible that doing the exact same search in both yields different results. Why? Because the two search engines have different criteria for providing output, or matches if you will, from said search. Perhaps Yahoo! does indeed have more pages indexed, but because of their search algorithm, or the program which displays results to the user, fewer matches are provided, even though more pages were looked at.

    I'm not saying that Yahoo! does have more or fewer pages than Google. But the study that was executed and published did not account for many of the differences between the products they were comparing. The above is merely another interpretation of said results. I'm glad that some folks at NCSA agree and provided some clarification, lest we get another urban myth like storks bringing babies.

    -Runz

  18. Re:Accuracy of Google counts? (oblig.) by CycleMan · · Score: 2, Funny

    lawyer - results 29,300,000
    lawyer lawyer - results 29,300,000
    lawyer lawyer lawyer - results 62,000,000
    lawyer lawyer lawyer lawyer - results 78,600,000

    lawyer lawyer lawyer lawyer
    lawyer lawyer lawyer lawyer
    lawyer lawyer lawyer lawyer
    LAW SUIT LAW SUIT!

    lawyer lawyer lawyer lawyer
    lawyer lawyer ...