NCSA Issues Disclaimer on Google/Yahoo Study
Jean Veronis writes "NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradicted the fact that Yahoo's index size would be bigger than Google's: ' Staff at the NCSA noted several issues with the study'. This study conducted by students is 'not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA'. "
I'm not sure whether the claimed market penetration is really correct as it contradicts the Gartner studies from 2002 and 2004.
--
My dong vary long.
From http://vburton.ncsa.uiuc.edu/indexsize.html:
"The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of the following page.
Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.
A Comparison of the Size of the Yahoo and Google Indices "
You can't talk about Wikipedia's flaws on Wikipedia
Okay, changes have been made to it, but the outcome is still the same. Why does this matter, then?
Off topic...
Anyone else get 503 errors when trying to reach Slashdot?
Where do you go to talk about Slashdot being Slashdotted?
Also pertinent was the discovery that Yahoo's claims to increased index size were based on the hope that buying products from companies which advertise "longer, thicker index size in two weeks, money-back guarantee, all-natural supplements" would yield actual results.
I thought that size didn't matter.
They probably almost got their ass sued, hah, hah...
They asked for it... Within days (ok, maybe weeks) of Yahoo's announcement they think up, prepare and conduct a "study". Riiight.
Unfortunately that's not a CVS tree that one can do updates and send diffs as they please.
And the bozos used the university site to publish such a mumbo-jumbo study. Very professional!
Although they don't say it in the disclaimer, their action of posting a disclaimer after posting the article screams that they realize the article is flawed. If that's the case, why publish it in the first place? Shouldn't they have had some foresight and left this one on the cutting room floor? Maybe Finance is different, but I remember it being very difficult to get an article published unless it was groundbreaking and free from any minor flaws.
Finance tutorials and more! Understandfinance
Readers can consult the list of search terms provided by the authors, and can see for themselves that, in the vast majority of cases retained (i.e. those with fewer than 1000 results), the results in question are lists and spam.
I don't know which disturbs me more: The possibility that this is the correct explanation for the discrepancy or the possibility that it isn't.
It seems to me that the correct solution to filtering results would be to put the "undesirable" results at the bottom of the list, not get rid of them entirely. One man's trash is another man's treasure after all.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Everybody seems to fear Google now that it is sooo big.
The temptations to become more directive will begin to creep up!
The fact that a study conducted by students got mention on /. is impressive. Usually, most works done by students are ignored as class exercises. Now "retracted" can be added to the list.
Ah, the good old disclaimer added to cover one's rear. With litigation flying free as newspaper in the wind, one can't be too careful these days.
I didn't read the article the first time around, so maybe something was changed/removed that prompted the disclaimer. I read the report and couldn't find a single reference to NCSA, except in the URL and in the disclaimer itself.
Aside from the URL, was there some sort of NCSA association implied or claimed in the original post then removed?
Government's idea of a balanced budget: take money from the right pocket to balance...oh who am I kidding?
The Yahoo vs Google page count methodology of counting numbers of pages returned for various high-response queries seems to be completely ignoring the fact that Yahoo *might be* picking up some of the less highly linked-to "dark web" that Google's page rank algorithm is going to rate low, and which their crawler may be ignoring.
This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.
What'd therefore be relevant and interesting to know isn't how many hundreds of pages Google vs Yahoo get for "my job sucks", but rather how many they get for "my weevil collection".
It is entirely possible that Yahoo not only crawls more content on the whole, but that a smaller percentage of the content they crawl is spam. Crawlers get to spam/SEO pages via other spam/SEO pages, thus if they have better filtering mechanisms they may simply stop earlier and put their (not completely unlimited) resources into crawling more useful data.
This text is the content of the first link of the news post. It's fucking REDUNDANT, not interesting.
If you want a 503 error, try it 3 more times.
I think the point is that copies of the ispell dictionary and spam are repetitive, which are normally not included with search results. Why do you need more than one copy of an identical result?
If it made it through the Slashdot filters, then the study is good enough for me.
why publish it in the first place?
Dude, it was never published, it was posted on one web server that is part of the ncsa.uiuc.edu sub-domain (specifically, vburton.ncsa.uiuc.edu). There are probably hundreds of machines that are in this network, and posting something on a web server running there does not equate to NCSA formally publishing an article. What we're talking about here is a web page written by two students, they worked on a project, they wanted to post it for other people to see. So that's what they did, period.
Stupidly, everyone is claiming that NCSA backed this whole thing, like they (NCSA) are on some crusade to compare Yahoo and Google. But this must be taken for what it is - a project by two students. NCSA's disclaimer is just trying to make this clear for the idiots out there who think that every little thing a student says or does must have been funded, supported, backed, etc. by NCSA.
If I am searching japanese or chinese content, I'd go to Yahoo instead of Google.
what
I think you're right, the original article had no visible association with NCSA other than the url. But this is just like the classic telephone game: I tell you something, you repeat it to somebody else with a minor addition/change, then that person tells somebody else, etc. By the time it goes 4 or 5 hops, it's been totally twisted around, and my original message has turned into something idiotic, and everyone thinks I said it. This is exactly what happened here, because it started showing up on blogs, and then news sites started writing about it.
Overall, everyone was more or less accurate with regard to the article's details and results, etc., but the fact that this was just a single web page posted on a single web server in the ncsa.uiuc.edu subdomain was lost on everyone. People did not carry that important detail along, and over time it morphed into something else. Pretty quickly, we started seeing articles like "NCSA Compares Google and Yahoo Index Numbers" appear on Slashdot, which is hugely popular, and suddenly the whole world thought that the National Center for Supercomputing Applications was on a crusade to figure out which search engine is better. Hence the disclaimer from NCSA to formally tell the world that this "article" was "published" by two students, and that's all.
Your wish fulfilled:
Google: Approx 47,100
Yahoo: Approx 258,000
Since both sites will only give access to the first 1000 hits (a major gripe of mine), WTF good is it?
I want to be able to see hit # 441,874,356 if I want. I may not be looking for the winner of a popularity contest.
I, mean, like, cool, dude!
Ooh, you have a low Slashdot ID, yes you do, ooh!
Don't link to slashdot!!!!
It leads to the nerd wannabes (85% of /.'s audience) clicking the link, making slashdot slashdot itself.
Link to the Goooooogle cache instead. I mean, nobody will miss Gooooooooogle when it gets slashdotted!
FFS!
Shut up about Google already!
Google (the 'Do No Evil' tm.) Corp is becoming as bad as MS as every day goes by. I predict they will get even worse before trends and hubris bring them down.
Witness the A and B share debacle where the Google Gods get 10x more voting rights than Joe Shareholder.
Google sucks sucks sucks!!!
I don't see why this was modded down. It really took away from the summary. Granted, the sentence makes no sense WITHOUT the error.
I know it's been said before, but you cannot just measure search engines based on volume of hits returned. Clearly, when you get into the millions, it doesn't hurt the results to prune some crap off the end, and I'm sure they're both doing things -- either one could easily focus a little on breadth of hits per query and jump past the other.
Important thing to note: The general principle is MORE COMPLEX than "find all pages containing this term". You can ADD terms and get MORE hits.
As an example and as a thing to keep in mind, witness:
Results 1 - 10 of about 298,000 for robot dance research
Results 1 - 10 of about 970,000 for robot dance research ME
xkcd.com - a webcomic of mathematics, love, and language.
DISCLAIMER: This comment is influenced by Colt 45 malt liquor...
Big deal about what some other corp. says. This is a Joe Schmoe study conducted by college students. This means it's an independent, non-funded (therefore non-corp influenced) study. Too bad they have seemingly been coerced into changing some things in their article. *sigh* Why can't they ever stick to their guns??
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
WTF is NCSA?
North Carolina Space Administration? (Well, the one based in Florida isn't doing so well these days...)
Northern Canada Soccer Association?
lawyer - results 29,300,000
lawyer lawyer - results 29,300,000
lawyer lawyer lawyer - results 62,000,000
lawyer lawyer lawyer lawyer - results 78,600,000
This trend appears to continue: repeating the "lawyer" keyword 10 times results in Google estimating that there are 389,000,000 hits in its index. On Yahoo, this sort of thing doesn't seem to happen as much, but it still does happen. For example, searching for "lawyer" returns 124,000,000 results, and searching for "lawyer lawyer" or "lawyer lawyer" returns 125,000,000 results.
So, it probably doesn't really make sense to judge the relative size of either index based on the estimated number of hits for any given set of keywords in their index. Right now, Google's numbers look a little more suspect since they seem to vary so greatly just based on the repetition of a keyword. However, the stability of Yahoo's numbers doesn't necessarily mean that they're correct either.
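To put a number on how much the estimates move, here is a rough Python sketch that divides each reported hit count by the count for the single keyword. The figures are the ones quoted in this comment; the function name `inflation` and the dictionary layout are just made up for illustration, and the Yahoo entry assumes the 124,000,000 figure is for the single keyword.

```python
# Estimated hit counts quoted above, keyed by how many times "lawyer" is repeated.
google = {1: 29_300_000, 2: 29_300_000, 3: 62_000_000, 4: 78_600_000, 10: 389_000_000}
yahoo = {1: 124_000_000, 2: 125_000_000}


def inflation(estimates):
    """Return each estimate as a multiple of the single-keyword estimate."""
    base = estimates[1]
    return {n: round(count / base, 2) for n, count in sorted(estimates.items())}


if __name__ == "__main__":
    for name, estimates in (("Google", google), ("Yahoo", yahoo)):
        for repeats, factor in inflation(estimates).items():
            print(f"{name}: keyword x{repeats} -> {factor}x the single-keyword estimate")
```

If the estimate reflected a fixed set of matching pages, every factor would stay close to 1. A factor that grows with repetition suggests the number is an estimate that depends on the query, not a count of pages in the index.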
In contrast, Yahoo, unless I misunderstand, only compresses the file after it has been indexed. Since only the file is compressed and not the individual characters, they indeed have a larger index file as the study concluded. :)
After criticising the study when I first saw it, I now have some constructive ideas on how to perform a better test of the relative search performance of the two engines.
- Crawler Test
Do a search of "microsoft site:microsoft.com" in Google, Yahoo and MSN search. Assume that Microsoft knows how to crawl its own site completely and judge the relative strength of Google's and Yahoo's crawlers based on how many of those pages they find. Google easily wins.
Unfortunately, this test doesn't work with Yahoo as the reference site since Google returns no hits for "yahoo site:yahoo.com". This is very disappointing, as Yahoo is one of the largest sites on the Web. Neither Yahoo nor MSN shares this self-censorship policy.
Another test of the same kind is "amazon site:amazon.com", since amazon is a very big site which everyone is presumably very interested in crawling well. Unfortunately, amazon doesn't allow this kind of search on themselves via their A9 engine.
This is an interesting test because it compares the very likely actual size of a site (i.e. Microsoft's reported size for microsoft.com) with the reported size by the second-party search engine. This may be the best 3rd party test of a crawler's Recall on a per-site basis.
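For anyone who wants to try the crawler test, a minimal Python sketch of the per-site Recall calculation follows. It assumes the site's own search gives the reference page count; the engine names and all the counts are hypothetical placeholders, not measured results, and `site_recall` is just an illustrative helper.

```python
def site_recall(engine_count: int, reference_count: int) -> float:
    """Fraction of the reference page count that an engine reports for a site.

    Capped at 1.0, since an engine's estimate can exceed the reference count.
    """
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    return min(engine_count / reference_count, 1.0)


# Hypothetical numbers for illustration only.
reference = 1_000_000  # pages the site's own search reports
engines = {"engine_a": 800_000, "engine_b": 250_000}

recalls = {name: site_recall(count, reference) for name, count in engines.items()}
best = max(recalls, key=recalls.get)

if __name__ == "__main__":
    for name, recall in recalls.items():
        print(f"{name}: recall {recall:.0%}")
    print(f"best crawler on this site: {best}")
```

This only says something about one site at a time, and it inherits whatever error is in the reported counts, so it is best repeated over several large reference sites.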
- Common Word Test
Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.
This is an interesting test because the word "the" is probably as uniformly distributed in the source webpage population as the random phrases used in the NCSA test (or possibly more so), plus this word is more likely than any other to occur in every web document (at least in English). These two characteristics mean that finding the most pages with "the" may be the closest approximation of an actual Recall measurement for all sites on the web that can be done without a prior-knowledge testing set.
- Conclusion
Google and MSN seem to have high-quality crawlers, whereas Yahoo makes up for it with a much larger current index size. However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with.
Even assuming these results, Yahoo's bigger index isn't necessarily more desirable. A better crawler can yield a bigger and better index with relatively little work (just add more machines to store it on), while there is no easy fix for the meticulous work needed to ensure a crawler is getting to all the nooks and crannies of the Web.
Longer-term, it is these caveats that make an Open-Source approach to crawling shine by comparison. Take for example the Nutch search engine. Though it is in its early days, there need be no doubt about its crawling and ranking algorithms or its Precision/Recall tradeoff.
Felonies for the whole lot of'em!
Oh, wait. Which students were these?
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
I got flamed for proposing this theory when the article was first posted on /.
One major problem with the study, not really addressed by this article on its problems, is one of comparison. Yes, Yahoo! and Google are two search engines, but they perform their searches differently, and more importantly, use different criteria when returning matches! It is quite possible that doing the exact same search in both yields a difference in results. Why? Because the two search engines have different criteria for providing output, or matches if you will, from said search. Perhaps Yahoo! does indeed have more pages indexed, but because of their search algorithm, or their program which displays results to the user, fewer matches are provided, even though more pages were looked at.
I'm not saying that Yahoo! does have more or fewer pages than Google. But the study that was executed and published did not account for many of the differences between the products they were comparing. The above is merely another interpretation of said results. I'm glad that some folks at NCSA agree and provided some clarification, lest we get another urban myth like storks bringing babies.
-Runz
lawyer lawyer - results 29,300,000
lawyer lawyer lawyer - results 62,000,000
lawyer lawyer lawyer lawyer - results 78,600,000
lawyer lawyer lawyer lawyer
lawyer lawyer lawyer lawyer
lawyer lawyer lawyer lawyer
LAW SUIT LAW SUIT!
lawyer lawyer lawyer lawyer ...
lawyer lawyer
Yeah, I see that now too. I must have mistyped. My apologies to Google for publicly questioning their editorial policy without merit!
I took a philosophy class with Matt Cheney at the University of Illinois. Let me just say for the record that he is a douchebag. I am really not surprised that he tried to pass off this study under the auspices of NCSA. I'm just glad to see that someone called him on this.
I also think the GP should not have been modded down. However, I think the sentence would make much more sense without the error.
"NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradicted the fact that Yahoo's index size would be bigger than Google's"
What about "...seemed to contradict..." or plain "...that contradicted the fact..."? But using "...seemed to contradicted..." is just plain wrong.
Google has most of the web covered. While obeying robots.txt and such, they can't index much more meaningful content. So how did Yahoo almost triple Google's goal? Well, as long as you're looking for obvious stuff with "easy hits", the results will be similar. But if you enter REALLY obscure stuff, for which Google shows 3-5 hits, Yahoo will show the same 3-5 hits and 15 others, which are all different variants of 404, pages pointed to through broken links. Simply put, 2/3 of Yahoo's index is "404 not found" pages, and that's how it gets such huge numbers...
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
Maybe Mozilla/Firefox could work with Google to implement this type of feedback system...
That already exists as stumbleupon.com and del.icio.us.
well done!
Count the "coulds" and the "mights" in your post and agree with me, that NCSA's method can not be used to conclusively compare the sizes of Yahoo!'s and Google's indexes...
In Soviet Washington the swamp drains you.
search terms: fencing foil sabre timings milliseconds interval
motivation: the fencing federation recently changed the timings on the electronic scoring equipment.
yahoo: http://tinyurl.com/8tutq one page, but it's what I wanted.
google: seven pages, all junk. http://tinyurl.com/a9hd9