NCSA Issues Disclaimer on Google/Yahoo Study

Important detail in the study. by Anonymous Coward · 2005-08-22 04:21 · Score: 0

I'm not sure whether the claimed market penetration is really correct as it contradicts the Gartner studies from 2002 and 2004.

--
My dong vary long.

Disclaimer Text by Stanistani · 2005-08-22 04:24 · Score: 5, Interesting

From http://vburton.ncsa.uiuc.edu/indexsize.html:
"The following study was completed by two of Professor Vernon Burton's students at the University of Illinois. Though one of the students previously worked with Professor Burton at the National Center for Supercomputing Applications (NCSA), the study was done outside the scope of any NCSA core projects. When first published online, staff at the NCSA noted several issues with the study, and some revisions have been made to the document to reflect several of these concerns. Changes are detailed at the bottom of the following page.

Please note again that this study is not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA.

A Comparison of the Size of the Yahoo and Google Indices"

--
You can't talk about Wikipedia's flaws on Wikipedia

Re:Disclaimer Text by Anonymous Coward · 2005-08-22 06:29 · Score: 0

In my opinion, they were just a couple of geeks trying to get some attention from Google, in hope of a job offer ;) .. or maybe not. But their study is totally biased towards Google.
Re:Disclaimer Text by klept · 2005-08-22 11:17 · Score: 1

Ok, so NCSA claims to not be associated with this paper, but several changes have been made to reflect several concerns of NCSA staff. So which is it, NCSA, are you involved or not? And from the fact that changes were made to reflect your concern, it sounds like you were involved. It also sounds like you pissed off Google.

... so? by DrEldarion · 2005-08-22 04:24 · Score: 2, Interesting

Okay, changes have been made to it, but the outcome is still the same. Why does this matter, then?

Re:... so? by Anonymous Coward · 2005-08-22 04:48 · Score: 2, Interesting

Preliminary results (from 7000 test queries) indicates that the results of this verification study confirms the conclusions of this study, but final results are still forthcoming.

Looks like they're still doing some looking to make sure their results are rock solid, but so far they seem to be. As such, the current state of reality is that Google has a much bigger index of the world wide web (or Internet, or whatever you want to call it) than Yahoo. Yahoo may have a bigger index squirreled away somewhere, but they are not making it available to the public via their search. As such, their CEO's (mis)statements on the subject are very much misleading and in dispute. Anyways, as anyone who searches knows, Google is better.*

--
* The preceding was my opinion of the facts. YMMV. Please search responsibly.
Re:... so? by mi · 2005-08-22 05:16 · Score: 1, Interesting

The whole method seems flawed. Trying to compare the sizes of two sets by the sizes of various subsets makes sense only if the method of selecting the subsets is the same.
This is not the case. The methods depend on each search engine's algorithms and are very likely to differ greatly.
In any case, whether a particular query returns 40 results or 40000 does not matter -- only the first 20 are ever of any use...

--
In Soviet Washington the swamp drains you.
Re:... so? by Anonymous Coward · 2005-08-22 06:17 · Score: 0

Well, their sorting algorithms might differ, but their filtering algorithms might not differ all that much. So I would hold back on the bold font.

Differences in filtering might be caused by differences in stemming and tokenization. That could be checked by looking at the search results and checking whether the specific URLs found in one search engine can be found in the other.
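
A rough sketch of that cross-check, assuming you have already saved the result URLs from each engine into two lists. The example URLs and the `normalize` rule are made up for illustration; they are not anything from the study.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to a crude comparable form (hypothetical rule:
    drop the scheme, a leading "www.", and any trailing slash)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def overlap(results_a, results_b):
    """Split two result lists into URLs seen by one engine, the other, or both."""
    a = {normalize(u) for u in results_a}
    b = {normalize(u) for u in results_b}
    return {"only_a": a - b, "only_b": b - a, "both": a & b}

# Placeholder result lists for the same query on two engines
engine_a = ["http://www.example.com/weevils/", "http://example.org/beetles"]
engine_b = ["http://example.com/weevils", "http://example.net/bugs"]

report = overlap(engine_a, engine_b)
for bucket, urls in report.items():
    print(bucket, sorted(urls))
```

If most of one engine's "only" bucket turns out to be spam or word lists, that would point to filtering rather than index size, but that is a judgment you would still have to make by eye.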

/. 503 error by dhasenan · 2005-08-22 04:25 · Score: 2, Interesting

Off topic...

Anyone else get 503 errors when trying to reach Slashdot?

Where do you go to talk about Slashdot being Slashdotted?

Re:/. 503 error by paulius_g · 2005-08-22 04:38 · Score: 2, Informative

Glad you asked...

I've been getting 500 errors the whole morning while trying to reach /. But not 503 ones. After one or two page refreshes, it starts working!

--
The hip way to get your IP. No ads, ever.
Re:/. 503 error by Anonymous Coward · 2005-08-22 04:47 · Score: 5, Funny

I've been getting 500 errors the whole morning while trying to reach /. But not 503 ones. After one or two page refreshes, it starts working!

The trick is to refresh as fast as you can, until the bad 500 errors go away.
Re:/. 503 error by paulius_g · 2005-08-22 04:50 · Score: 1

Yeah well, don't make me refresh YOUR site!

--
The hip way to get your IP. No ads, ever.
Re:/. 503 error by StarvingSE · 2005-08-22 05:03 · Score: 1

In Soviet Russia, the sites refresh you!

--
I got nothin'
Re:/. 503 error by paulius_g · 2005-08-22 05:08 · Score: 1

In Northern Siberia, you are refreshed until you are slashdotted!

--
The hip way to get your IP. No ads, ever.
Re:/. 503 error by Analog+Squirrel · 2005-08-22 05:34 · Score: 1

Yeah, I got one this morning, too - although it is a pretty rare occurrence for me - and the thought of "slashdot getting slashdotted" did go through my mind... and I found a bit of perverse humor in it. :-)

--
I'd rather be flying
Re:/. 503 error by antdude · 2005-08-22 06:09 · Score: 1

I noticed their comments were not showing this morning either. I think they were related.

--
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
Re:/. 503 error by yiantsbro · 2005-08-22 06:52 · Score: 1

"Where do you go to talk about Slashdot being Slashdotted?"

First rule of /. --- you don't talk about /.

A crucial issue... by d3m057h3n35 · 2005-08-22 04:26 · Score: 5, Funny

Also pertinent was the discovery that Yahoo's claims to increased index size were based on the hope that buying products from companies which advertise "longer, thicker index size in two weeks, money-back guarantee, all-natural supplements" would yield actual results.

Wait... by lbmouse · 2005-08-22 04:28 · Score: 5, Funny

I thought that size didn't matter.

Re:Wait... by theotherlight · 2005-08-22 04:36 · Score: 1

That's only what /.'ers tell themselves to feel better...

The truth is: size is everything.

--
The cat's in the bag and the bag's in the river.
Re:Wait... by Anonymous Coward · 2005-08-22 04:49 · Score: 0

it's not the size of the index, but how you use it :)
Re:Wait... by thegamerformelyknown · 2005-08-22 05:06 · Score: 1, Funny

Actually, it's not how big it is, but what you do with it:)

I think that applies to both situations too...

--
Foxed Design
Re:Wait... by Analog+Squirrel · 2005-08-22 05:40 · Score: 1

"Both"? You mean there's more than 1?

--
I'd rather be flying
Re:Wait... by thegamerformelyknown · 2005-08-22 06:02 · Score: 0

I meant the topic at hand and the other one from which both comments are derived.

--
Foxed Design
Re:Wait... by chrisvdb · 2005-08-22 06:35 · Score: 1

I thought that size didn't matter.

That's what they say to you ;-)
Re:Wait... by Barkmullz · 2005-08-22 08:33 · Score: 1

I thought that size didn't matter.

It's not the wand, it's the wizard...

--
Ronald said nothing. He flung himself from the room, flung himself upon his horse, and rode madly off in all directions.

Hah, hah by Donny+Smith · 2005-08-22 04:31 · Score: 0, Flamebait

They probably almost got their ass sued, hah, hah...

They asked for it... Within days (ok, maybe weeks) of Yahoo's announcement they think up, prepare and conduct a "study". Riiight.

Unfortunately that's not a CVS tree where one can do updates and send diffs as one pleases.
And the bozos used the university site to publish such mambo-jumbo study. Very professional!

Re:Hah, hah by Fr05t · 2005-08-22 04:36 · Score: 1

"such mambo-jumbo study. Very professional!"

I'll have you know Mambo is very professional. If you meant Mumbo, then yeah those guys are a bunch of deadbeats.
Re:Hah, hah by Anonymous Coward · 2005-08-22 06:43 · Score: 0

Are you drunk or stupid, or both?

Sued for what? They never made hard claims that far outside the limits of the study itself. If you have a problem with their methodology, that's fine, but you don't point out any problems at all. The point of publishing a study is for the review of that study; even well-run scientific studies done over years by the best have flaws (what do you think the discussion section of most research papers addresses?).

You suggest that they asked for it... So you're saying people should be afraid of publishing studies which question a large corporation's claim? That suing them is a deserved outcome? You're a freaking loser if you believe this is the proper way to handle things.

The study was published shortly after the Yahoo announcement....so what? People think up, devise, and implement studies in a matter of seconds and minutes sometimes. What the supposed-now-non-NCSA study did wasn't genius. It was practical, simple, and to the point. Flawed imo, but it was NOT difficult to carry out. Or, maybe it was for someone of your (cough) caliber.

The NCSA simply wants to distance themselves for their reasons, deserved or not, from this study. I don't care what the real underlying reasons are myself, because I read the study for what it was and evaluated it ON MY OWN for MY OWN NEEDS.

But the fact that you, a person who neither contributes to the discussion nor has anything of critical value to say about the failings of the study, take such a litigious attitude and such glee in the backpedaling by the NCSA is probably the real reason why NCSA has to bother with such distancing in the first place.

But why publish it? by ChrisF79 · 2005-08-22 04:34 · Score: 2, Insightful

Although they don't say it in the disclaimer, their action of posting a disclaimer after posting the article screams that they realize the article is flawed. If that's the case, why publish it in the first place? Shouldn't they have had some foresight and left this one on the cutting room floor? Maybe Finance is different, but I remember it being very difficult to get an article published unless it was groundbreaking and free from any minor flaws.

--
Finance tutorials and more! Understandfinance

Re:But why publish it? by Anonymous Coward · 2005-08-22 04:41 · Score: 0

Uh, in this case, "publishing" == "put up a web page summarizing the results of an experiment".

And I'm pretty sure that's not very difficult to do, even for you finance guys. :-)

If you're looking for someone to have "foresight" and the ability to determine that work is free from minor flaws before publicizing it, then, well, don't come to slashdot for your news.
Re:But why publish it? by 'nother+poster · 2005-08-22 04:54 · Score: 3, Insightful

From the disclaimer I would say that the report was not a university sanctioned project, but a funtime project for a couple of students. They then published it in a manner that implied that it was official work of the university, or at least sanctioned by the professor. Now, whether the study is right or wrong come peer review, the university wants it known that it wasn't their project. A peer reviewed research project is much different than throwing together a bad stats class midterm and putting the results on a university server.
Re:But why publish it? by Anonymous Coward · 2005-08-22 05:16 · Score: 0

Hey, just out of curiosity, you don't happen to be in FINANCE, do you? I mean, I kinda get the feeling you might be in FINANCE. Maybe it's something about your comment, or maybe your sig that says, "Hey, I'm in FINANCE!" I mean, do you want people to know you're in FINANCE, because if you do, you might want to make it a little more clear you're in FINANCE.
Re:But why publish it? by hackstraw · 2005-08-22 05:45 · Score: 1

Although they don't say it in the disclaimer, their action of posting a disclaimer after posting the article screams that they realize the article is flawed. If that's the case, why publish it in the first place? Shouldn't they have had some foresight and left this one on the cutting room floor? Maybe Finance is different, but I remember it being very difficult to get an article published unless it was groundbreaking and free from any minor flaws.

Yes, the web page was lacking in methodology and had a number of possible confounds, but it was just a web page. Unless I missed something it was not an article nor was it published, but rather one of millions of worthless web pages out there on the internet.

Even the slashdot summary states:

This study conducted by students is 'not an NCSA publication and was not conducted as part of any NCSA project or under the supervision of NCSA'.

Maybe I'm underreacting.
Re:But why publish it? by monkeydo · 2005-08-22 09:18 · Score: 1

Yes, you are underreacting. Did you miss the original Slashdot posting:
NCSA Compares Google and Yahoo Index Numbers
chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own, independent, comparison of the two engines. "

Notice that the summary was submitted by a well known Google employee, and that it states the study was conducted by the NCSA.

--
Si vis pacem, para bellum
The only thing more annoying than a Libertarian is an (un|mis)informed Libertarian

Filtering by Spazmania · 2005-08-22 04:35 · Score: 4, Insightful

Readers can consult the list of search terms provided by the authors, and can see for themselves that, in the vast majority of cases retained (i.e. those with fewer than 1000 results), the results in question are lists and spam.

I don't know which disturbs me more: The possibility that this is the correct explanation for the discrepancy or the possibility that it isn't.

It seems to me that the correct solution to filtering results would be to put the "undesirable" results at the bottom of the list, not get rid of them entirely. One man's trash is another man's treasure after all.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

Everybody seems to fear Google! by Anonymous Coward · 2005-08-22 04:38 · Score: 0

Everybody seems to fear Google now that it is sooo big.

The temptations to become more directive will begin to creep up!

A Study By Students... by __aaclcg7560 · 2005-08-22 04:39 · Score: 1

The fact that a study conducted by students got a mention on /. is impressive. Usually, most work done by students is ignored as class exercises. Now "retracted" can be added to the list.

Re:A Study By Students... by Anonymous Coward · 2005-08-22 04:54 · Score: 0

Studies conducted by PhD students are published all the time. The wording of the disclaimer makes me think these were PhD or masters students.
Re:A Study By Students... by BitchKapoor · 2005-08-22 09:16 · Score: 1

This so-called study was thought up by a well-known campus political player who recently completed a master's in library science and programmed by a friend who graduated a while ago, I think with a bachelor's in CS. I bet the whole thing was hacked up to try to get Cheney a job at Google.

Covering Ones Rear by gkozlyk · 2005-08-22 04:39 · Score: 3, Insightful

Ah, the good old disclaimer added to cover one's rear. With litigation flying free as newspaper in the wind, one can't be too careful these days.

Re:Covering Ones Rear by Stalus · 2005-08-22 05:32 · Score: 1

What I really love is the fact that the page used to have the professor listed on the list of authors, NCSA logos were on the page, UIUC was listed under the authors' affiliation, and it looked much more official. Now that it's been aired out as non-scientific, there's all sorts of disclaimers saying that it was his students' work, and shifting the blame. Too bad it was published on his webspace :P

Perhaps the professor of History and Sociology will think twice next time before attempting to put his name on a study that should have been conducted by people outside of his field.

Why is the disclaimer needed? by frdmfghtr · 2005-08-22 04:48 · Score: 2, Interesting

I didn't read the article the first time around, so maybe something was changed/removed that prompted the disclaimer. I read the report and couldn't find a single reference to NCSA, except in the URL and in the disclaimer itself.

Aside from the URL, was there some sort of NCSA association implied or claimed in the original post then removed?

--
Government's idea of a balanced budget: take money from the right pocket to balance...oh who am I kidding?

Re:Why is the disclaimer needed? by Anonymous Coward · 2005-08-22 05:19 · Score: 0

Previous version said: Matthew Cheney, Mike Perry, and Dr. Orville Vernon Burton University of Illinois at Urbana-Champaign and the National Center for Supercomputing Applications

The dark web by SpinyNorman · 2005-08-22 04:49 · Score: 5, Insightful

The Yahoo vs Google page count methodology of counting numbers of pages returned for various high-response queries seems to be completely ignoring the fact that Yahoo *might be* picking up some of the less highly linked-to "dark web" that Google's PageRank algorithm is going to rate lowly, and which their crawler may be ignoring.

This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.

What'd therefore be relevant and interesting to know isn't how many hundreds of pages Google vs Yahoo get for "my job sucks", but rather how many it gets for "my weevil collection".

Re:The dark web by squoozer · 2005-08-22 05:07 · Score: 1

Your wish fulfilled:
Google: Approx 47,100
Yahoo: Approx 258,000

Both searches were for "my weevil collection" without quotes. With quotes the results are:

Google: 3
Yahoo: 4

Yahoo is champ.

--
I used to have a better sig but it broke.
Re:The dark web by zarr · 2005-08-22 05:11 · Score: 1

Search results for "weevil":
google - 915,000
yahoo - 2,200,000
Search results for "my weevil collection":
google - 3
yahoo - 4
Yahoo returned the wikipedia page for "weevil" on the first page, so now I know what it is :)
Re:The dark web by markov_chain · 2005-08-22 05:19 · Score: 2, Interesting

Interesting.

The original search rating papers (Kleinberg's algorithm, PageRank) made the ground-breaking observation that links between pages contained lots of useful information that could be used for ranking in addition to the keywords contained in the pages themselves. This was at a time when websites were mostly personal, and there was an atmosphere of friendly sharing of information where people would link to other sites that they found interesting. However, how much have things changed today? Who still has their own web page? Aren't links mostly commercial? Like you said, there may be valuable sites out there that are getting ignored because of over-reliance on links as a source of reputation.

Maybe it's time for a "people who went to site A also went to site B" technology. It would require running a client-side traffic monitor that would build these adjacency lists and send them back. If it was open sourced and anonymous, the privacy concerns would be minimal, and it would provide a usage-based source of reputation.
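
To make the adjacency-list idea concrete, here is a minimal sketch of the server side, assuming each client reports one anonymous list of sites per browsing session. The session format, the function names, and the example hosts are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_covisits(sessions):
    """Count, for every pair of sites, how many sessions visited both."""
    counts = defaultdict(lambda: defaultdict(int))
    for visited in sessions:
        # set() so a site visited twice in one session counts only once
        for a, b in combinations(sorted(set(visited)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def also_visited(counts, site, top=3):
    """Sites most often seen in the same session as `site`."""
    neighbours = counts.get(site, {})
    # Highest co-visit count first; ties broken alphabetically
    return sorted(neighbours, key=lambda s: (-neighbours[s], s))[:top]

# Made-up anonymous session reports
sessions = [
    ["a.example", "b.example", "c.example"],
    ["a.example", "b.example"],
    ["a.example", "d.example"],
]

counts = build_covisits(sessions)
print(also_visited(counts, "a.example"))
```

Raw co-visit counts like these would favor sites everybody visits anyway, so a real system would presumably have to normalize by overall popularity; this sketch does not attempt that.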

--
Tsunami -- You can't bring a good wave down!
Re:The dark web by Anonymous Coward · 2005-08-22 05:42 · Score: 1, Funny

Search results for "weevil":
google - 915,000
yahoo - 2,200,000

Search results for "my weevil collection":
google - 3
yahoo - 4

You're getting negative hits?
Re:The dark web by MushMouth · 2005-08-22 05:47 · Score: 1

At one time that was google's 'pages like this one', and then there are Alexa's "Related Links", which have been around since before Google. Unfortunately there are privacy issues, and there would be (and is for alexa) a whole industry built around gaming that system.
Re:The dark web by markov_chain · 2005-08-22 06:11 · Score: 1

Good tip, thanks. The Alexa client is dead on. Abuse and privacy issues are inevitable, but I'm curious how a search engine using client-side information compares to a crawl based one.

--
Tsunami -- You can't bring a good wave down!
Re:The dark web by Epistax · 2005-08-22 06:15 · Score: 1

I have had a few items pop in my head which I thought were "totally awesome" and if successfully employed, worth many many monies.

As for the search engine, how about a little checkbox that says "No Business". What's a business? Someone who sells something (loosely). Anyway it'd take a heck of a lot of work to implement and define, but that's a checkbox I'd have thoroughly molested. I'd also want to make a thesaurus that edits words as you type.
Re:The dark web by dogod · 2005-08-22 06:34 · Score: 1

yahoo might have what you're looking for, though it's only beta. they allow you to switch from more commercial to more informational sites (i.e., from academic, non-commercial, or research-oriented sources). i have yet to try it so i don't know how well it works.

http://research.yahoo.com/research/data_analytics/mindset__intent-driven_search.shtml
Re:The dark web by RAMMS+EIN · 2005-08-22 07:19 · Score: 2, Insightful

``This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.''

I personally don't think Google is _excluding_ pages that somehow don't get enough links to them. Typically, good resources will get linked to, and thus taking into account the number of links to a page seems sensible.

From personal experience, I can't say I have anything to complain about with Google. When I post a new page on my site that includes some word that previously had few hits on Google, it gets to the top of the results within a few days. So, even without many links, the system works. When I search for words that do return many hits, the results I get first are usually the most relevant (provided that I have entered enough words to place everything in proper context; searching for "festival" wouldn't give me the speech synthesis software unless I also included "speech").

If you are specifically looking for pages that have few links to them, another search engine might be better for you. Or maybe not. Maybe you would be best served by using Google and looking at the last rather than first results. Perhaps it would be a good idea for Google to include an option to invert the ranking?

--
Please correct me if I got my facts wrong.
Re:The dark web by illumin8 · 2005-08-22 07:25 · Score: 1

This is the portion of the web that I'd like to see - not the commercial portion but the hobbyist and enthusiast sites that may be out there without lots of incoming links that would make them more highly rated and/or visible to Google.

Dude, if your idea of the dark web is millions of spam infested blog pages that have been crawled by a million spam robots putting links in the comment pages, along with a smattering of Mediawiki sites that have similarly been "0wnz0r3d" by spam crawlers that edit the pages and fill it full of links to online casinos, online pharmacies, and penis enlargement pills, more power to you...

I'll leave the "dark" web where it belongs... in the dark.

--
"When the president does it, that means it's not illegal." - Richard M. Nixon
Re:The dark web by alienw · 2005-08-22 07:44 · Score: 1

What makes you think your method is any better? It would be gamed just like PageRank (much worse, actually). Overreliance on any single method is not good if you want to have a decent search engine.
Re:The dark web by markov_chain · 2005-08-22 07:50 · Score: 1

It uses a different source of reputation, one that seems more in tune with what the Web content looks like today.

There is always the issue of abuse, no matter what the method.

--
Tsunami -- You can't bring a good wave down!
Re:The dark web by SpinyNorman · 2005-08-22 09:36 · Score: 1

Interesting idea. I imagine that Google would have the bandwidth and server capacity to capture and process this data if browsers were able to make it available.

I quite often find that Amazon's "people who bought this book also bought/viewed ..." section turns up useful stuff that a title search doesn't, so I expect the same may be true here too. One could even get a "user interest rating" of pages by how long they viewed them for...

Maybe Mozilla/Firefox could work with Google to implement this type of feedback system...

Maybe those pages never were crawled by yahoo. by MushMouth · 2005-08-22 04:51 · Score: 4, Interesting

It is entirely possible that Yahoo not only crawls more content on the whole, but that the percentage of the content they crawl that is spam is smaller. Crawlers get to spam/SEO pages via other spam/SEO pages, thus if they have better filtering mechanisms they may simply stop earlier and put their (not completely unlimited) resources into crawling more useful data.

Re:Maybe those pages never were crawled by yahoo. by rbarreira · 2005-08-22 04:57 · Score: 1

Have you read the study?

--

The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
Re:Maybe those pages never were crawled by yahoo. by Alomex · 2005-08-22 09:18 · Score: 1

No it isn't. In principle yes, but I tested for this, and it is Yahoo that returns more spam:

http://slashdot.org/comments.pl?sid=159703&cid=13374598

http://slashdot.org/comments.pl?sid=158453&cid=13275737

cm'on, mods! by Anonymous Coward · 2005-08-22 04:51 · Score: 0

This text is the content of the first link of the news post. It's fucking REDUNDANT, not interesting.

Duh by Anonymous Coward · 2005-08-22 04:54 · Score: 0

I've been getting 500 errors the whole morning while trying to reach /. But not 503 ones.

If you want a 503 error, try it 3 more times.

Trash and treasure. by solomonrex · 2005-08-22 04:57 · Score: 1

I think the point is that copies of the ispell dictionary and spam are repetitive, and repeated content is normally not included in search results. Why do you need more than one copy of an identical result?
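
For what it's worth, collapsing identical copies is cheap to sketch. This is only an illustration that assumes exact-duplicate detection by content hash; I have no idea what either engine really does, and the URLs and page bodies below are invented.

```python
import hashlib

def collapse_duplicates(results):
    """Keep only the first URL for each distinct page body."""
    seen = set()
    unique = []
    for url, body in results:
        # Normalise whitespace so trivially reformatted copies still match
        canonical = " ".join(body.split())
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique

# Two copies of the same word list plus one unrelated page
results = [
    ("http://a.example/words", "aardvark\nabacus\nabandon"),
    ("http://b.example/dict", "aardvark abacus  abandon"),
    ("http://c.example/page", "something else entirely"),
]

unique = collapse_duplicates(results)
print(unique)
```

An engine that does this will report fewer hits for word-list queries than one that doesn't, without its index being any smaller in a way that matters.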

trust by dioscaido · 2005-08-22 05:01 · Score: 4, Funny

If it made it through the Slashdot filters, then the study is good enough for me.

Re:trust by Anonymous Coward · 2005-08-22 06:01 · Score: 0

Yup, it will surely make it through the filter again soon.

More dupes at 10!

It was not "published" by kaan · 2005-08-22 05:04 · Score: 4, Insightful

why publish it in the first place?

Dude, it was never published, it was posted on one web server that is part of the ncsa.uiuc.edu sub-domain (specifically, vburton.ncsa.uiuc.edu). There are probably hundreds of machines that are in this network, and posting something on a web server running there does not equate to NCSA formally publishing an article. What we're talking about here is a web page written by two students, they worked on a project, they wanted to post it for other people to see. So that's what they did, period.

Stupidly, everyone is claiming that NCSA backed this whole thing, like they (NCSA) are on some crusade to compare Yahoo and Google. But this must be taken for what it is - a project by two students. NCSA's disclaimer is just trying to make this clear for the idiots out there who think that every little thing a student says or does must have been funded, supported, backed, etc. by NCSA.

Yahoo is also better at foreign language stuff by Anonymous Coward · 2005-08-22 05:12 · Score: 0

If I am searching japanese or chinese content, I'd go to Yahoo instead of Google.

what by Anonymous Coward · 2005-08-22 05:14 · Score: 0

what

disclaimer: people believe everything they read by kaan · 2005-08-22 05:18 · Score: 1

I think you're right, the original article had no visible association with NCSA other than the url. But this is just like the classic telephone game: I tell you something, you repeat it to somebody else with a minor addition/change, then that person tells somebody else, etc. By the time it goes 4 or 5 hops, it's been totally twisted around, and my original message has turned into something idiotic, and everyone thinks I said it. This is exactly what happened here, because it started showing up on blogs, and then news sites started writing about it.

Overall, everyone was more or less accurate with regard to the article's details and results, etc., but the fact that this was just a single web page posted on a single web server in the ncsa.uiuc.edu subdomain was lost on everyone. People did not carry that important detail along, and over time it morphed into something else. Pretty quickly, we started seeing articles like, "NCSA Compares Google and Yahoo Index Numbers" appear on slashdot, which is hugely popular, and suddenly the whole world thought that the National Center for Supercomputing Applications was on a crusade to figure out which search engine is better. Hence the disclaimer from NCSA to formally tell the world that this "article" was "published" by two students, and that's all.

Yes, but... by Anonymous Coward · 2005-08-22 05:20 · Score: 0

Your wish fulfilled:
Google: Approx 47,100
Yahoo: Approx 258,000

Since both sites will only give access to the first 1000 hits (a major gripe of mine), WTF good is it?

I want to be able to see hit # 441,874,356 if I want. I may not be looking for the winner of a popularity contest.

Never mind that, look how low your ID is :-o by Low+Slashdot+ID+Guy! · 2005-08-22 05:26 · Score: 1

I, mean, like, cool, dude!

--
Ooh, you have a low Slashdot ID, yes you do, ooh!

How many times do I have to tell you... by Anonymous Coward · 2005-08-22 05:27 · Score: 0

Don't link to slashdot!!!!

It leads to the nerd wannabes (85% of /.'s audience) clicking the link, making slashdot slashdot itself.

Link to the Goooooogle cache instead. I mean, nobody will miss Gooooooooogle when it gets slashdotted!

Slashdot - News about Google - Google Matters by Anonymous Coward · 2005-08-22 05:27 · Score: 0

FFS!

Shut up about Google already!

Google (the 'Do No Evil' tm.) Corp is becoming as bad as MS as every day goes by. I predict they will get even worse before trends and hubris bring them down.

Witness the A and B share debacle where the Google Gods get 10x more voting rights than Joe Shareholder.

Google sucks sucks sucks!!!

Re:"Editor" seemed to contradicted someone's skill by ryanov · 2005-08-22 05:37 · Score: 1

I don't see why this was modded down. It really took away from the summary. Granted, the sentence makes no sense WITHOUT the error.

Hit volume is a minor factor by Council · 2005-08-22 05:42 · Score: 1

I know it's been said before, but you cannot just measure search engines based on volume of hits returned. Clearly, when you get into the millions, it doesn't hurt the results to prune some crap off the end, and I'm sure they're both doing things -- either one could easily focus a little on breadth of hits per query and jump past the other.

Important thing to note: The general principle is MORE COMPLEX than "find all pages containing this term". You can ADD terms and get MORE hits.

As an example and as a thing to keep in mind, witness:

Results 1 - 10 of about 298,000 for robot dance research
Results 1 - 10 of about 970,000 for robot dance research ME

--
xkcd.com - a webcomic of mathematics, love, and language.

Re:Hit volume is a minor factor by Council · 2005-08-22 10:10 · Score: 1

Though note that I'm not really referring to 'index size' but to the size of the list of hits returned for a term. I'm pretty much in favor of the indexes being as large as possible, and that's a reasonable thing to demand. But saying that one engine returns 2,000,000 more hits for 'banana store' than the other is not measuring the same thing at all, and is in fact dumb.

--
xkcd.com - a webcomic of mathematics, love, and language.

I say so what.... by Khyber · 2005-08-22 05:43 · Score: 1

DISCLAIMER: This comment is influenced by Colt 45 malt liquor...

Big deal about what some other corp. says. This is a Joe Schmoe study conducted by college students. This means it's an independent, non-funded (therefore non-corp influenced) study. Too bad they have seemingly been coerced into changing some things in their article. *sigh* Why can't they ever stick to their guns??

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.

Re:I say so what.... by Anonymous Coward · 2005-08-22 06:07 · Score: 0

Yes, they added some true information to clear up a popular misconception. How awful. If only they would stick to their guns!

NCSA? by Anonymous Coward · 2005-08-22 05:53 · Score: 0

WTF is NCSA?

North Carolina Space Administration? (Well, the one based in Florida isn't doing so good these days...)
Northern Canada Soccer Association?

Re:NCSA? by gauauu · 2005-08-22 13:04 · Score: 1

National Center for Supercomputing Applications, a research branch of the University of Illinois at Urbana-Champaign.

Their claims to fame include having four supercomputers of the 50 fastest in the world, and creating Mosaic, the first graphical web browser.

If you have doubts about the influence of Mosaic, load up Internet Explorer, click "Help" in the menu, then click "About Internet Explorer" and read the blurb....

Accuracy of Google counts? by xiaomonkey · 2005-08-22 05:58 · Score: 5, Interesting

Try the following sets of key words on Google:

lawyer - results 29,300,000
lawyer lawyer - results 29,300,000
lawyer lawyer lawyer - results 62,000,000
lawyer lawyer lawyer lawyer - results 78,600,000

This trend appears to continue: repeating the "lawyer" keyword 10 times results in Google estimating that there are 389,000,000 hits in its index.

On Yahoo, this sort of thing doesn't seem to happen as much, but it still does happen. For example, searching for "lawyer" returns 124,000,000 results, and searching for "lawyer lawyer" returns 125,000,000 results.

So, it probably doesn't really make sense to judge the relative size of either index based on the estimated number of hits for any given set of keywords in their index. Right now, Google's numbers look a little more suspect since they seem to vary so greatly just based on the repetition of a keyword. However, the stability of Yahoo's numbers doesn't necessarily mean that they're correct either.
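One quick way to see how far these estimates drift is to divide each one by the single-keyword baseline. The sketch below uses only the figures quoted in this comment; the helper name `growth_vs_baseline` is made up for illustration.

```python
# Estimated hit counts quoted in the comment above, keyed by how many
# times "lawyer" was repeated in the query.
google_estimates = {1: 29_300_000, 2: 29_300_000, 3: 62_000_000,
                    4: 78_600_000, 10: 389_000_000}
yahoo_estimates = {1: 124_000_000, 2: 125_000_000}


def growth_vs_baseline(estimates):
    """Return each estimate divided by the single-keyword estimate."""
    baseline = estimates[1]
    return {n: count / baseline for n, count in sorted(estimates.items())}


if __name__ == "__main__":
    for name, data in (("Google", google_estimates), ("Yahoo", yahoo_estimates)):
        for repeats, ratio in growth_vs_baseline(data).items():
            print(f"{name}: {repeats} x 'lawyer' -> {ratio:.2f}x the baseline")
```

If repeating the keyword changed nothing, every ratio would be 1.0. With these numbers Google's ten-keyword estimate is roughly 13 times its baseline, while Yahoo's two figures differ by under 1%.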

Re:Accuracy of Google counts? by xiaomonkey · 2005-08-22 06:10 · Score: 1

Sorry, the yahoo links I gave above are erroneous.

Here's the corrected version of the first one, "lawyer", which results in an estimated 125,000,000 hits. The second one, "lawyer lawyer", results in an estimated 124,000,000 hits.

Re:Accuracy of Google counts? by Anonymous Coward · 2005-08-22 06:16 · Score: 0

Estimates on boolean queries are the only thing you can get. A 1% variation as for Yahoo is normal. Doubling the results (or more) as for Google is beyond estimation errors.

Re:Accuracy of Google counts? by rbarreira · 2005-08-22 06:30 · Score: 1

Those estimates are pretty irrelevant for this discussion, I think. When there are many results, those estimates aren't supposed to be accurate at all, that's why the study focused on queries with very few results.

But yes, those numbers you show are quite strange.

--

The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F

Re:Accuracy of Google counts? by Sique · 2005-08-22 06:33 · Score: 1

For the search terms bla, bla bla, bla bla bla, etc., the numbers at Google remain pretty stable (starting out at 2.040 million, and later somewhere between 1.870 and 1.900 million).

So the interesting question is: Why does it work with lawyer, but not with bla?

--
.sig: Sique *sigh*

This may be true by lcsjk · 2005-08-22 06:22 · Score: 2, Funny

I understand that Google uses a very efficient compression technology to compress documents before they are indexed, thereby making characters so small that they can only be read with a magnifying glass or microscope.

In contrast, Yahoo, unless I misunderstand, only compresses the file after it has been indexed. Since only the file is compressed and not the individual characters, they indeed have a larger index file as the study concluded. :)

More thoughts on a better test by freality · 2005-08-22 06:23 · Score: 4, Interesting

After criticising the study when I first saw it, I now have some constructive ideas on how to perform a better test of the relative search performance of the two engines.

- Crawler Test

Do a search of "microsoft site:microsoft.com" in Google, Yahoo and MSN search. Assume that Microsoft knows how to crawl its own site completely and judge the relative strength of Google's and Yahoo's crawlers based on how many of those pages they find. Google easily wins.

Unfortunately, this test doesn't work with Yahoo as the reference site since Google returns no hits for "yahoo site:yahoo.com". This is very disappointing, as Yahoo is one of the largest sites on the Web. Neither Yahoo nor MSN share this self-censorship policy.

Another test of the same kind is "amazon site:amazon.com", since amazon is a very big site which everyone is presumably very interested in crawling well. Unfortunately, amazon doesn't allow this kind of search on themselves via their A9 engine.

This is an interesting test because it compares the very likely actual size of a site (i.e. Microsoft's reported size for microsoft.com) with the reported size by the second-party search engine. This may be the best 3rd party test of a crawler's Recall on a per-site basis.
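The crawler test boils down to a ratio: the page count an outside engine reports for a site, divided by the count from a reference assumed to be complete. A rough sketch follows; the numbers are placeholders invented for illustration, not measurements, and `site_recall` is a hypothetical helper.

```python
def site_recall(engine_count, reference_count):
    """Estimate per-site recall as engine_count / reference_count.

    reference_count is assumed to be the site's true size (e.g. what the
    site's own search reports), which is itself only an estimate.
    """
    if reference_count <= 0:
        raise ValueError("reference_count must be positive")
    # Cap at 1.0: an engine reporting more pages than the reference
    # signals duplicates or a bad reference, not recall above 100%.
    return min(engine_count / reference_count, 1.0)


# Placeholder counts for illustration only (NOT real search results).
reference = 1_000_000
reported = {"engine_a": 800_000, "engine_b": 250_000}

for engine, count in reported.items():
    print(f"{engine}: recall ~ {site_recall(count, reference):.0%}")
```

The whole thing stands or falls on the reference count being trustworthy, which is the assumption made about Microsoft's own crawl above.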

- Common Word Test

Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.

This is an interesting test because the word "the" is probably as uniformly distributed in the source webpage population as the random phrases used in the NCSA test (or possibly more so), plus this word is more likely than any other to occur in every web document (at least in English). These two characteristics mean that finding the most pages with "the" may be the closest approximation of an actual Recall measurement for all sites on the web that can be done without a prior-knowledge testing set.

- Conclusion

Google and MSN seem to have high-quality crawlers, whereas Yahoo makes up with a much larger current index size. However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with.

Even assuming these results, Yahoo's bigger index isn't necessarily more desirable. A better crawler can yield a bigger and better index with relatively little work (just add more machines to store it on), while there is no easy fix for the meticulous work needed to ensure a crawler is getting to all the nooks and crannies of the Web.

Longer-term, it is these caveats that make an Open-Source approach to crawling shine by comparison. Take for example the Nutch search engine. Though it is in its early days, there need be no doubt about its crawling and ranking algorithms as well as its Precision/Recall tradeoff.
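For anyone unfamiliar with the Precision/Recall tradeoff mentioned above, here is the textbook definition as a small sketch. The document IDs are invented; with a real engine you rarely know the full relevant set, which is exactly the measurement problem.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 10 relevant documents exist on the "web".
relevant = set(range(10))
broad = set(range(8)) | {100, 101, 102, 103, 104, 105, 106, 107}  # big, noisy
narrow = {0, 1, 2, 3}                                             # small, clean

print(precision_recall(broad, relevant))   # (0.5, 0.8)
print(precision_recall(narrow, relevant))  # (1.0, 0.4)
```

The broad result list finds more of the relevant pages but half of it is junk; the narrow one is all good hits but misses most of what exists.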

Re:More thoughts on a better test by Anonymous Coward · 2005-08-22 06:34 · Score: 0

No hits for "yahoo site:yahoo.com"?
Really?

Results 1 - 10 of about 47,100,000 from yahoo.com for yahoo.

Re:More thoughts on a better test by knoebelsPT · 2005-08-22 06:37 · Score: 1

What?

A google search for yahoo site:yahoo.com turns up over 57,000,000 hits, not zero.

Re:More thoughts on a better test by Alomex · 2005-08-22 09:10 · Score: 1

A search for "the" on each shows Yahoo significantly in the lead.

The problem with such searches is that if a search engine misindexes a mirror site or a 401 page and that is returned in the count, then that SE looks bigger.

On the other hand if you launch a query that has 5-10 answers you can actually examine every single page on both result pages and make sure that all hits are correct and distinct.
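The hand-checking described above can be sketched as a small filter: drop hits that turned out to be error pages, collapse mirrors to one canonical URL, and count what is left. Everything here is hypothetical, including the `Hit` record and the idea that a human fills in `is_error_page` and `canonical_url` after looking at each page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    url: str
    canonical_url: str   # filled in by hand: which page this really is
    is_error_page: bool  # filled in by hand: error-page junk or not


def count_distinct_valid(hits):
    """Count hits that are real pages, treating mirrors as one page."""
    return len({h.canonical_url for h in hits if not h.is_error_page})


# Made-up result list for one obscure query.
hits = [
    Hit("http://a.example/doc", "http://a.example/doc", False),
    Hit("http://mirror.example/doc", "http://a.example/doc", False),  # mirror
    Hit("http://b.example/gone", "http://b.example/gone", True),      # error
    Hit("http://c.example/page", "http://c.example/page", False),
]

print(len(hits), "reported,", count_distinct_valid(hits), "distinct and valid")
```

This only works because the result lists are short enough to inspect by hand, which is the whole point of picking queries with 5-10 answers.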

Using that technique Google comes out ahead of Yahoo, by a large margin.

Re:More thoughts on a better test by Esteanil · 2005-08-22 11:19 · Score: 1

Surprisingly, Google, Yahoo and MSN all now allow stop-word searching. Stop words are words like "the", "a", "it", etc. A search for "the" on each shows Yahoo significantly in the lead.

From the article:

Interestingly, the actual total number of results returns varies dramatically from the estimated total number of results that both Google and Yahoo! provide users in the search results. In the case of Google, the number of actual results returned is about one third of the estimation that Google gives. However, in the case of Yahoo! the actual number of search results returned is less than one sixth the estimated total.

--
I'm a dreamer, the world is my playpen. But hey, I'm a serious person, I can't dream all the time.

Re:More thoughts on a better test by freality · 2005-08-22 14:08 · Score: 1

Yeah, I agree there are many caveats. Like I said: "However, take this with a grain of salt as it's hard to measure anything without knowing how the sites do Precision/Recall tradeoffs, and there may be substantial structural differences in the way the sites index pages to start with."

What you describe would be to me a "substantial structural difference". Which means I agree.

However, that doesn't change that I do think it's better to accept probable error in a huge population of samples than to choose a minuscule population which is still susceptible to error, as the NCSA study seems to have favored.

The fundamental problem is that if you care about the actual performance of Google and Yahoo, you're left with very little to judge them by.

It's better, as usual, to support open approaches like Apache Nutch.

Re:More thoughts on a better test by freality · 2005-08-22 14:23 · Score: 1

Well, yeah.

But even though I said take everything I said with a grain of salt, I would take that claim about estimated vs. actual hits with 2 grains of salt.

Here's my superb rationale why.

Consider that if it was easy or efficient to return the exact # of hits, they would probably do it, instead of, for instance, "Results 251 - 259 of about 284". I mean, consider the UI people at Google... do they want to muddy their otherwise famously precise and clear interface with a gaffe like that? I'd bet not, unless they couldn't do any better or unless there's a good reason not to show the full 284.

My bet is that they tune the results estimate to be basically precise for an average query, and not a junk query.

However, there is a more tantalizing possibility. What if the "of about X" means that X is the total # of pages that *do* match the query, but some repeats or otherwise bogus pages are thrown out?

This would be exactly the evidence needed of their precision/recall tuning. In that case, which of course may be wrong, not only does Yahoo have a huge index, but is leveraging its size to be able to tune for a much higher precision while still returning a large number of results.

That's one possibility it would have been nice to see mentioned in the NCSA study.

Re:More thoughts on a better test by freality · 2005-08-22 14:27 · Score: 1

Actually, glad you brought this up. I turned off duplicate elimination on both sites for my test query and got an exact match between Yahoo's estimated and actual number of results pages. Google's actual number increased but still fell short of its estimate.

So again, looks like Yahoo has a good handle on what its index size is, but is simply filtering out lots of the results from its index. Perhaps there's not really that much difference between the two after all :)

Darn students... by jpellino · 2005-08-22 06:28 · Score: 1

Felonies for the whole lot of 'em!

Oh, wait. Which students were these?

--
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."

Study: Red Delicious apples != Fuji apples by RunzWithScissors · 2005-08-22 06:32 · Score: 3, Interesting

I got flamed for proposing this theory when the article was first posted on /.

One major problem with the study, not really addressed by this article on its problems, is one of comparison. Yes, Yahoo! and Google are both search engines, but they perform their searches differently and, more importantly, use different criteria when returning matches! It is quite possible that doing the exact same search in both yields different results. Why? Because the two search engines have different criteria for providing output, or matches if you will, from said search. Perhaps Yahoo! does indeed have more pages indexed, but because of their search algorithm, or the program which displays results to the user, fewer matches are provided, even though more pages were looked at.

I'm not saying that Yahoo! does have more or fewer pages than Google. But the study that was executed and published did not account for many of the differences between the products they were comparing. The above is merely another interpretation of said results. I'm glad that some folks at NCSA agree and provided some clarification, lest we get another urban myth like storks bringing babies.

-Runz

Re:Accuracy of Google counts? (oblig.) by CycleMan · 2005-08-22 06:36 · Score: 2, Funny

lawyer - results 29,300,000
lawyer lawyer - results 29,300,000
lawyer lawyer lawyer - results 62,000,000
lawyer lawyer lawyer lawyer - results 78,600,000

lawyer lawyer lawyer lawyer
lawyer lawyer lawyer lawyer
lawyer lawyer lawyer lawyer
LAW SUIT LAW SUIT!

lawyer lawyer lawyer lawyer
lawyer lawyer ...

Yahoo hits.. my error. by freality · 2005-08-22 06:40 · Score: 1

Yeah, I see that now too. I must have mistyped. My apologies to Google for publicly questioning their editorial policy without merit!

Re:Yahoo hits.. my error. by Anonymous Coward · 2005-08-22 09:03 · Score: 0

Just edit the original posting like other sites let you... erh... never mind /. runs under the *nix philosophy that "users ought to be perfect or else be severely punished".

Matt Cheney is a punk by vishesh · 2005-08-22 08:33 · Score: 1

I took a philosophy class with Matt Cheney at the University of Illinois. Let me just say for the record that he is a douchebag. I am really not surprised that he tried to pass off this study under the auspices of NCSA. I'm just glad to see that someone called him on this.

Comment removed by account_deleted · 2005-08-22 08:36 · Score: 1

Comment removed based on user account deletion

Re:"Editor" seemed to contradicted someone's skill by Anonymous Coward · 2005-08-22 08:41 · Score: 0

I also think the GP should not have been modded down. However, I think the sentence would make much more sense without the error.

NCSA has issued a strong disclaimer on the study announced recently on Slashdot that seemed to contradicted the fact that Yahoo's index size would be bigger than Google's

What about ...seemed to contradict...
or plain ...that contradicted the fact...
but using ...seemed to contradicted... is just plain wrong.

Yahoo bigger, how? by SharpFang · 2005-08-22 09:07 · Score: 1

Google has the great majority of the web covered. While obeying robots.txt and such, they can't index much more meaningful content. So how did Yahoo almost triple Google's count? Well, as long as you're looking for obvious stuff with "easy hits", the results will be similar. But if you enter REALLY obscure stuff, for which Google shows 3-5 hits, Yahoo will show the same 3-5 hits and 15 others, which are all different variants of 404, pages pointed to through broken links. Simply put, 2/3 of Yahoo's index is "404 not found" pages, and that's how it gets such huge numbers...

--
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2

Feedback system by harmonica · 2005-08-22 12:17 · Score: 1

Maybe Mozilla/Firefox could work with Google to implement this type of feedback system...

That already exists as stumbleupon.com and del.icio.us.

Re:First by Anonymous Coward · 2005-08-22 16:07 · Score: 0

well done!

"Could" & "might" (Re:... so?) by mi · 2005-08-23 00:46 · Score: 1

Count the "coulds" and the "mights" in your post and agree with me that NCSA's method cannot be used to conclusively compare the sizes of Yahoo!'s and Google's indexes...

--
In Soviet Washington the swamp drains you.

Google presents 7 docs, yahoo 1, on a real query.. by Anonymous Coward · 2005-09-04 16:44 · Score: 0

search terms: fencing foil sabre timings milliseconds interval

motivation: the fencing federation recently changed the timings on the electronic scoring equipment.

yahoo: http://tinyurl.com/8tutq one page, but it's what I wanted.

google: seven pages, all junk. http://tinyurl.com/a9hd9
