NCSA Compares Google and Yahoo Index Numbers

Yahoo pants down, egg on face, no WMD either. by ackthpt · 2005-08-15 06:12 · Score: 3, Interesting

So the summary is in all but 3% of the time, Yahoo finds less pages than Google and that 18 bi1110nz Mayer claimed are a number he pulled right out of his own arse.

Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.

75% less truth than other leading brand

--

A feeling of having made the same mistake before: Deja Foobar

Re:Yahoo pants down, egg on face, no WMD either. by Iriel · 2005-08-15 06:24 · Score: 3, Interesting

I think it is possible that Yahoo! has more items indexed than Google. It may not be true after all, but one has to give thought to the fact that Yahoo can search subscription based content. That has got to boost their numbers considerably beyond the range of queries that typically return less than one thousand results. It's possible that Yahoo! could have simply been fudging the numbers to get some press now that they're actually starting to get noticed again. I can't make a certain conjecture in either direction, but don't totally discredit Yahoo! without looking into everything.

--
Perfecting Discordia
www.stevenvansickle.com
Re:Yahoo pants down, egg on face, no WMD either. by Sandor+at+the+Zoo · 2005-08-15 06:39 · Score: 1, Insightful

Yeah, this "study" seems to be something whipped together over a weekend. Particularly:
Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google.
So, anything popular gets tossed. What if Yahoo! indexes all the pages with popular search terms, but Google only indexes the first 1,000? I doubt very much that it's the case, but this whole approach seems suspect at best.
They threw out what is probably a huge chunk of the results they got, didn't tell us (that I can find) how much they threw out, then make conclusions based on the small sample left over. Seems like a very odd research method.
Re:Yahoo pants down, egg on face, no WMD either. by Anonymous Coward · 2005-08-15 06:41 · Score: 0

NCSA is well funded yo, they won't easily be bribed when they can play around with their high end PowerMacs.
Re:Yahoo pants down, egg on face, no WMD either. by Anonymous Coward · 2005-08-15 06:53 · Score: 0

Why on earth do you think that pages are indexed according to what search terms they match? They aren't!
This idea of yours that only the N first pages to match a search term, or collection of search terms, are indexed only exists in your imagination. You will understand why if you read up a bit on Information Retrieval techniques and then give the problem of searching the web some actual thought.
What search engines generally DO care about is document uniqueness, which is a tricky business and a really hard problem to solve well for a huge and unwieldy corpus like the web.
Shame on you.
Re:Yahoo pants down, egg on face, no WMD either. by loose_cannon_gamer · 2005-08-15 06:54 · Score: 5, Insightful

After reading half the comments on this page, I'm amused at how many alert readers are making the same mistake that they accuse Yahoo of -- misstating results.
Can we conclude from this study that Google has a bigger index than Yahoo? No. Can we conclude that when you pick two English words that when entered into both Google and Yahoo, both return less than 1000 results, that Google has consistently more results? Yes.
The real question is, what can we infer from the actual indisputable findings of this study? I find no ready method of generalization. If you are inclined to believe google is better, you feel happy inside. If you think yahoo is better, you have many options to dispute the idea that the study result generalizes to search engine index size.
As a google fan, I enjoy the warm fuzzies, but I don't see that much to get excited about either way.

--
In Soviet Russia, us are belong to all your base.
Re:Yahoo pants down, egg on face, no WMD either. by telecsan · 2005-08-15 06:59 · Score: 0, Redundant

Exactly. I wouldn't judge the effectiveness of a search engine by the number of results returned for 2 randomly chosen dictionary words, let alone only those pairs which returned fewer than 1000 results.

Besides, who ever said more results were better? I mean, yeah, your car can have 400HP, but that doesn't change the fact that there are still stoplights every 50 ft. Give me a car with a remote to change the light to green, and now we're talking! They should focus on making the search results better, not larger.
Re:Yahoo pants down, egg on face, no WMD either. by Doc+Ruby · 2005-08-15 07:21 · Score: 1

RTFA. Or use the Yahoo methodology: claim results without actually reading webpages.

--
--
make install -not war
Re:Yahoo pants down, egg on face, no WMD either. by icemann476 · 2005-08-15 07:49 · Score: 2, Insightful

Actually, their decision to throw out any queries resulting in more than 1000 pages returned seems very logical to me. How many times have you typed in a search, perused through the 1000 pages and felt like you just needed more options? Yahoo may very well have more than double the total indexed pages Google has but what good is it to have 10,000 pages returned for 1 query; it becomes redundant at some point. I think the research did a good job of showing that Google produces more options (indexed pages) per search than Yahoo does, regardless of who actually has more "total pages" indexed.
Re:Yahoo pants down, egg on face, no WMD either. by Anonymous Coward · 2005-08-15 08:15 · Score: 0

Besides, who ever said more results were better?

I guess that is what Yahoo (and perhaps Google) both imply.
Re:Yahoo pants down, egg on face, no WMD either. by Iriel · 2005-08-15 08:16 · Score: 2, Insightful

I agree on that. Based on the methods used to test a general index size, I think it leaves a lot of holes. When you're talking about millions of items, a generalization can be woefully inaccurate.

Rather than talking about indexed content, it seems like this test is actually more appropriate to use as some sort of analysis on the overall usefulness of the search engines. Even then, though, the results could be skewed to say that it's better to provide a wealth of pages (Google) or to have fine tuned and narrowed results that you're looking for (Yahoo!). Numbers matter to a program, results matter to people. This test only portrays the former, yet the latter is what we're really trying to get at.

Either way, I don't think random tests can really do justice to Google or Yahoo!. Rather than performing a randomized test upon each, I think the better gauge of each's usefulness would be something more like a practical application study. In other words, evaluate real everyday kind of searches on each site instead of an unlikely combination of two random English words like politics and truth ;)

In other words, while I commend the effort to debunk any misinformation about which search engine is better endowed, so to speak; the numbers given don't provide useful information to anyone but a spin doctor.

(As a side note, I'm actually more of a Google fan for search and applications, but I love Yahoo! as a lifestyle portal for things like movie listings and such)

--
Perfecting Discordia
www.stevenvansickle.com
Re:Yahoo pants down, egg on face, no WMD either. by okayplayer · 2005-08-15 08:38 · Score: 2, Funny

Did you not read the article a couple of days ago about those "remotes" becoming federally illegal?

--
What a horrible thing the ESRB just did to the game industry.
Re:Yahoo pants down, egg on face, no WMD either. by PDAllen · 2005-08-15 08:59 · Score: 1

While I suspect in this case the situation is that Yahoo are talking crap, I'd observe that the study is flawed.

It assumes that if a page is indexed by both Google and Yahoo then a given search string will either return the page in both search engines or in neither; this is not necessarily true. Both Google and Yahoo can find pages which do not contain a search string (e.g. 'failure' famously finds GWB's bio, a page which doesn't contain 'failure'). If Google's indexing algorithm were more generous than Yahoo's with this sort of thing then you'd expect to find Google returning more results for a given search even if Yahoo had indexed more pages. Extreme example: I set up PDAllenSearch which has indexed only a few million pages, but returns any page containing the search string or directly linked to one which does. That probably returns more results than Google for most queries, and with a sensible ranking algorithm it won't look too bad.
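The PDAllenSearch thought experiment above can be sketched in a few lines. This is only a toy illustration: the page ids, texts, and link graph are made up, and "directly linked" is assumed here to mean a link in either direction.

```python
def search(pages, links, term, expand_links=False):
    """Return the ids of pages matching `term`.

    pages: dict of page id -> page text (the whole toy "index")
    links: dict of page id -> set of page ids it links to
    """
    term = term.lower()
    # Strict matching: only pages whose own text contains the term.
    direct = {pid for pid, text in pages.items() if term in text.lower()}
    if not expand_links:
        return direct
    # Generous matching: also return indexed pages one link away from a hit.
    expanded = set(direct)
    for src, targets in links.items():
        for dst in targets:
            if src in direct and dst in pages:
                expanded.add(dst)
            if dst in direct and src in pages:
                expanded.add(src)
    return expanded


# Hypothetical four-page index; only page "a" contains the search term.
pages = {
    "a": "a report on failure modes",
    "b": "official biography",
    "c": "gardening tips",
    "d": "weather",
}
links = {"a": {"b"}, "c": {"a"}, "d": set()}

strict = search(pages, links, "failure")
generous = search(pages, links, "failure", expand_links=True)
print(len(strict), len(generous))
```

Both calls run against the same four-page index, yet the generous matcher reports three times as many results, which is the point: result counts depend on matching policy as well as on index size.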
Re:Yahoo pants down, egg on face, no WMD either. by dustmite · 2005-08-15 09:18 · Score: 2, Insightful

I think it is possible that Yahoo! has more items indexed than Google ... Yahoo can search subscription based content. That has got to boost their numbers considerably beyond the range of queries that typically return less than one thousand results
If we assume that Yahoo has offered subscription-based content searching for about two years (not sure of the exact length of time), then to get even close to the difference they are citing here in their marketing (over 11 billion more items), they would have to have added over 116 subscription-based items per second, every single second since they started. This seems rather unlikely. Far far far more likely is that this is just a case of extremely "creative (ac)counting" on Yahoo's part.
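As a quick check of that arithmetic, here is a minimal sketch using only the figures in the comment (11 billion extra items, a period of about two years). With 365-day years, two years works out to roughly 174 items per second, while 116 per second corresponds to a period of about three years. Either way the required rate is implausibly high, so the argument stands.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years


def items_per_second(extra_items, years):
    """Average rate needed to add `extra_items` over `years` years."""
    return extra_items / (years * SECONDS_PER_YEAR)


EXTRA_ITEMS = 11_000_000_000  # "over 11 billion more items"

rate_two_years = items_per_second(EXTRA_ITEMS, 2)
rate_three_years = items_per_second(EXTRA_ITEMS, 3)
print(f"2 years: {rate_two_years:.1f}/s, 3 years: {rate_three_years:.1f}/s")
```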
Re:Yahoo pants down, egg on face, no WMD either. by barawn · 2005-08-15 09:31 · Score: 1

Can we conclude that when you pick two English words that when entered into both Google and Yahoo, both return less than 1000 results

Two random words.

The study essentially is showing that there are more dictionary lists in Google than there are in Yahoo. The vast majority of the included searches return few (~10 ish) results, and they're extremely similar results each time. So that result gets magnified a ton.

Now, the interesting thing is that if you look at results which return between 100 and 1000 results (i.e. not dictionary list results) you get a similar conclusion (but with much, much lower statistics! ~300 vs 10,000). Which in some way, makes sense - there's nothing special about the sample of dictionary list sites, and so if there are more of them on Google, there's likely to be more of the normal sites, too. But the original study can't really even make the claim that it's making.
Re:Yahoo pants down, egg on face, no WMD either. by NickFortune · 2005-08-15 09:33 · Score: 1

So google performs better for hard to find inform
Can we conclude from this study that Google has a bigger index than Yahoo? No.
To be fair, all they conclude is that Yahoo's claims seem "suspicious" in light of their findings.
Can we conclude that when you pick two English words that when entered into both Google and Yahoo, both return less than 1000 results, that Google has consistently more results? Yes.
mmm... and therefore? I think there's supposed to be some critical evaluation of the experimental data.
For example, is this the expected result, given that Yahoo claims indices two and a half times larger than Google's? If we accept Yahoo's claim, what does this say about the quality of their indices? Or should we perhaps conclude that index size isn't a reliable metric by which to evaluate search engine usefulness?
Certainly, it seems that Google returns more information in those cases where information is scarce. And arguably, that's the most important case.
As a google fan, I enjoy the warm fuzzies, but I don't see that much to get excited about either way.
Perhaps not, but it does cast doubt onto Yahoo's claims.

--
Don't let THEM immanentize the Eschaton!
Re:Yahoo pants down, egg on face, no WMD either. by -brazil- · 2005-08-15 10:18 · Score: 1

Actually, their decision to throw out any queries resulting in more than 1000 pages returned seems very logical to me. How many times have you typed in a search, perused through the 1000 pages and felt like you just needed more options?

You don't seem to have understood what it really implies. It's more like "How many times have you typed in a search, seen that there are more than 1000 results, and then decided not to look at any of the results and look for something different instead?"

Basically, they limited their survey to only pretty obscure search terms. That's a pretty significant limitation. It's perfectly possible that Yahoo uses a method of gathering data that somehow skews towards content containing popular search terms, thus returning more results for such search terms, and quite probably also more relevant results at the top.

True, I don't think this is very likely, but it's possible.

--
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger
Re:Yahoo pants down, egg on face, no WMD either. by icemann476 · 2005-08-15 10:31 · Score: 1

I don't understand your logic, please elaborate and forgive my ignorance. I have never typed in a search (Google or Yahoo) and decided that because it returned more than 1000 pages, I won't bother looking at any of them. Google "Slashdot" for instance, Results are about 22,400,000 for Slashdot. The important part is that the first page displays the results I want. It becomes redundant to me if Yahoo returns "more" options.
Re:Yahoo pants down, egg on face, no WMD either. by -brazil- · 2005-08-15 10:55 · Score: 1

I have never typed in a search (Google or Yahoo) and decided that because it returned more than 1000 pages, I won't bother looking at any of them.

That's my point - the survey discussed here did exactly that, it completely disregarded any search terms with more than 1000 results. Thus it is rather dubious whether the survey's results say anything at all about the relative usefulness of the two engines in general (since most searches use terms that produce more than 1000 results) or their index sizes.

--
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger
Re:Yahoo pants down, egg on face, no WMD either. by 51mon · 2005-08-15 13:20 · Score: 1

But it is a statistical necessity: if all you know accurately is that there were more than a thousand results, doing anything else is pretty pointless.

The only other approach that springs to mind for completeness testing is to take a selection of web pages, and see if any of them are missing from each engine. But I suspect finding a good selection of missing pages is a lot harder than tossing out popular results.

Done over a weekend - sure.
Inaccurate and imprecise - sure.
Puts the onus on Yahoo! to put up or shut up - sure.

This study isn't answering the question, it is asking Yahoo! to justify its claim.
Re:Yahoo pants down, egg on face, no WMD either. by Anonymous Coward · 2005-08-15 13:24 · Score: 0

Don't worry. It'll be duped soon enough.
Re:Yahoo pants down, egg on face, no WMD either. by omar+alfaidi · 2005-08-15 14:13 · Score: 1

I was wondering: if, for example, Google returned more than 1000 results and Yahoo returned fewer than 1000, would the keyword change the results? I guess yes.
Re:Yahoo pants down, egg on face, no WMD either. by RedWizzard · 2005-08-15 14:27 · Score: 1

The study essentially is showing that there are more dictionary lists in Google than there are in Yahoo.
But Yahoo claim to have indexed more than twice as many pages as Google so how can this be? Do you think Google is focusing on indexing dictionary lists? Or that Yahoo are ignoring them in particular? If Yahoo's index is more comprehensive then they should have more sites indexed whatever the type of site in question. Maybe it's not valid to generalize from dictionary list sites to all sites, but I don't think it's clearly invalid either. Rather I think the big problem with this "study" is that they do not consider the quality of the results in terms of relevance to query. Google may be returning more results because a lot of those results are spurious.
Re:Yahoo pants down, egg on face, no WMD either. by barawn · 2005-08-15 15:58 · Score: 1

Do you think Google is focusing on indexing dictionary lists? Or that Yahoo are ignoring them in particular?

Well, the latter would be an intelligent thing for a search engine to do in returning results, but Yahoo could also feasibly have a crawler which doesn't bother indexing dictionary lists - i.e. once you get to some ridiculous number of independent words in the page, you toss it. It'd be a nice antispam tool, actually.

Or it could focus on HTML pages rather than text files, as most of the dictionary lists seem to be text files.
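The "toss it once it has a ridiculous number of independent words" filter could look something like the sketch below. It is purely hypothetical: nothing here is known about Yahoo's crawler, and both the function name and the threshold are invented for illustration.

```python
import re


def looks_like_word_list(text, max_distinct_words=5000):
    """Hypothetical crawler heuristic: flag pages with too many distinct words.

    The default threshold is an arbitrary placeholder, not a tuned value.
    """
    distinct = set(re.findall(r"[a-z]+", text.lower()))
    return len(distinct) > max_distinct_words


# Toy inputs: an ordinary sentence and a synthetic 100-entry "dictionary list".
normal_page = "the cat sat on the mat and the cat slept"
word_list_page = " ".join(a + b for a in "abcdefghij" for b in "abcdefghij")

# A small threshold is used only so the toy inputs fall on either side of it.
normal_flagged = looks_like_word_list(normal_page, max_distinct_words=50)
word_list_flagged = looks_like_word_list(word_list_page, max_distinct_words=50)
print(normal_flagged, word_list_flagged)
```

A crawler applying a rule like this would never index dictionary lists at all, so two-random-word queries would undercount its index regardless of how large it really is.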

Maybe it's not valid to generalize from dictionary list sites to all sites, but I don't think it's clearly invalid either.

It isn't clearly invalid. If you remove the dictionary-list results, you still get the same answer. That doesn't mean the original conclusion wasn't wrong, though - just that by luck, they heavily oversampled a subset that was a representative sample of the whole.

Interestingly, when Yahoo returns the estimated number of searches (on non-dictionary-list searches) it returns about 2X as many results as Google. But when you go through them, it's only about a third. So not only does Yahoo actually have about half the indexed pages of Google, but they lie to you and say they've got about twice - on each result.
Re:Yahoo pants down, egg on face, no WMD either. by foobarb · 2005-08-15 16:20 · Score: 1

Okay, but

a) If the web is really much bigger than the indexed web, then might not these two indices overlap rather than one being a subset of the other?

b) Couldn't someone just measure the size of each index and know?

c) Shouldn't we care more about inflation claims on results returned? Since when are these called "estimates"?

Accurate results? by bigwavejas · 2005-08-15 06:12 · Score: 5, Interesting

Google sometimes returns some pretty interesting/ entertaining results.

Try searching for the word, "failure" in Google and check the results.

This brings into question *accurate* results. In this case it appears that's left to interpretation.

--
"Simplify, simplify, simplify!" Thoreau

Re:Accurate results? by Anonymous Coward · 2005-08-15 06:18 · Score: 1, Insightful

Actually, this was the result of a bloggers' linking campaign to do just that.

In response, you can see Michael Moore in the #2 position.
Re:Accurate results? by DroopyStonx · 2005-08-15 06:24 · Score: 2, Funny

Um, GW Bush is the first result.

Seems fairly accurate to me...

--
We have secretly replaced these Slashdot mods' sense of humor with a rusty nail. Let's see if they notice!!
Re:Accurate results? by Monty845 · 2005-08-15 06:25 · Score: 1

The next logical step would be to take a list of the results that a query generates and examine the ones unique to each search engine. The difficult part would be creating a methodology for determining relevance that wasn't subject to the opinions of the analyst.

The other possibility is that yahoo's indexing system preferentially indexes popular pages.. not sure if that is a reasonable possibility.
Re:Accurate results? by Anonymous Coward · 2005-08-15 06:27 · Score: 0

The interesting thing though is that the word "failure" doesn't appear anywhere on the page.

Though I am not disagreeing with you...
Re:Accurate results? by jrallison · 2005-08-15 06:28 · Score: 5, Insightful

It is odd however the #1 result for failure is a webpage without the word "failure" in it.
Re:Accurate results? by ArsonSmith · 2005-08-15 06:29 · Score: 1

That's funny I get POOP, DICK, VIGINAS, POOP, DICK, VAGINAS. Wonder why that would be a failure? You're right though it is kinda funny. Say that like 5 times out load at work.

--
Paying taxes to buy civilization is like paying a hooker to buy love.
Re:Accurate results? by Anonymous Coward · 2005-08-15 06:30 · Score: 1, Informative

It is called a "Google Bomb"

http://en.wikipedia.org/wiki/Google_bomb
Re:Accurate results? by AnUnnamedSource · 2005-08-15 06:30 · Score: 0, Troll

Looks pretty accurate to me. Type in "failure"--get a picture of George W. Bush. How much more accurate do you want?

--
-- "On second thought, let's not go there. Camelot is a silly place."
Re:Accurate results? by MindStalker · 2005-08-15 06:30 · Score: 4, Insightful

Well, Google also indexes based upon referring links and not just the content of the page itself. So if many websites refer to GW as a failure, GW's page itself will turn up as a high hit. Yahoo does this as well, but doesn't necessarily give it the same weight. This could highly affect the number of results returned. If we say that Google returned X pages for a search on term "y", many of these pages may not actually mention "y", thus giving a larger page count for "y". With Yahoo's method, it will mainly return pages that mention "y" themselves, and possibly add some pages that are said to include "y" by links. This can vastly alter the count.
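To make the effect on counts concrete, here is a toy inverted index that can optionally credit a link's anchor text to the page being linked to. It is only a sketch of the general idea, not how either engine actually works, and the `.example` URLs and page texts are invented.

```python
from collections import defaultdict


def build_index(pages, links, use_anchor_text):
    """Build a word -> set of URLs index.

    pages: dict of url -> body text
    links: list of (source_url, anchor_text, target_url) tuples
    """
    index = defaultdict(set)
    # Every page is indexed under the words in its own body.
    for url, body in pages.items():
        for word in body.lower().split():
            index[word].add(url)
    # Optionally, a page is also indexed under the words used to link to it.
    if use_anchor_text:
        for _source, anchor, target in links:
            if target in pages:
                for word in anchor.lower().split():
                    index[word].add(target)
    return index


pages = {
    "bio.example": "official biography of the president",
    "blog.example": "a miserable failure of a day",
    "news.example": "election coverage",
}
links = [
    ("blog.example", "miserable failure", "bio.example"),
    ("news.example", "failure", "bio.example"),
]

content_only = build_index(pages, links, use_anchor_text=False)
with_anchors = build_index(pages, links, use_anchor_text=True)
print(sorted(content_only["failure"]), sorted(with_anchors["failure"]))
```

The same three pages yield one hit for "failure" in the content-only index and two once anchor text counts, even though the biography page never uses the word.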
Re:Accurate results? by ArsonSmith · 2005-08-15 06:33 · Score: 1, Interesting

Hmm, bumbling idiot possibly, but since when has becoming the President of the US, then being elected again been the mark of a failure????

--
Paying taxes to buy civilization is like paying a hooker to buy love.
Re:Accurate results? by l3v1 · 2005-08-15 06:42 · Score: 1

Try searching for the word, "failure" in Google and check the results.

You can't honestly think that someone sane enough would use any kind of text-indexing database search engine for making a query like "query". That would render the whole concept of rdbms and some dozen years of cbir research instantly useless, since you would need to filter out all the relevant [relevant for you, that is] information all by yourself from the vast amounts of useless crap that a response for a query like "failure" would give. So, what was your point again?

--
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
Re:Accurate results? by tarquin_fim_bim · 2005-08-15 06:49 · Score: 0

Not on the part of the individual in question perhaps, more the failure of the educated portion of the US electorate to stop a war mongering fascist from being re-elected.
Re:Accurate results? by Anonymous Coward · 2005-08-15 06:57 · Score: 0

If you turn off SafeSearch, you get Michael Moore in the #1 spot.
Re:Accurate results? by Krach42 · 2005-08-15 07:02 · Score: 1

Well, Michael Moore is second place.

So, I'd say that it's at least fair and non-partisan.

--

I am unamerican, and proud of it!
Re:Accurate results? by Anonymous Coward · 2005-08-15 07:04 · Score: 0

[...] to stop a war mongering fascist from being re-elected.
Totalitarian, not fascist; totalitarianism smacks of Bolshevik micromanagement, and fascism: the charisma of Hitler and Mussolini.
Re:Accurate results? by G27+Radio · 2005-08-15 07:05 · Score: 1

I believe this came about from Michael Moore (and others) creating links that say miserable failure and link to GWB's biography.

If you'll notice, Michael Moore's site shows up next on this list despite the fact that it no longer contains the "miserable failure" link.

In short, if you create a link to a site, the words in your link will also be associated with that site.
Re:Accurate results? by gstoddart · 2005-08-15 07:06 · Score: 1

It is odd however the #1 result for failure is a webpage without the word "failure" in it.

Since Google is bringing up pages with an eye to how many other people link to you, clearly a lot of people have used 'failure' in reference to GW Bush.

--
Lost at C:>. Found at C.
Re:Accurate results? by Krach42 · 2005-08-15 07:15 · Score: 1

Not on the part of the individual in question perhaps, more the failure of the educated portion of the US electorate to stop a war mongering fascist from being re-elected.

You're making the assumption that the educated portion of the US did not want him to be re-elected. Remember, people with an education aren't all the same liberals that wander around your campus.

The general view of voting educated people is that they are conservative.

I'd rather say, it's the success of a bunch of educated war-mongering jump-on-the-bandwagon people to keep him in office.

You can look at it as a failure of the minority opinion of the US to keep him from getting re-elected, but at that point, you have a smaller failure against a bigger success, and we can move on to arguing POV.

--

I am unamerican, and proud of it!
Re:Accurate results? by DroopyStonx · 2005-08-15 07:16 · Score: 0, Troll

But maybe that's the thing... Google is intelligent.

It just KNOWS.

I mean, we know by looking at a picture that a car is a certain color without the text "blue truck" being on a page - why couldn't google do the same?

Maybe google is god.

--
We have secretly replaced these Slashdot mods' sense of humor with a rusty nail. Let's see if they notice!!
Re:Accurate results? by ImaLamer · 2005-08-15 07:21 · Score: 1

Funny because "bumbling idiot" gives me this result first:

http://www.hypocrites.com/article9025.html

--
Get your Unix fortune now!
Re:Accurate results? by the_helper_monkey · 2005-08-15 07:22 · Score: 1

Well, if you do the same search on Yahoo! G-Dub shows up in the fourth spot. So if we're questioning accuracy, it has to be done for both engines.
Re:Accurate results? by Anonymous Coward · 2005-08-15 07:26 · Score: 0

So, you're saying that Yahoo's excuse is that Yahoo is using the old way of ranking pages that was vastly discredited when Google came out years ago. That's not a very good excuse. If I were a Yahoo shareholder, I'd be thinking about firing the board right about now.
Re:Accurate results? by thadog · 2005-08-15 07:34 · Score: 1

Google doesn't lie, so don't question the results!
Re:Accurate results? by 1u3hr · 2005-08-15 07:38 · Score: 1

when has becoming the President of the US, then being elected again been the mark of a failure????
Getting a job and one's performance in it are two different things. The Peter Principle applies here as well.
Re:Accurate results? by Anonymous Coward · 2005-08-15 07:40 · Score: 0

When I typed in 'failure' I got Bush, Clinton, Democrat, and Republican. Pretty much all one and the same.
Re:Accurate results? by fermion · 2005-08-15 07:44 · Score: 1

The problem with pre-Google search engines was that they assumed that the words on a webpage were inherently linked to the content of the web page. Therefore, if a web page had many occurrences of "simpson" in the web page and keywords, then it was likely that such a page might have a relation to the actress or animated show.
Google fixed this problem by including the links to the pages in the keyword search, which prevented web sites from simply putting unrelated words in the context to generate hits. However, it did nothing to keep owners from setting up web farms or others from attacking the site by linking with unrelated phrases.
What is interesting is that Google results are getting less reliable. Often a few of the first several results are link farms or other content neutral ad sites. We should have new technology to fix the problem, in the same way that Google fixed the problems with Alta Vista, but we don't.

--
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
Re:Accurate results? by scovetta · 2005-08-15 07:45 · Score: 1

I believe the Waffle Campaign was based on this idea.

--
Wer mit Ungeheuern kämpft, mag zusehn, dass er nicht dabei zum Ungeheuer wird. --Nietzsche
Re:Accurate results? by ArsonSmith · 2005-08-15 07:48 · Score: 1

Following the Peter Principle then, he is almost good enough to be President of the US. Again I wouldn't really call that failure. Moron, bad speaker, corporate-agenda-driven person, there are several ways to define GWB but failure is absolutely not one of them. If you're playing a game and the score is 70 - 40 and you're on the winning team, and you miss that last shot, does that make you a failure?

--
Paying taxes to buy civilization is like paying a hooker to buy love.
Re:Accurate results? by MindStalker · 2005-08-15 08:03 · Score: 1

Not necessarily, they simply use a different mix than Google. That's why failure turns up GW in slot 1 for Google, and slot 4 for Yahoo. It should not turn up at all if they were indexing solely by page content.
Re:Accurate results? by Anonymous Coward · 2005-08-15 08:12 · Score: 0

When you grow up around power and money, and you are helped into every job you have held, and you manage to perform poorly in every one of those jobs, usually requiring outside help to bail you out, I don't think the Peter Principle even begins to capture the raw failure that has been W's life.
Re:Accurate results? by cheaphomemadeacid · 2005-08-15 08:26 · Score: 0

Yahoo is DEFINITIVELY better, just look at the scientific evidence! If we do a search on google, only searching for site:yahoo.com, it only finds 48.1 million results. Doing the same on yahoo.com yields 425 million results! So yahoo is ALMOST 10 times better! Vote for yahoo now!
Re:Accurate results? by DroopyStonx · 2005-08-15 08:35 · Score: 1

Ooop, looks like a Bush-supporter found my posts!!

--
We have secretly replaced these Slashdot mods' sense of humor with a rusty nail. Let's see if they notice!!
Re:Accurate results? by The+Angry+Mick · 2005-08-15 08:36 · Score: 1

Often a few of the first several results are link farms or other content neutral ad sites. We should have new technology to fix the problem

I think the problem here is not so much a need for new technology to filter out the link farmers, but rather a set of domain registrars that actually validate the ownership of the domains being registered.
Here at my office, we're currently running the rounds with a company that registered a ".com" version of our non-profit domain name, and is trying to sell services that he's literally cut-n-pasted from our site. According to the registrar's data, the company that owns the similar domain is based in California using two street addresses; one of which is in a Mailboxes, Etc. store and the other a UPS store.
If ICANN would threaten to pull the accreditation of any registrar that refuses to validate that a domain name's owner is an actual business, instead of a blatant front, I imagine lots of the link farms would go away.
Of course, that may just be wishful thinking on my part.

--
I'm not tense. I'm just terribly, terribly, alert.
Re:Accurate results? by Izmet+Fekali · 2005-08-15 08:39 · Score: 1

This is a result of the "Miserable Failure" Googlebomb and was covered all over the web at the time, for example here.

--
-- Izmet Fekali Burek Experts Ltd.
Re:Accurate results? by theapodan · 2005-08-15 08:49 · Score: 0, Flamebait

A Bush supporter?

More like THE Bush supporter, as bad as polls and general consensus are against him.

Mod me flamebait if you must, but just read the paper to find that I'm becoming more and more truthful.
Re:Accurate results? by Anonymous Coward · 2005-08-15 12:31 · Score: 0

> It is odd however the #1 result for failure is a webpage without the word "failure" in it.

Not really. Google is just listing the #1 page that should contain the word "failure."

Remember, Google indexes by relevance, not keywords. The #1 result is absolutely relevant to the search term, and the fact that the search term isn't listed doesn't change that.

(Unless you argue that not admitting failure increases the chances of repeating it - in which case not mentioning it actually increases its relevance.)
Re:Accurate results? by 1u3hr · 2005-08-15 14:14 · Score: 1

but failure is absolutely not one of them. If you're playing a game and the score is 70 - 40 and you're on the winning team.
To me it looks like 80-20, on the losing team.
In any case, posterity will decide.
Re:Accurate results? by Deodat · 2005-08-15 17:33 · Score: 0

Haven't you heard of Google bombing?

NCSA is comparing the archives... by Anonymous Coward · 2005-08-15 06:13 · Score: 1, Funny

by surfing to each page in each archive with the most recent NCSA Mosaic.

It will take a while.

Re:NCSA is comparing the archives... by Winckle · 2005-08-15 06:34 · Score: 1

Yeah, especially when they reach http://www.dukenukemforever.com/

Conclusion by mboverload · 2005-08-15 06:13 · Score: 3, Informative

"Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results. It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Googles index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google. "

Re:Conclusion by nutshell42 · 2005-08-15 06:26 · Score: 4, Insightful

And Nutshell42's New Amazing Search Engine gives you even more results. Even though my index size is only 1.something million. I simply return every single wikipedia article in every language as result no matter what you search.
Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous. Only a thorough study comparing results and how useful they were (which is hard to do, expensive and time consuming) has any meaning that goes beyond producing lots of funny numbers and percentages.
96.34% of all percentages are completely useless.
btw. I use google, not yahoo

--
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage
Re:Conclusion by rossifer · 2005-08-15 06:30 · Score: 3, Insightful

Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous.

No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.

What you're talking about is measuring the effectiveness of page ranking, which is a completely different measure of how good a search engine is. Note: Google wins on that measure too.

Regards,
Ross
Re:Conclusion by Anonymous Coward · 2005-08-15 06:52 · Score: 1, Interesting

Hmmm, from my experience google sometimes returns results that don't have the search terms in the page... but a page that has that search term links to that result... i think that makes sense... but then again i might just be on crack
Re:Conclusion by Retric · 2005-08-15 06:54 · Score: 1

No, google will add pages that don't include the word you searched for. Thus you can't assume that the page is not in yahoo's index because they did not return it.

EX: Search for "Failure" on google and you get linked to a page that never uses that word. Granted a Biography of President George W. Bush might fit the search criteria but it might not be returned by all search engines even if it was in their database.
Re:Conclusion by barawn · 2005-08-15 06:55 · Score: 5, Insightful

No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.

Actually, it might not be, thanks to their methodology.

They only used searches with less than 1000 results. They therefore got a lot of searches with small results numbers (because they were searching for bizarre word combinations, like "promotion bedabble"). The total number of results was something like 500,000 or so (order of magnitude) for 10,000 searches. That's an average of 50 results/search, and I'd bet there's a large, large tail, so the most common search is probably something like 10 results.

The problem with this is that in their word list, the same sites are being returned over and over! For instance, sites containing dictionary lists appear in both "promotion bedabble" and "foliolate defecations" because, duh, that's the only place they'll appear. Since they're just searching the same type of site over and over, they get the same result magnified a lot: Google has more "dictionary lists" in its index than Yahoo. Most of the "dictionary list" word searches returned about 10-20 for Google, and few, if any, for Yahoo.

It's a pretty serious flaw in the methodology, as far as I can tell - they're double counting huge numbers of results, and so they're not really getting a good statistical sample of the index.
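
A toy sketch of the double-counting concern, in Python. The URLs and result sets below are invented for illustration and are not taken from the study's logs; the only point is that adding up per-query hit counts can count one word-list page once for every query it matches.

```python
# Hypothetical per-query result sets; every URL here is invented.
results = {
    "promotion bedabble": {"wordlist.example/a", "wordlist.example/b"},
    "foliolate defecations": {"wordlist.example/a", "wordlist.example/b"},
    "nickeled craniology": {"wordlist.example/a", "other.example/x"},
}

def summed_hits(results):
    """Total hits when each query's results are simply added up."""
    return sum(len(urls) for urls in results.values())

def unique_pages(results):
    """Number of distinct pages seen across all queries."""
    seen = set()
    for urls in results.values():
        seen |= urls
    return len(seen)

print(summed_hits(results))   # repeated pages are counted once per query
print(unique_pages(results))  # each page is counted once
```

With this made-up data the summed figure is twice the number of distinct pages, which is the kind of inflation the comment is worried about.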
Re:Conclusion by sysadmn · 2005-08-15 07:00 · Score: 1

To be pedantic, and I am, shouldn't it say that "Yahoo!'s search engine consistently returned fewer results than Google"?

--
Envy my 5 digit Slashdot User ID!
Re:Conclusion by Anonymous Coward · 2005-08-15 07:11 · Score: 0

The fact that a lot of pages returned by Google that are not returned by Yahoo are lists of words gives a hint of another problem with the methodology: Google probably indexes more of each page (let's say the first 100K) than Yahoo.
Let's do an experiment:
Searching for "nickeled craniology" returns 18 pages on google, none on Yahoo. Most of those pages seem to be lists of words, which start with something like "aardwolf abaca abaci abacuses".
Now if you search for THAT string ("aardwolf abaca abaci abacuses" with the quotes) in both search engines, the results are very different: Yahoo finds 56 results, Google only 14.
So maybe Google indexes more bytes of data, while Yahoo indexes more pages.
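
A rough Python sketch of that hypothesis, under the assumption that an engine only indexes the first part of each page. The page content, the URL and both cutoff values are invented; nothing here reflects how either engine actually works.

```python
def build_index(pages, max_chars):
    """Map each word to the pages where it occurs within the first max_chars."""
    index = {}
    for url, text in pages.items():
        # Only the leading slice of the page is indexed (the assumed cutoff).
        for word in text[:max_chars].split():
            index.setdefault(word, set()).add(url)
    return index

# Invented page: a long word list whose rare words sit far from the top.
pages = {
    "wordlist.example": "aardwolf abaca abaci " + "filler " * 50 + "nickeled craniology",
}

shallow = build_index(pages, max_chars=40)      # hypothetical small cutoff
deep = build_index(pages, max_chars=10_000)     # hypothetical large cutoff

print("nickeled" in shallow, "nickeled" in deep)
print("aardwolf" in shallow, "aardwolf" in deep)
```

Both toy indexes hold the same single page, but only the deeper one can answer a query for words near the end of it, so hit counts alone would not distinguish "more pages" from "more of each page".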
Re:Conclusion by christor · 2005-08-15 07:19 · Score: 2, Interesting

Instructions to build search engine with "largest number of indexed pages":

1. Make a list of 999 sites.
2. Set up website with a query input form.
3. Upon query, return the entire list.

A major problem with this study is that the number of results returned depends on two variables: (a) the number of sites in the index (so far so good) and (b) the accuracy and sensitivity of the search algorithm. The latter is the very point of a search engine. Yahoo may, who knows, be more selective in returning results.

I'm a google fan, but these results prove nothing.
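
The three steps at the top of this comment can be sketched in a few lines of Python. The site names are placeholders and the function name is made up; it is only meant to show that a raw result count says nothing about index size or quality.

```python
# Hypothetical "engine" that ignores the query entirely.
SITES = [f"site{i}.example" for i in range(999)]  # step 1: a list of 999 sites

def search(query):
    """Steps 2 and 3: accept any query and return the whole list."""
    return list(SITES)

print(len(search("promotion bedabble")))
```

Every query comes back with 999 results, just under the study's 1,000-result cutoff.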
Re:Conclusion by nutshell42 · 2005-08-15 07:26 · Score: 1

No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.
No it's not. For some more unusual terms over half the results google offers are sites that don't have the word anywhere; it's just on pages linking to the site. Most of the time that's not what you want and I've never had a useful result that didn't contain all search terms.
In addition, perhaps yahoo is just better at filtering out meta-tag abuse. There are topics where google only returns dialers, web farms and similar stuff for the first three pages. Incidentally that's also often the case if you're searching for uncommon combinations that don't tend to appear naturally on sites not related to the (also uncommon) topic you're interested in.
What you're talking about is measuring the effectiveness of page ranking, which is a completely different measure of how good a search engine is.
No, I'm talking about the quality of the results. This does mean page ranking for search terms with 1 Mio+ results but it also means indexing especially if there are only 5-10 useful sites on the web
Note: Google wins on that measure too.
Now that settles it. Very in depth. Note: I never doubted that. As I said I use google, but this study sounds like a kneejerk reaction to a perceived attack on Google's supremacy. Action stations! Action stations! Yahoo at the gates! All geeks to the rescue!

--
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage
Re:Conclusion by Krach42 · 2005-08-15 07:29 · Score: 1

I'd have to agree with you. There's no contingency taken that the results must be disjoint.

So, we'd need a test for search results of the same sort, only track unique pages. This would really only give an average list, but we're trying to find the number of underlying indices, not the number of search results for foo and bar independently.

--

I am unamerican, and proud of it!
Re:Conclusion by betsywetsy · 2005-08-15 07:30 · Score: 1

The fact that a lot of pages returned by Google that are not returned by Yahoo are lists of words gives a hint of another problem with the methodology: Google probably indexes more of each page (let's say the first 100K) than Yahoo. Let's do an experiment...

Interesting idea, and a good test of it!
Re:Conclusion by Anonymous Coward · 2005-08-15 07:32 · Score: 0

The study only looked at English queries.

In order to create a large number of queries that returned less than 1,000 results, we took the commonly available English Ispell Wordlist (a total of 135,069 words) [3] and wrote a PERL script to randomly select two words at a time from that list.

Perhaps Yahoo! indexes more pages in other languages than the competition. This methodology wouldn't confirm that.
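
For readers who want to picture the query generation quoted above, here is a minimal Python sketch. It is not the study's Perl script; the function name, the tiny word list and the choice to draw two distinct words are all assumptions made for illustration.

```python
import random

def random_pair_queries(words, n, seed=None):
    """Return n queries, each made of two distinct words drawn at random."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, 2)) for _ in range(n)]

# Tiny stand-in list; the study says it used the 135,069-word Ispell list.
words = ["promotion", "bedabble", "foliolate", "nickeled", "craniology"]
queries = random_pair_queries(words, 3, seed=0)
print(queries)
```

Because the input is an English word list, every generated query is English, which is the limitation this comment points at.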
Re:Conclusion by Anonymous Coward · 2005-08-15 07:33 · Score: 0
The method used to estimate index sizes is one used by search engine companies and seems to produce valid estimates and though there is something missing from the analysis (they didn't seem to investigate the documents found by one but not the other to investigate if these were indeed relevant from an IR perspective), the numeric results are usually indicative of index size.
But don't take my word for it. Perform your own experiments. There are several interesting things you could do to gain an understanding of methodologies:
- Index same corpus with different tools (like htdig versus Nutch)
- Index different corpora with same tool
- Index different corpora with different tools
- Make use of different linguistic tools (e.g. lemmatization, synonym injection etc)
If I were to criticize anything here it would be that too little time was spent on trying to gauge contributing factors (like linguistics tools etc) and there are some assumptions that aren't necessarily true. These should be listed up front and one should attempt to look at their contributions, but this is way beyond what most posters seem capable of providing valid criticism for.
Re:Conclusion by barawn · 2005-08-15 07:34 · Score: 2, Insightful

Correction to myself: the total responses to their list was ~150,000 to ~10,000 searches for Yahoo, and ~400,000 for Google. So the average is 15 results for Yahoo and 40 for Google. Given that most "dictionary list" results were between 10 and 40, that should pretty much tell you that their entire result is just a massively multiplied reflection of those searches.

As an interesting aside, though: if you dig through their log, you can see several interesting things. If you look at only results which return between 100 and 1000 results, you get things like "battening liberate", which returned 186 for Google, and 97 for Yahoo. Those aren't dictionary list results - the interesting thing is that in almost all of those results, you see an extremely similar pattern.

"battening liberate":
Ratio of Google/Yahoo for this query:
Duplicates Omitted Estimate: 0.522305
Duplicates Omitted Total: 1.917526
Duplicates Included Estimate: 0.533962
Duplicates Included Total: 2.350427

"convexity hac"
Ratio of Google/Yahoo for this query:
Duplicates Omitted Estimate: 0.573593
Duplicates Omitted Total: 3.340000
Duplicates Included Estimate: 0.583700
Duplicates Included Total: 2.490566

"meekness goatee"
Ratio of Google/Yahoo for this query:
Duplicates Omitted Estimate: 0.607053
Duplicates Omitted Total: 2.207692
Duplicates Included Estimate: 0.604010
Duplicates Included Total: 2.745562

So Yahoo claims it has 2X as much as Google, but actually only returns about 30-50%.

Interestingly, these mimic the "dictionary list" results, which is curious. So their conclusions seem right, but their methodology seems very wrong.
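
The arithmetic in this comment can be checked with a few lines of Python. The totals are the rounded figures quoted above, and the 186 and 97 are the "battening liberate" counts from the same comment; nothing else is assumed.

```python
# Figures as quoted in the comment (approximate totals).
searches = 10_000
yahoo_total = 150_000
google_total = 400_000

yahoo_avg = yahoo_total / searches    # average results per search on Yahoo
google_avg = google_total / searches  # average results per search on Google
print(yahoo_avg, google_avg)

# "battening liberate": 186 results on Google, 97 on Yahoo.
ratio = 186 / 97
print(round(ratio, 6))  # matches the "Duplicates Omitted Total" line above
```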
Re:Conclusion by Anonymous Coward · 2005-08-15 07:37 · Score: 0

Go ahead and be pedantic. Most here won't understand you anyway. Apparently grammar isn't taught in schools nowadays.
Re:Conclusion by Anonymous Coward · 2005-08-15 07:45 · Score: 0

This is why I use Google reluctantly. The majority of results I get, especially near the top, for most real world searches are just garbage pages with word lists. This problem is an order of magnitude less severe on Yahoo. I can usually count on more results from Google, but more results that matter and about ten times fewer word list+banner ad pages on Yahoo.
Re:Conclusion by MemeRot · 2005-08-15 07:55 · Score: 1

That would be a very good point.

Except that the GW Bush biography shows up in the top ten of Yahoo's search results as well.
Re:Conclusion by Retric · 2005-08-15 08:11 · Score: 1

The point is not that this page shows up but rather that pages will show up that don't have the given search words in them. Google might have simply set a low threshold before they show you pages. Or as seen by the biography page they could give more rank based on how many pages have that word and link to that page.

Anyway, it's easy to test:

Google: Goat Spork tactile shows Goat Spork tactile as its 4th hit.
Yahoo: Goat Spork tactile does not return that page.

But Yahoo:"tracks 40 to 50 fitness indicators" Forgotten Atrocity gives the page SO:

Their method is flawed.
Re:Conclusion by Council · 2005-08-15 08:14 · Score: 1

I've noticed lately that adding extra terms to Google searches can generate MORE results, so obviously they're not returning every page containing those terms. It's good to know they've got a more complex search algorithm than simple regexp, but at the same time it means that you can't index database size by such simple assumptions about how they're searching.

Comments?

--
xkcd.com - a webcomic of mathematics, love, and language.
Re:Conclusion by Anonymous Coward · 2005-08-15 08:29 · Score: 0

> It's a pretty serious flaw in the methodology, as far as I can tell -
> they're double counting huge numbers of results, and so they're
> not really getting a good statistical sample of the index

On top of that, they are ignoring the possibility that for every single query for which Google returns over a thousand hits, Yahoo might return twice as many as Google.

It is very easy to write a search engine that always returns hundreds of links for every search (just make a few hundred page 'somebody searched for 'xxx' on ...', and return a link to it. throw away pages made for old searches when running out of room on your server). This paper would score such a search engine higher than Google.
Re:Conclusion by Tired+and+Emotional · 2005-08-15 09:13 · Score: 1

I think it could be worse than that.

What if the searches that showed Yahoo to search more pages (assuming for the moment that this is in fact the case) were those that blew the 1000 hit limit?

In other words there are a bunch of pages that are unlikely to ever appear in one of their accepted searches and I don't think it is a reasonable assumption that the sizes of these sets are comparable for each engine.

(Which does not mean I believe their result is wrong, just that I think they have failed to prove it is correct)

--
Squirrel!
Re:Conclusion by SnprBoB86 · 2005-08-15 09:18 · Score: 1

Now, I haven't used Yahoo in years... but let me just say:

I'd much rather have a small set of good results than a large set of bad ones. As a matter of fact, I'd much rather have a small set of bad results than a large set of bad results. I use search engines to answer questions and provide information, not to wade through huge piles of noise. The number of results returned is a terrible measure of search engine quality.

I would like to see a signal to noise ratio.

--
http://brandonbloom.name
Re:Conclusion by barawn · 2005-08-15 10:02 · Score: 1

What if the searches that showed Yahoo to search more pages (assuming for the moment that this is in fact the case) were those that blew the 1000 hit limit?

Unless the extra pages that Yahoo has all contain similar text, this is unlikely. Otherwise, you can always find word combinations which will be below the 1000 hit limit which will return pages in Yahoo's extra corpus.

Randomly going through words is a good way to do this, except for the bias produced by word list pages. There are a few ways to fix this bias. One is to only use queries which return more than 100 hits, which is probably safe, but kills your statistics (10,000 searches down to 300). Another is to rerun, taking the queries which gave you more than 1000 results and adding a third word to them, and you'll probably get a high-statistics set of data with less than 1000 results that doesn't include many dictionary lists.
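
The first of those fixes might look something like this in Python. Only the "battening liberate" counts (186 on Google, 97 on Yahoo) come from this thread; the other rows, the function name and the exact rule (judge each query by its larger count) are assumptions made to keep the sketch short.

```python
def keep_midrange(counts, low=100, high=1000):
    """Keep queries whose larger hit count is above low and below high."""
    return {
        query: (google, yahoo)
        for query, (google, yahoo) in counts.items()
        if low < max(google, yahoo) < high
    }

# query -> (Google hits, Yahoo hits)
counts = {
    "battening liberate": (186, 97),   # from the thread
    "promotion bedabble": (18, 0),     # hypothetical word-list-only query
    "common pair": (5000, 4000),       # hypothetical query over the cutoff
}

kept = keep_midrange(counts)
print(kept)
```

The small word-list queries and the over-the-cutoff queries both drop out, leaving only the mid-range ones to compare.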
Re:Conclusion by X · 2005-08-15 12:49 · Score: 1

Concluding that Yahoo's index has to be smaller because they return fewer results seems a bit overzealous.

No, it's accurate. They're testing Yahoo's claim of how many pages they've indexed, which just means that all indexed pages that contain the requested words should be returned from the search request. If yahoo returns fewer unique pages, yahoo has indexed fewer pages.

No. If they were using the same algorithm to determine which pages to return, you'd be correct. They don't. It's entirely possible that Yahoo's algorithm is just being more discerning. This is particularly likely as in order to preserve relevance it is usually necessary to make your algorithm more discerning when you increase your index.

Given that the queries were made up of random pairings of words from Ispell which returned less than 1000 results (we're talking less than a millionth of either index), it's quite likely that most of the queries were nonsensical and even in the result sets that did come back, it'd be hard to argue any of the pages are relevant. A perfect search engine would probably return nothing.

--
sigs are a waste of space

Interesting. by Poromenos1 · 2005-08-15 06:13 · Score: 1

I was wondering how accurate were the results that the companies themselves reported. Or are they accurate, but they just spidered sites that don't matter to anyone?

--
Send email from the afterlife! Write your e-will at Dead Man's Switch.

They might have a larger index file by BlackCobra43 · 2005-08-15 06:14 · Score: 4, Insightful

but they can't sift through it nearly as well as Google, so what does it matter? Even if you have a bigger dictionnary, if you can't speak English at all it won't do you much good.

--
I never spellcheck and I freely admit it. Save your karma for more worthwhile "lol erorrs" replies

Re:They might have a larger index file by Anonymous Coward · 2005-08-15 06:21 · Score: 0

stfu n00b
Re:They might have a larger index file by babyrat · 2005-08-15 06:43 · Score: 1

Even if you have a bigger dictionnary, if you can't speak English at all it won't do you much good.

What if the dictionary is a German dictionary and you speak German?
Re:They might have a larger index file by BlackCobra43 · 2005-08-15 06:53 · Score: 1

Then the dictionary would indeed help you speak German better...? I'm sorry, was that a trick question? I'm not following you.

--
I never spellcheck and I freely admit it. Save your karma for more worthwhile "lol erorrs" replies
Re:They might have a larger index file by Krach42 · 2005-08-15 07:07 · Score: 1

I bought a German-English Dictionary, (I'm a native English speaker) and casually happened to mention to someone that it had words in it that I'd never seen before.

My friend replied, "Well, duh, it's not like you know German that well."

To which I pointed out to him, that I was referring to the ENGLISH side.

Any foreign language dictionary that can give me the obscure proper past tense of the nautical word "heave" is awesome in my book. (FYI: it's "heave, hove, hove")

--

I am unamerican, and proud of it!
Re:They might have a larger index file by richie2000 · 2005-08-15 07:32 · Score: 1

To which I pointed out to him, that I was referring to the ENGLISH side.
I really, really hope you did that by hitting him over the head with it. The English side, of course. Was it heavy?

--
Money for nothing, pix for free
Re:They might have a larger index file by Krach42 · 2005-08-15 07:52 · Score: 1

It's about as large as a normal dictionary. I won a scholarship in high school for being the student who had learned German the best (meaning: at all), so they gave me $50. So me with my moral bearing went out and bought a $50 German-English dictionary.

I'm happy that I have purchased it though. The thing is great. Not as good as some online translators, but sometimes it's good to have the paper between your fingers, and just rifle through and look up randomness.

--

I am unamerican, and proud of it!
Re:They might have a larger index file by Anonymous Coward · 2005-08-15 09:55 · Score: 0

Have you actually tried Yahoo in the last year or so?

They've closed the gap quite a bit, to the extent that I switched to Yahoo when I started using MyWeb and haven't noticed much difference. YMMV, of course.
Re:They might have a larger index file by Anonymous Coward · 2005-08-15 10:34 · Score: 0

where are the offtopic mods?

sir, you really think your offtopic drivel is so important to use your +1 karma bonus as well?
Re:They might have a larger index file by Krach42 · 2005-08-16 07:06 · Score: 1

I always post with my karma bonus.

If you don't like it, then the next time you have mod points, mod me down, and get my karma down to the point that I don't have a karma bonus anymore.

Until then, shut up, I can use my karma bonus however I want. That's my right as someone who has the karma bonus.

--

I am unamerican, and proud of it!

Why not use both? by LuciferBlack · 2005-08-15 06:14 · Score: 1

Just use both...Then you'll be certain to have a nice unbiased search result. ;)

--
I'm working on a good joke about your mom being /.'d, but it's not finished yet.

What would you want them to return? by brunes69 · 2005-08-15 06:16 · Score: 1

All Google does is index the web. In this case, it seems like there are more web pages/more highly linked pages about GW being a failure than anyone else.

Is this that hard to believe? What would you rather it return for such a query? A dictionary definition? If you want a dictionary definition, use the define: operator.

Trust me - GW will not be on the top of the failure list forever. In another few years we will have a new most-hated person. This is the nature of a real web index, because it is the nature of the web, and of society itself - it is fickle.

Re:What would you want them to return? by bigwavejas · 2005-08-15 06:20 · Score: 1

I have no opinion on it actually. I just found it interesting Google displayed GW as result 1 and Yahoo! as result 4. Obviously there are two different search methodologies used here.

--
"Simplify, simplify, simplify!" Thoreau
Re:What would you want them to return? by Dan+Up+Baby · 2005-08-15 06:21 · Score: 1

If this was a natural result I would agree with you, but like "French Military Victories" it was orchestrated; it's not the real web, and it doesn't illustrate how the web actually uses the word.
Re:What would you want them to return? by Intron · 2005-08-15 06:29 · Score: 4, Insightful

The top of the page return for Yahoo is

"Failure on eBay Find failure items at low prices. "

which illustrates the most important difference between Yahoo and Google.

--
Intron: the portion of DNA which expresses nothing useful.
Re:What would you want them to return? by robertjw · 2005-08-15 07:01 · Score: 1

What would you rather it return for such a query?

Something related to the word "failure". Did a search of the pages for GW, Jimmy Carter and Michael Moore that came up for a search word "failure". That word isn't even in any of those three pages. Seems to me there is something wrong when a search term lists pages that don't even have the actual word in it.

--
Find coupons in Greeley
Re:What would you want them to return? by mmkkbb · 2005-08-15 07:22 · Score: 2, Funny

Seems to me there is something wrong when a search term lists pages that don't even have the actual word in it.

donkey rhubarb

Once this comment is spidered, it will work towards PETA coming up when people search for "Donkey" and "rhubarb". If you check the cached version of the GW biography, it will say this at the top.

--
-mkb
Re:What would you want them to return? by Anonymous Coward · 2005-08-15 08:56 · Score: 0

... and the first Sponsored Link on Google is:
Success What Stopping You Life Success Has A Formula! Come and Get It Here. TheLifeSuccessFormula

your point being?
Re:What would you want them to return? by quanticle · 2005-08-15 09:18 · Score: 2, Informative

Actually Slashdot prevents robots from spidering its comment pages...

So your point is totally moot...

--
We all know what to do, but we don't know how to get re-elected once we have done it
Re:What would you want them to return? by Doc+Ruby · 2005-08-15 09:21 · Score: 1

Whether or not the high occurrence of links of "failure" to the Bush biography page was "orchestrated", it is the reality of state of the Web. At least as far as Google's formula ranks/weights pages/links. Unless you're saying that it was somehow orchestrated by Google, to make Bush look bad. You're probably just saying that the Web you'd prefer wouldn't be in that state, though the real one is.

--
--
make install -not war
Re:What would you want them to return? by -brazil- · 2005-08-15 10:15 · Score: 2, Insightful

it is the reality of state of the Web. At least as far as Google's formula ranks/weights pages/links.

That second part is the important one. If search results can be manipulated by relatively small groups of people, this can be abused, e.g. for search engine spamming, thereby limiting the usefulness of the search engine.

--
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger
Re:What would you want them to return? by Doc+Ruby · 2005-08-15 10:36 · Score: 1

So let's have some evidence that the Google ranking of Bush's page as "#1 for 'failure'" is not the actual state of the Web. Extraordinary claims require extraordinary evidence.

--
--
make install -not war
Re:What would you want them to return? by -brazil- · 2005-08-15 10:49 · Score: 1

Here ya go. Basically, it seems that Google puts an extraordinarily high weight on link texts used to link to a document, so that a few hundred people linking to the Bush bio with the words "miserable failure" cause Google to think it's a top search result for those words - completely independent of the actual content of the document itself. The small number of pages involved is IMO the main argument against the theory that the google result accurately represents the "actual state of the web".
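
A toy Python illustration of the mechanism described here, ranking pages purely by the text of inbound links. The link data, URLs and function name are invented, and this is a deliberately crude stand-in, not Google's actual ranking formula.

```python
from collections import Counter

# Hypothetical link graph: (anchor text, target URL). All entries invented.
links = [
    ("miserable failure", "bio.example/president"),
    ("miserable failure", "bio.example/president"),
    ("failure analysis", "engineering.example/faq"),
    ("home page", "bio.example/president"),
]

def rank_by_anchor_text(links, query):
    """Rank targets by how many inbound anchors contain every query word."""
    terms = query.lower().split()
    votes = Counter(
        target
        for anchor, target in links
        if all(term in anchor.lower().split() for term in terms)
    )
    # The page's own content is never consulted, only the link text.
    return [url for url, _ in votes.most_common()]

print(rank_by_anchor_text(links, "failure"))
```

In this sketch two links are enough to put the biography page first for "failure", even though the ranking never looks at what the page itself says.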

--
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger
Re:What would you want them to return? by Doc+Ruby · 2005-08-15 11:35 · Score: 1

The actual state of the Web is that links are more important than flat content. It's not just a bunch of files, it's a self-documenting reference system. Of course you know that - everyone posting to Slashdot does. But you're discounting the importance of that, because it leads to a result you don't like. Yet most people favor Google because it presents "more relevant" results to the top of its list, using exactly these kinds of metrics. In fact, "the Web" is more than its hyperlinked content: it's the people who consume it. And Google's ranking model represents that Web more accurately to us, the essential part of it, than do its competitors.

So perhaps we just disagree on semantics. But I think that the definition of what Google is modeling is important. And I'd say that most people on the Web agree.

--
--
make install -not war
Re:What would you want them to return? by monkeydo · 2005-08-15 11:47 · Score: 1

"Extraordinary" evidence. Courtesy of Google.

--
Si vis pacem, para bellum
The only thing more annoying than a Libertarian is an (un|mis)informed Libertarian
Re:What would you want them to return? by Doc+Ruby · 2005-08-15 11:55 · Score: 1

That is evidence only that the popular spin on the Web is that Google's page ranking is distorted when it demonstrates something unpopular with the political/media pundits, but not when they use Google for its "high relevancy" all day long. The Web's links are more meaningful than the equivalent unlinked text, and these rankings demonstrate that.

If you can prove to me that there's another "failure" better represented on the Web to its consumers than is Bush, then you've got a point. Otherwise, all you've got is a double standard - no matter how popular it might be among a vocal minority.

--
--
make install -not war
Re:What would you want them to return? by monkeydo · 2005-08-15 13:15 · Score: 1

The Web's links are more meaningful than the equivalent unlinked text, and these rankings demonstrate that.

It doesn't demonstrate that at all. You've concluded that, and since that's how Google ranks pages, you've further concluded that Google's rankings are correct. You've created a syllogism, so further proof would be useless.

--
Si vis pacem, para bellum
The only thing more annoying than a Libertarian is an (un|mis)informed Libertarian
Re:What would you want them to return? by cduffy · 2005-08-15 13:27 · Score: 1

Perhaps you didn't notice that the EBay "failure" link returned by Yahoo! isn't clearly marked as sponsored, whereas Google's comparable return is?
Re:What would you want them to return? by Doc+Ruby · 2005-08-15 13:33 · Score: 1

No, I've asserted that I generally agree with the ranking system produced by Google, renowned to be the best model of the Web's relevance to search terms. As generally agreed by the market. All you've got is your own foregone conclusion, with nothing but "no it's not" to back it up. You're the one with the syllogism, which is why you've got no proof. Any actual proof, if you had any, would break my syllogism, if such were my logic. But it ain't, and you don't, and that's that. QED.

--
--
make install -not war
Re:What would you want them to return? by Anonymous Coward · 2005-08-15 16:43 · Score: 0

Perhaps you didn't notice that the EBay "failure" link returned by Y! is clearly marked as sponsored?

In fact, it is to the far right on the page in a thin column with the heading "SPONSOR RESULTS."
Re:What would you want them to return? by rk · 2005-08-15 18:48 · Score: 1

On the other hand, searching both for "dead puppies" yields a "dog training secrets" ad at the top of Yahoo's, but from Google (on the ad bar on the right side), I get an offer from ConsumerIncentivePrograms.com for $500 and dead puppies "enter your zip code and get yours".

I'm really not sure which is worse, but they're both pretty damn funny. Of course, I'm a sick bastard, so...
Re:What would you want them to return? by BillyBlaze · 2005-08-15 20:09 · Score: 1

The original phrase for the google bomb was "miserable failure," and a Yahoo search for that does indeed turn up the biography.
Re:What would you want them to return? by cduffy · 2005-08-15 21:15 · Score: 1

to the far right on the page in a thin column
What part of "clearly" escapes you? I think you make my point.
Re:What would you want them to return? by monkeydo · 2005-08-16 03:00 · Score: 1

You can't disprove a syllogism, because a syllogism is true by definition.

In your case "Google's model of the web is the best model" therefore "Google returns the best results." You've created a logical construct where no one can tamper with Google results, because simply the fact that the results come from Google means they are the correct results.

It doesn't matter to you that the page that is number one in the results for "miserable failure" doesn't actually have any content useful to someone searching for information on miserable failures, or that there is copious evidence that the reason for that ranking was a deliberate campaign by a relatively small number of sites. According to you, as long as Google returns those results, they must reflect the "current state of the web" because you've conveniently defined the "current state of the web" to be whatever Google says it is.

HTH
HAND

--
Si vis pacem, para bellum
The only thing more annoying than a Libertarian is an (un|mis)informed Libertarian
Re:What would you want them to return? by Doc+Ruby · 2005-08-16 05:35 · Score: 1

No, because I've calibrated "best" to be measured by Google's popularity. As I pointed out in my post, the key is that Google's model of the Web inherently includes its consumers. And therefore its popularity due to its modeling of page popularity is no syllogism, but rather self-reinforcing resonance. A posteriori, not a priori. The real thing.

--
--
make install -not war
Re:What would you want them to return? by Anonymous Coward · 2005-08-27 18:10 · Score: 0

Can't figure how it's flamebait though.
FWIW, I just meta-modded it as unfair.

Flawed conclusion? by Prong_Thunder · 2005-08-15 06:16 · Score: 5, Insightful

Sorry, but if Google consistently returns more results, it could just as easily mean that the filtering isn't as good.

I still prefer Google though.

Re:Flawed conclusion? by Ossifer · 2005-08-15 06:24 · Score: 5, Insightful

Exactly! I find the conclusions of the research to be quite specious. Yahoo may simply have tighter controls of what is considered a match, which, by the way, is no simple algorithm.

In any case, I am usually not so interested in the numbers of matches, but in the quality of the list returned--hopefully one website will have exactly what I need...
Re:Flawed conclusion? by Lewisham · 2005-08-15 06:28 · Score: 1, Insightful

Agreed, whoever conducted this "research" is pretty idiotic. The pages returned != pages available.

This isn't worthy of the NCSA, or indeed any university, to be shown in any public format with any conclusions at *all*. You'd be laughed out of the conference hall if you presented this.
Re:Flawed conclusion? by OpenYourEyes · 2005-08-15 06:40 · Score: 1

An interesting solution to this problem could be to extend the test. For those pages that turn up in one engine's results and not the other's, query the other engine for that exact page to see if it has it.
Re:Flawed conclusion? by Monkeyman334 · 2005-08-15 06:42 · Score: 1

Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine.

Read that and tell me where they conclude that Google returns better results. You people need to actually read the conclusion, kthnx.
Re:Flawed conclusion? by daclink · 2005-08-15 06:49 · Score: 1

It could equally mean that google has better query expansion mechanisms. If google manages to return more documents containing synonymous terms then its relevant set would be bigger.

DaClink
Re:Flawed conclusion? by Ossifer · 2005-08-15 06:51 · Score: 1

Yes, a better test would be to find 10,012 random web pages somehow (port sniffing?) and then try to query for those pages...
Re:Flawed conclusion? by jhoger · 2005-08-15 07:05 · Score: 1

I think they could fix this problem by discarding result URLs which do not actually have the searched for term.

They aren't trying to infer how high quality the set of results is, just the relative proportion of sites indexed by either engine, so I think this would be a good solution.

-- John.
Re:Flawed conclusion? by barawn · 2005-08-15 07:17 · Score: 2, Insightful

Or it could mean that Google has more Ispell lists in its index.

Which appears to be the case.

A search for "inabilities hydrocephalic" returns almost all dictionary lists in Google, except 2. There's only 2 results in Yahoo, one of which is a dictionary list (or equivalent).

But the official results for this? 16 for Google, 2 for Yahoo.

The reason this is a problem is because almost every search returns the same dictionary lists, so it amounts to double (or probably around 5000-fold) weighting of those sites in the results.

Without excluding results that are just dictionary lists (which is quite hard from a simple analysis like this) you heftily bias your results to mimic the "Number of Google dictionary list sites/Number of Yahoo dictionary list sites" ratio.

They probably should've only included sites that returned between 100 and 1000 results, but I'd bet that would take a ton more time, as it looks like almost all of the results they used were the "10-50" result range.
Re:Flawed conclusion? by Enigma_Man · 2005-08-15 07:38 · Score: 1

kthnx, re-read what the parent poster was talking about before you go trying to be all high-and-mighty. They were discussing that # of results returned was in NO WAY a good estimate for DB size, because fewer results can (and should) be a result of a better algorithm. The parent was NOT saying that they conclude that Google returns better results. Jeez, I wish it hurt to be stupid, or wish there was a way to easily hurt stupid people. What I need is to wield a cluestick.

-Jesse

--
Nothing says "unprofessional job" like wrinkles in your duct tape.
Re:Flawed conclusion? by Michael+Woodhams · 2005-08-15 08:42 · Score: 1

I agree.

I would have added an extra step:
(1) In one engine, find a search that generates fewer* than 1000 hits.
(2) Select one** of those pages at random.
(3) Search for that page in the other engine.
(4) Repeat, starting with the other engine.

Then at the end we have a count of how many pages each engine has which the other does not.

* not "less", as consistently used in the article
** ideally, for independence of sampling, we'd choose just one, but for efficiency we might choose more.
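
The steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: the two engines are stood in for by made-up in-memory dictionaries (query -> set of URLs), "searching for that page in the other engine" is reduced to a set-membership check, and the function name `pages_missing_from_other` is hypothetical, not anything from the study's Perl script.

```python
import random

def pages_missing_from_other(source_results, other_index, rng, cap=1000):
    """Steps (1)-(3): sample one page per sub-cap query from one engine
    and count how many of those samples the other engine does not have."""
    sampled = 0
    missing = 0
    for query in sorted(source_results):
        hits = source_results[query]
        if not hits or len(hits) >= cap:   # step (1): only queries with fewer than `cap` hits
            continue
        page = rng.choice(sorted(hits))    # step (2): select one page at random
        sampled += 1
        if page not in other_index:        # step (3): look that page up in the other engine
            missing += 1
    return sampled, missing

# Toy, made-up data standing in for real search results.
engine_a = {"foo bar": {"u1", "u2", "u3"}, "baz qux": {"u4"}}
engine_b = {"foo bar": {"u1"}, "baz qux": {"u4", "u5"}}
index_a = set().union(*engine_a.values())
index_b = set().union(*engine_b.values())

rng = random.Random(0)
a_sampled, a_only = pages_missing_from_other(engine_a, index_b, rng)
b_sampled, b_only = pages_missing_from_other(engine_b, index_a, rng)  # step (4): swap roles
print("A sampled", a_sampled, "not in B:", a_only)
print("B sampled", b_sampled, "not in A:", b_only)
```

With real engines the membership check would be a URL-specific query, and the two "not in the other" counts are what would be compared at the end.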

--
Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
Re:Flawed conclusion? by Monkeyman334 · 2005-08-15 08:54 · Score: 1

Article: Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine.

Parent: Yahoo may simply have tighter controls of what is considered a match, which, by the way, is no simple algorithm.

Read the article and tell me where they conclude anything other than the fact that google returns more results. You people need to actually read the conclusion, kthnx. Yes, they hint that Yahoo might be lying, but they never conclude it.
Re:Flawed conclusion? by Maniacal · 2005-08-15 10:18 · Score: 1

But I don't want my search results filtered. Unless you're talking about filtering out link scams and others that are trying to achieve higher page rankings. I want all the results returned. Filtering would suggest that they somehow know what I'm searching for. They may think they know, that's what rankings are for.

So, if it is a matter of filtering then Yahoo loses there as well. Filtering - Bad, Ranking - Good

Mike

--
MG
Re:Flawed conclusion? by Prong_Thunder · 2005-08-15 10:40 · Score: 1

I think perhaps you misunderstand what I mean by filtering.
When you enter a search term, a (good) search engine will filter out anything that doesn't at least partially satisfy that term.

The very best search engine would find you a (the?) page with exactly what you wanted. This isn't really possible (for now) so, as you say, that's what rankings are for.
Re:Flawed conclusion? by Stalus · 2005-08-15 13:56 · Score: 1

It's hard to say exactly what you mean by filtering in that sentence, but I would guess it's more a problem of expansion than of contraction.

Google will often return pages that had the query word in or around a link on another page that points to it. The query may not resemble anything in the document in any way, shape, or form, but since someone else referred to the document that way, it's returned. There are lots of very good and valid reasons to do this. A simple way to correct for this behavior would be to check how many of the returned results actually have your query in them.... and maybe use a decent word stemmer to be more reasonable.
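
A minimal sketch of that correction, assuming you already have the text of each returned page in hand (the URLs and page texts below are invented, and there is no stemming here, just exact word matching):

```python
import re

def contains_all_terms(text, query):
    """True if every query word appears as a whole word in the page text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return all(term in words for term in query.lower().split())

def count_literal_matches(results, query):
    """Count returned pages that actually contain the query terms."""
    return sum(1 for text in results.values() if contains_all_terms(text, query))

# Made-up result set: URL -> page text.
results = {
    "http://example.com/a": "A list of inabilities, some hydrocephalic in origin.",
    "http://example.com/b": "Linked to with this phrase, but the page never uses it.",
    "http://example.com/c": "Hydrocephalic patients and their inabilities.",
}

literal = count_literal_matches(results, "inabilities hydrocephalic")
print(literal, "of", len(results), "results contain the query")
```

Comparing the filtered counts instead of the raw counts would take anchor-text matches out of the picture; a stemmer could replace the exact-word check to be less strict.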

I was kind of surprised that this had UIUC's name attached to it until I looked at the PhD listed. "Professor of History and Sociology". "Burton's research and teaching interests include the American South...". Okay.. so maybe his statistics aren't as strong as they should be. Or maybe they are and he just doesn't know enough about the implementation issues to do a proper statistical analysis. Perhaps he should talk to some of the AI and Info Retrieval guys on campus before he pops something like this up on the web though.
Re:Flawed conclusion? by RedWizzard · 2005-08-15 14:16 · Score: 1

The pages returned != pages available.
How do you come to that conclusion? From a user's point of view the set of pages returned must be the set of pages that are available. If a page is not returned it is not available, obviously. If a page is returned it clearly is available.
Re:Flawed conclusion? by Lewisham · 2005-08-15 18:59 · Score: 1

Because of the algorithm that filters for results that are relevant? Perhaps Yahoo! only returns results with all search terms, and Google starts returning results with only some search terms.

Seriously, there are so many ways, I don't even know why you asked this.
Re:Flawed conclusion? by RedWizzard · 2005-08-16 08:40 · Score: 1

By "pages available" do you mean "page containing the terms on the net"? I assumed you meant "pages available from the search engine in question".

results? by dotpavan · 2005-08-15 06:16 · Score: 1

quoting:"Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results. It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Googles index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google. "

in short: truth is that size does not matter.. the hype behind bigger the better is *false*, just like its for penises :)

Re:results? by DirtyHerring · 2005-08-15 06:38 · Score: 1

in short: truth is that size does not matter.. the hype behind bigger the better is *false*, just like its for penises :)

Yet another guy with a small penis.
Re:results? by Anonymous Coward · 2005-08-15 07:12 · Score: 0

Have fun with that needledick of yours. God knows nobody else will.

The results by Swamii · 2005-08-15 06:17 · Score: 4, Interesting

For those that don't want to read the flippin' article:

Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.

In other words, they believe Google indexes more items based on their own tests of searching.
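
For anyone who wants to sanity-check the figures quoted from the article in this thread, here is a quick sketch. The counts (307, 9,676, 29) and the "166.9% more" figure are the ones quoted from the article; the assumption that "166.9% more" and "37.4% of" describe the same ratio is mine, and the article may have averaged them differently.

```python
# Counts quoted from the article: cases where each engine returned more results.
yahoo_more, google_more, ties = 307, 9676, 29
total = yahoo_more + google_more + ties

shares = {
    "yahoo_more": round(100 * yahoo_more / total, 1),
    "google_more": round(100 * google_more / total, 1),
    "ties": round(100 * ties / total, 1),
}

# If Google returns 166.9% more results than Yahoo!, then Yahoo! returns
# 1 / (1 + 1.669) of Google's count (assumes both figures describe one ratio).
implied_yahoo_share = 100 / (1 + 166.9 / 100)

print(total, shares)
print(implied_yahoo_share)
```

The counts do add up to the 10,012 test cases, and the implied share lands between 37.4% and 37.5%, so the two headline numbers are at least roughly consistent with each other.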

--
Tech, life, family, faith: Give me a visit

Re:The results by mi · 2005-08-15 06:26 · Score: 2, Insightful

Based on this random sample, we found that on average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.
Informative. But do they also explain, why this (Google's results) is a good thing? From my experience, Google's results beyond the second page are never useful, so they may as well not be there at all.
I don't see how NCSA's findings can prove or disprove Yahoo's earlier claims.

--
In Soviet Washington the swamp drains you.
Re:The results by Anonymous Coward · 2005-08-15 19:26 · Score: 0

Well what did you expect? Why do you think it was posted by a Google employee?

I personally prefer Yahoo! after this index update. I'm happy with the search results Yahoo has been giving me.

English Language by morcheeba · 2005-08-15 06:18 · Score: 3, Insightful

They only used words from the English Ispell word list. Besides the English-language bias, this is probably limited in other ways. News websites use a limited vocabulary, but a lot of proper names -- so if one engine indexed these better, they wouldn't necessarily get a better rating. News sites are also very dynamic and have a large number of webpages, so they would be influential in the count.

--
HIV Crosses Species Barrier... into Muppets

Re:English Language by Anonymous Coward · 2005-08-15 06:38 · Score: 0

... If one of them indexed a site with proper nouns better it would skew the results. Wow, you really had to stretch for that nugget.
Re:English Language by Anonymous Coward · 2005-08-15 06:49 · Score: 0

yeah, but he provided a theory as to why this might be significant.

You, on the other hand, just did a half-assed summary of the comment. It must have taken a lot of effort on your part - I commend you.
Re:English Language by Winkhorst · 2005-08-15 07:11 · Score: 1

The exact idea occurred to me and I tested it with my own name. The estimated results gave Yahoo a huge lead, but when I actually checked out the number of the last entry, they were about even. Testing other names I found that Google had a lead varying from small to x2. So I wonder if Yahoo isn't using some peculiar algorithm to estimate how many pages they have and that algorithm is way off.

--
"Is this Winkhorst a nova criminal?" "No just a technical sergeant wanted for interrogation."
Re:English Language by Krach42 · 2005-08-15 07:19 · Score: 1

You ignore the problem of transliteration.

"George Bush" transliterated to Arabic characters, or Hebrew or Cyrillic characters will not return the same results.

Let's drop into even just Latin searches. English and German transliterate Russian names differently. During the Ukraine election debacle, I was keeping track of it through spiegel.de. This caused me to not be able to readily recognize the relevant names in US news, because they were spelt differently.

--

I am unamerican, and proud of it!
Re:English Language by ryanov · 2005-08-15 07:52 · Score: 1

I searched myself. I got 797 on Google and 341 on Yahoo! FWIW.

Yahoo returns dupes... by Marnhinn · 2005-08-15 06:18 · Score: 3, Insightful

Yahoo returns a lot of dupes.

They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.

All they can really show is that google returns more unique results per 1000 (which usually means that more items are indexed, but could be from Google's Pagerank also)...

--
There is always a frontier where there is an open and willing mind

Re:Yahoo returns dupes... by Anonymous Coward · 2005-08-15 06:23 · Score: 5, Funny

Yahoo returns a lot of dupes.

If that's the case, then why is Google the darling of slashdot? ;)
Re:Yahoo returns dupes... by NickFortune · 2005-08-15 07:10 · Score: 2, Insightful

Yahoo returns a lot of dupes.
Interestingly however, for the search results analysed, google performed noticeably better whether dupes were included or discarded.
They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.
That isn't actually what they did. They only analysed searches that returned fewer than 1000 results on both google and yahoo. If either engine scored over that, the results were discarded.
So, for every search analysed, the full results from each engine were always considered.
All they can really show is that google returns more unique results per 1000
Errm, nope. You could make a case for the study only showing that google performs better where information is scarce - but that's exactly when you want a good search engine, so I'm not too worried. There's a limit to how many Britney Spears links I can find a use for.
(which usually means that more items are indexed, but could be from Google's Pagerank also)
Well, the researchers provide links to the perl script and the dictionary used and also a log of the search results. If you think they're skewing the results, or just that they've made some logical errors in the study, you have all the materials you need to make a detailed refutation, or to repeat their experiment and release your own findings.
And if you really believe the study is flawed then I encourage you to do so.

--
Don't let THEM immanentize the Eschaton!
Re:Yahoo returns dupes... by AaronLawrence · 2005-08-15 07:26 · Score: 1

RTFA. They have a sensible methodology to avoid this problem.

--
For every expert, there is an equal and opposite expert. - Arthur C. Clarke
Re:Yahoo returns dupes... by Fareq · 2005-08-15 09:18 · Score: 1

the trouble is, I could make a search engine that returned 999 results no matter what you searched for...

the thing is, this study does not (and really can not) measure the more important thing -- what percentage of results are actually *relevant* to the material searched for.

Could be that Google's "smaller" index is searched by a less picky search tool that gives more results because it doesn't successfully eliminate as many useless pages.
Re:Yahoo returns dupes... by NickFortune · 2005-08-15 09:42 · Score: 2, Insightful

Could be that Google's "smaller" index is searched by a less picky search tool that gives more results because it doesn't successfully eliminate as many useless pages.
Could be. And it could be that Google's results are both more numerous and of better quality. The tests did not, as you quite rightly point out, consider the relevance of the results. As is proper, the researchers make no claims regarding relevance.
On the other hand, their findings do cast doubt upon Yahoo's claims regarding index size.

--
Don't let THEM immanentize the Eschaton!
Re:Yahoo returns dupes... by monkeydo · 2005-08-15 10:00 · Score: 1

On the other hand, their findings do cast doubt upon Yahoo's claims regarding index size.

These findings don't do anything of the sort. In fact, Google could have only 999 pages in index, and if it returned all 999 for every query it would have won this test. There's too many assumptions here for the results to be useful.
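
To make that concrete, here is a toy sketch with two invented engines: one with a 5,000-page index that only returns pages containing every query term, and one with a 999-page index that returns everything for every query. All names, index contents and sizes are made up for illustration; nothing here models how either real engine works.

```python
def picky_engine(query, index):
    """Return only pages whose text contains every query term."""
    terms = query.lower().split()
    return [url for url, text in index.items()
            if all(t in text.split() for t in terms)]

def dump_everything_engine(query, index):
    """Ignore the query and return the whole (tiny) index."""
    return list(index)

# Made-up indexes: a larger one with real matching, a 999-page one that dumps everything.
big_index = {f"http://big.example/{i}": f"word{i % 50} term{i % 70}" for i in range(5000)}
tiny_index = {f"http://tiny.example/{i}": "filler" for i in range(999)}

queries = [f"word{k} term{k}" for k in range(10)]
tiny_wins = sum(
    len(dump_everything_engine(q, tiny_index)) > len(picky_engine(q, big_index))
    for q in queries
)
print(tiny_wins, "of", len(queries), "queries 'won' by the 999-page engine")
```

Both engines stay under the 1,000-result cutoff, so none of these queries would be discarded, and the smaller index wins every comparison on result count alone.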

--
Si vis pacem, para bellum
The only thing more annoying than a Libertarian is an (un|mis)informed Libertarian
Re:Yahoo returns dupes... by alpha_foobar · 2005-08-15 10:48 · Score: 1

I think that the study is flawed. The study only analyses results to queries that return less than 1000 results for both engines.

But the query itself only uses two words, hence it seems likely to me that a very small percentage of the query sample is actually useable and therefore the results of the study can not accurately be correlated to represent the entire engine index.

It seems to me that the only way to check the index size is to allow all random queries and use the number of results returned from the engine.

Obviously this assumes that the engine does not falsify the size of the result set. But all testing assumes that the results actually relate to what you are looking for.
Re:Yahoo returns dupes... by NickFortune · 2005-08-15 10:49 · Score: 3, Insightful

On the other hand, their findings do cast doubt upon Yahoo's claims regarding index size.
These findings don't do anything of the sort. In fact, Google could have only 999 pages in index, and if it returned all 999 for every query it would have won this test. There's too many assumptions here for the results to be useful.

'Scuse me: I said "cast doubt upon" not "conclusively disproved".
If Yahoo's indices are, as they claim, more than twice the size of Google's, then we might reasonably expect them to return more hits for an arbitrary query. That they do not do so suggests that Yahoo may well be telling fibs.
Yes, there are other explanations, like for example, Google deliberately falsifying all sub 1000 hit queries, as you point out. However, one likely, arguably the most likely explanation is that Yahoo is being a bit sparing with the truth in its press releases.
Hence "cast doubt upon".

--
Don't let THEM immanentize the Eschaton!
Re:Yahoo returns dupes... by Fareq · 2005-08-15 10:56 · Score: 1

Oh, it's very possible that you are right.

In fact, given my personal experiences with Yahoo's search, I have considerable anecdotal evidence that says the same thing.

However, if I made Ye Ultimate Search Engine that had 100 billion pages in the index, but was also such an extremely awesome search engine that it never, ever, ever showed a page that wasn't highly relevant to the topic being searched for, it is likely that I would show, on average, fewer pages than either Yahoo or Google.

Say Google has 8 billion pages, and the average page has 4500 keywords that return it.

Suppose that yahoo has 16 billion pages, but each has, on average, only 400 keywords that return that page as a result.

Google would win this competition even though Yahoo had the larger database.
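
Working the numbers through, as a rough sketch: if both engines are queried with the same random keywords from the same vocabulary (my assumption, along with treating every keyword as equally likely), the expected hits per query scale with pages times keywords per page.

```python
# Hypothetical figures from the example above, not real index sizes.
google_pages, google_keywords_per_page = 8_000_000_000, 4500
yahoo_pages, yahoo_keywords_per_page = 16_000_000_000, 400

# Total (page, keyword) pairs that can produce a hit in each engine.
google_pairs = google_pages * google_keywords_per_page
yahoo_pairs = yahoo_pages * yahoo_keywords_per_page

# With a shared vocabulary and uniformly random keywords, expected hits per
# query are proportional to these totals, so their ratio is the hit ratio.
hit_ratio = google_pairs / yahoo_pairs

print(google_pairs, yahoo_pairs, hit_ratio)
```

So in this made-up scenario the engine with half the index would return about 5.6 times as many hits per query.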

My point wasn't that I think Yahoo is better than, or even as good as, Google -- just that I don't find the experimental results from this study to be meaningful in the debate.
Re:Yahoo returns dupes... by Anonymous Coward · 2005-08-15 11:02 · Score: 0

Search for "blogorank" both in Google and Yahoo.

Google today doesn't know it exists.

Yahoo returns over 15 results (where the word DOES appear).

You do the math.
Re:Yahoo returns dupes... by NickFortune · 2005-08-15 11:23 · Score: 1

I think that the study is flawed.
It's far from conclusive, certainly. All the same, the results are not what we might expect from Yahoo's claims about index size. That's the only value of the report, really, to call into question Yahoo's claims.
But the query itself only uses two words, hence it seems likely to me that a very small percentage of the query sample is actually useable
Usable for what purpose? They are random combinations. Both engines get the same words. It seems reasonable to expect hits to be in proportion to index size.
the results of the study can not accurately be correlated to represent the entire engine index.
mmm... but accuracy is a relative concept. How accurate do we need to be in this case? Yahoo claims larger indices by a factor of around 2.5, and yet search results seem to be two thirds smaller.
We have an order of magnitude difference. That seems more than enough to justify questioning Yahoo's claims.

--
Don't let THEM immanentize the Eschaton!
Re:Yahoo returns dupes... by NickFortune · 2005-08-15 11:31 · Score: 1

My point wasn't that I think Yahoo is better than, or even as good as, Google -- just that I don't find the experimental results from this study to be meaningful in the debate.
Fair enough. I'll cheerfully concede that this says nothing about which is the better engine. I just have a hard time squaring the results with Yahoo's claims re: index size.
It's all a bit beside the point really. My choice of search engine owes nothing to reported index size. I just don't like being lied to, and there seems a real possibility that Yahoo lied about those indices.

--
Don't let THEM immanentize the Eschaton!
Re:Yahoo returns dupes... by monkeydo · 2005-08-15 11:50 · Score: 1

Quite correct. And the mode you describe is exacerbated if one searches for random combinations of words. For many of the random queries, there are zero useful results. Maybe Yahoo! just doesn't index word lists.

--
Si vis pacem, para bellum
The only thing more annoying than a Libertarian is an (un|mis)informed Libertarian
Re:Yahoo returns dupes... by alpha_foobar · 2005-08-15 12:40 · Score: 1

Agreed, you can question the claims that yahoo has double the indexing of google. If you assume that the returned result set returned from any two keywords is always linear.

Statistically it seems like this is hardly ever the case. One would expect some sort of bell shaped curve is more likely.

It is my experience, that very few queries return less than 1000 results, especially when 2 word combinations are used. It is also my experience that Yahoo consistently returns larger result sets than google. However I suspect that google returns more specific results.
Re:Yahoo returns dupes... by NickFortune · 2005-08-15 13:06 · Score: 1

Agreed, you can question the claims that yahoo has double the indexing of google. If you assume that the returned result set returned from any two keywords is always linear.
At the risk of seeming thick, linear with respect to what? What are the axes on this curve?
I expect the Y-Axis is going to be number of hits. What's the X-Axis? number of pages indexed? I just can't visualise it...
In any case, I'll grant that one question raised is most certainly "how useful is index size as a metric for search engine usefulness?"

--
Don't let THEM immanentize the Eschaton!
Re:Yahoo returns dupes... by aoeuid · 2005-08-15 14:06 · Score: 1

google.ca returns 11 results (2 before dupes), and ca.yahoo.com returns 5.
Re:Yahoo returns dupes... by alpha_foobar · 2005-08-15 14:31 · Score: 1

well in proving that i am thick, and in doing so hopefully becoming not quite so thick.. i was imagining x being the number of hits and y being the number of queries returning this number of hits...

so i'd presume very few queries return 0-2 results, and likewise very few return more than 100 million results... though lots would return around 500,000 results.
Re:Yahoo returns dupes... by sam1am · 2005-08-15 14:58 · Score: 2, Insightful

As soon as I read this, I had the following thought...
One interesting statistic would be the number of searches for which Google had over 1,000 results, compared to the number of searches for which Yahoo had more than 1,000 results.

If Yahoo caused 80% of the "over-popular result" discards, well, I'd say that would be highly relevant.
But then I read footnote [3]:
[3] In a small number of cases, one search engine (almost always Google) will return results over 1,000 while the other search engine will not. Although we discard this data, we recognize that the data is meaningful and we hope to refine our code to take this into account. However, since the frequency this occurs is small (and almost always favoring Google) we do not feel it changes our findings.
I'd still like the statistics, but this resolves one of my concerns with the methodology.
Re:Yahoo returns dupes... by NickFortune · 2005-08-15 19:02 · Score: 1

well in proving that i am thick, and in doing so hopefully becoming not quite so thick.. i was imagining x being the number of hits and y being the number of queries returning this number of hits...
That seems reasonable to me. I think I'd still expect the amplitude of the curve to be proportionate to the number of pages indexed, though.

--
Don't let THEM immanentize the Eschaton!
Re:Yahoo returns dupes... by wanion · 2005-08-16 11:09 · Score: 1

From google.com:
Results 21 - 28 of 28 for blogorank. (0.05 seconds)

Admittedly, there's only 14 results with dupes off.

Hrmm by T3kno · 2005-08-15 06:18 · Score: 2, Interesting

Why wget instead of LWP?

--
(B) + (D) + (B) + (D) = (K) + (&)

Re:Hrmm by glwtta · 2005-08-15 06:36 · Score: 1

Why not? Often wget is faster to set up, since it already has a whole lot of functionality rolled in that you'd have to do by hand with LWP.

--
sic transit gloria mundi
Re:Hrmm by molarmass192 · 2005-08-15 06:54 · Score: 1

I don't know if this post is serious but it's probably because wget works standalone and provides a heck of a lot of functionality out of the box without coding anything. I'm not a big perl fan but I do think cpan is one impressive collection of work, I wish other prog langs would follow that example.

--

Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws-Plato

Queries with 1,000 results by Whafro · 2005-08-15 06:19 · Score: 3, Interesting

TFA notes that queries with greater than 1,000 results were dropped from the survey, because Google and Yahoo both truncate their results to 1,000.

That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.

Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.

This is what matters... by Rolan · 2005-08-15 06:20 · Score: 1

This boils down to the real numbers that matter. It doesn't really matter if your index is "bigger" or not, it is about the results that are returned. The other thing that matters (and can't really be measured in a scientific manner) is relevance. It's easy to return results for a set of words, it is hard to return relevant results for a set of words. My personal experience is that Google returns more relevant and better ordered results than Yahoo!.

--
- AMW

Re:This is what matters... by FragHARD · 2005-08-15 10:00 · Score: 1

> you're certainly right there I have never seen so many relevant results... but wait when I start clicking on links I seem to get forwarded to just a few different sites??? yet they all appear different if the results listing *mE thInks HOW StrAnGE* could it be a conspiracy or is this just another example of how google makes more$$$.

FragHARD or don't frag at all

--
FragHARD or don't frag at all

The ultimate test by kevin_conaway · 2005-08-15 06:21 · Score: 1

To me, the test is googling myself and seeing what comes back. Google seems to favor mailing lists high in its results so all the stupid things I've said over the years are right up there on front. Of course, I think Google is more accurate because things actually attributed to me show up higher in the results, but is that actually correct? I don't know.

Re:The ultimate test by Vegeta99 · 2005-08-15 06:33 · Score: 1

Ha! Yeah. According to Google, anyway, Plug N' Play is Satan, and I really despised MP3 players (in favor of MD players).
Re:The ultimate test by jandrese · 2005-08-15 06:54 · Score: 1

If it's any consolation, I hated those early MP3 players too. I mean what's not to like about 8MB of fixed non-upgradable storage on your music player? Especially when 2.5MB of that is taken up by the OS.

On the other hand I've always hated MD players. Closed proprietary formats suck.

--

I read the internet for the articles.

Good article by eth00 · 2005-08-15 06:22 · Score: 0, Flamebait

The researchers in this article took as close to a scientific method as one can get for something like this. This just tells us exactly what has been known for a while: Yahoo just plain sucks at giving good results.

Re:Good article by darius779 · 2005-08-15 06:24 · Score: 1

The article gives no information as to the quality of the results, just the number of the results given..
Re:Good article by amliebsch · 2005-08-15 06:33 · Score: 1

ERROR: LOGIC FAILURE
Returning fewer pages does not necessarily mean poorer search results - after all, a good search will present the maximum number of relevant pages, but no others. Google only wins if all of the extra results it shows are actually relevant. By the methods of this test and your analysis, I could write a search engine that returns its entire index as the result set for every search, and it would be the best websearch ever! Billions of results on every search!
I would like to see an objective qualitative assessment.
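To make the thought experiment concrete, here is a minimal sketch, assuming a toy index and a hand-labelled set of relevant pages. The data and the names `INDEX`, `RELEVANT` and `precision_recall` are made up for illustration and are not from the NCSA study. It shows that an engine dumping its whole index wins on result count while scoring terribly on precision.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a collection of returned page ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy data: 1,000 indexed pages, 5 of them relevant to the query.
INDEX = set(range(1000))
RELEVANT = {3, 14, 15, 92, 653}

# "Return the entire index" engine: the most results, but almost all of them junk.
dump_everything = precision_recall(INDEX, RELEVANT)

# Selective engine: only 6 results, 4 of which are relevant.
selective = precision_recall({3, 14, 15, 92, 700, 701}, RELEVANT)

print("dump everything:", dump_everything)
print("selective:      ", selective)
```

Counting results alone would rank the first engine higher, which is the point: result count says nothing about relevance.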

--
If you don't know where you are going, you will wind up somewhere else.

Quality of Results by Anonymous Coward · 2005-08-15 06:22 · Score: 0

The big flaw in this test, IMO, is that it assumes quantity of results is as good as quality of results. I couldn't care less if a search results in 10,000 hits or 100,000 hits. All I really care about is did it return the 1 or 2 hits that actually have the information I'm looking for and are they high up in the results?

"Number of documents indexed" is a worthless pissing match as far as I'm concerned.

Quality not quantity by ngunton · 2005-08-15 06:23 · Score: 1

Surely it's the quality of the results that counts, rather than the quantity? Who needs 1,000,000 matches anyway, when most people don't go past the first page or two of the results? The article doesn't talk at all about how relevant the matches were. I'm not saying that it invalidates their study, but I would say that any search engine that returns millions of hits for any query is simply showing off. Give me a search engine that shows me fewer matches, but the best hits anyday. Lately, Google has increasingly been giving me a bunch of useless links when I search for stuff. For example, looking for reviews on various bits of hardware just gives you a bunch of websites that are selling the products, and *seem* to have reviews, but then you go to the page and it says something like "no reviews have been posted". Lots of ghost towns out there on the web these days. Anyway, the point holds: Give me relevant results and allow me to screen out the marketing junk and link farms. Beyond that I don't really care how many pages they have in the index.

Concede by DrugCheese · 2005-08-15 06:23 · Score: 0, Redundant

In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results.
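For anyone who wants to sanity-check the quoted figures, here is a quick sketch using only the three counts quoted above; the variable names are mine. Note that the two counts do not add up to the total. The leftover cases are presumably ties, but the quoted text does not say so.

```python
TOTAL = 10_012        # test cases, as quoted
YAHOO_MORE = 307      # cases where Yahoo! returned more results
GOOGLE_MORE = 9_676   # cases where Google returned more results

def pct(n, total=TOTAL):
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * n / total, 1)

# Cases not covered by either count (assumption: ties).
unaccounted = TOTAL - YAHOO_MORE - GOOGLE_MORE

print(pct(YAHOO_MORE), pct(GOOGLE_MORE), unaccounted, pct(unaccounted))
```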

I don't understand what would make someone want to compete against Google anymore. Sure, if you've got technology in place like Yahoo, keep it going, but still ...

Google is synonymous with searching the internet.

Google is a verb

--
*DrugCheese rants*

Re:Concede by Anonymous Coward · 2005-08-15 06:31 · Score: 0

Actually no. There are many problems with Google:

* Their image search is out-of-date.
* Google sometimes can't find content you know is on the web and you know is indexed.
* Google's special operators sometimes don't work (like link:). This may be related to the previous point.
* Google will sometimes crash. Unusual, but I think that everyone has seen one of their "core dumps" with the encoded data and a request to file a bug report.
Re:Concede by Anonymous Coward · 2005-08-15 06:31 · Score: 0

Competition is ALWAYS good. It is what will keep Google "honest" in the long run. Now that Google is a public company, EVENTUALLY money will corrupt the company and they won't be the glorious tech-savvy company they once were. While competition won't keep any company "honest", it at least provides an incentive for a company to keep its customers happy.
Re:Concede by Anonymous Coward · 2005-08-15 06:41 · Score: 0

Yahoo! is one too, just didn't catch on. Don't remember the "Do you Yahoo!" of the internet boom days?

Conclusion. Yahoo picked a shitty name.
Re:Concede by Anonymous Coward · 2005-08-15 07:16 · Score: 0

Well, your points are probably all true, but largely irrelevant. How many "average" people know enough to use/notice that? I know for myself personally: I only use image search to find generic pictures, I've never noticed the second, I only very rarely use special operators anyway, and I've never had the last occur, and yet I use it frequently. There's no doubt Google can still improve in various ways, but these "many problems" - who cares, really? It works very well for what I need it for, which can't be said of any of the others I've tried, at least not to the same degree.
Re:Concede by dyefade · 2005-08-15 07:33 · Score: 1

Do you propose that anything hard isn't worth doing? Maybe competing with Google seems ridiculous now, but if a company has the tech and starts small but with high ambitions...

Like everyone else, I like google way too much, but I'd never say to someone, "ah, don't even bother, you'll never compete with google".

Conflict of interest? by Anonymous Coward · 2005-08-15 06:24 · Score: 1, Interesting

It seems to me that when Slashdot publishes an article that is favourable to Google, that was submitted by a Google staff member, one might question whether someone involved has a conflict of interest. It's not astroturfing, because his employment at Google was clearly mentioned. It might be an ad (or more correctly, a press release) masquerading as news. I wonder if the article would have been published had it been submitted anonymously...

Re:Conflict of interest? by 99BottlesOfBeerInMyF · 2005-08-15 06:49 · Score: 1

It seems to me that when Slashdot publishes an article that is favourable to Google, that was submitted by a Google staff member, one might question whether someone involved has a conflict of interest.

It just so happens that a lot of the news about a given company comes to the attention of the people in that company. Should Slashdot not allow submissions from posters that regard products or services they are working on? So long as it is news and affiliations are disclosed what's the problem?

It might be an ad (or more correctly, a press release) masquerading as news.

Were articles about Taco Bell putting a big target in the ocean for the space station to hit ads or press releases? They were certainly helping Taco Bell's business and that was the intention of Taco Bell. But if the article is not written by them, then it is not a press release, just a story that might help their sales.

So here's the thing. This is news a lot of people here are interested in reading. Slashdot editors did not post the Yahoo claim because they wanted to help Yahoo and they did not publish this story to help Google. It is just a story of interest to the readers. Most stories have some spin on them these days. Knowing the authors/editors/publishers is important. Knowing that the person who submitted it to this news site works for Google is being excessively upfront.

I wonder if the article would have been published had it been submitted anonymously

Anonymous submissions are less likely to be published. I bet it would have been published if a random author submitted it though.

But the real question is... by convex_mirror · 2005-08-15 06:24 · Score: 1

Why is the NCSA cowering from comparing Google and Yahoo to infoseek? The wool has been pulled over your eyes people!

My own independent analysis by Anonymous Coward · 2005-08-15 06:24 · Score: 0

No one gives a fuck whether it is 8.16 billion or 20 billion. No matter what, it is 99.9999% useless shit. Is the largest catalog of useless shit really something to aspire to?

Perl Code by hayro · 2005-08-15 06:24 · Score: 4, Funny

I don't know about the study but that is the most readable perl code I have seen in a long time.

Re:Perl Code by Anonymous Coward · 2005-08-15 11:15 · Score: 1, Insightful

I know you are trying to be funny, but that code is actually not very good by any modern standard. Fork and open wget instead of using LWP; lots of printf's instead of using heredocs or templates; unnecessary use of C-style for loops...
Re:Perl Code by Anonymous Coward · 2005-08-15 12:18 · Score: 0

dude as far as i'm concerned he's more than being funny. he is right.

he said "is the most readable perl code I have seen in a long time"

he didn't say it was good code.

i look at perl code day in and day out and as far as i'm concerned the entire language is one misunderstood piece of...

i personally can't read my own perl code 5 days out and i'm an overly experienced c++, c, perl, etc etc (you don't want my resume)... programmer.

so he messed up a fork. good for him, cause i pity the fool who thinks they can do safe multithreaded programming on perl and sleep() at night.
Re:Perl Code by pfafrich · 2005-08-15 12:40 · Score: 2, Insightful
Readable code because:
- Well laid out and indented
- Long and meaningful function and variable names
- Good logical structure, no fancy tricks
- It looks like C!
--
There are four sorts of people in the world: fools, lunatics, idiots and morons. - Umberto Eco, Foucault's Pendulum.
Re:Perl Code by Anonymous Coward · 2005-08-15 19:05 · Score: 1, Funny

Mmm, getppid() is unimplemented on some platforms, so it could've handled that better in the call to srand(), but hey, it's still usable and it looks well-written :)

interesting but inconclusive by it0 · 2005-08-15 06:24 · Score: 1

It's a nice test but I fail to see how they can extrapolate this to be true for all searches.

Don't forget that a lot of queries also get hand-tuned at Google/Yahoo to give the proper result set.

Also keep in mind that size doesn't matter but relevancy does!

And they both cheat at that as well: they just give back the highly ranked pages for those words. Works OK for a lot of people but hardly relevant.

More please! by 2008 · 2005-08-15 06:25 · Score: 5, Interesting

This is a great article! I wish there were more like it on slashdot. It's scientific instead of an opinion piece, it has references, it's repeatable. It's also short and very readable, unlike a lot of science papers.

OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".

--
I quit!

Re:More please! by Anonymous Coward · 2005-08-15 06:40 · Score: 0

........

*bursts out laughing*
Re:More please! by Overzeetop · 2005-08-15 07:00 · Score: 2, Funny

Oh, please don't ask for "more like this". It just gives the editors a reason to think that there is a hardcore contingent of /. readers who crave dupes. I mean, how can they get more "like this" than to simply repost it in a couple of hours?

--
Is it just my observation, or are there way too many stupid people in the world?
Re:More please! by tyler083 · 2005-08-15 07:04 · Score: 1

agreed.

always nice when someone puts a test out with the files they used so we can play with it, too.
Re:More please! by Anonymous Coward · 2005-08-15 07:24 · Score: 0

Why? As pointed out by several posters, this "study" has severe methodological flaws that render the conclusion no better than an opinion piece. What it has going for it: a fake veneer of scientific authenticity and being "short and readable."

There's a lot of good stuff on slashdot, opinion pieces included. This one is no better for its pretense at science.
Re:More please! by mars_rover · 2005-08-15 11:36 · Score: 0

some junior analyst predicts Google will buy Apple and release OSX86box 720
...but Yahoo is 3.6 times more likely to buy Apple first.

Study has poor assumption by Anonymous Coward · 2005-08-15 06:25 · Score: 2, Insightful

The study noted that although Yahoo says they have ~twice as many pages indexed as Google, when they queried each engine with two arbitrary words from the dictionary, they got fewer responses from Yahoo.
From this they concluded Yahoo's claim of twice as many pages is suspicious.

What's suspicious is that these people consider themselves scientific. What if, for example, Yahoo just returns meaningful results, whereas Google returns anything with those words in it? For example, what if you search for "faience" and "urbanity"? Maybe Google has more results, but maybe they are less pertinent. In other words, maybe Yahoo not only has more pages indexed, but also has an algorithm that returns only the most relevant stuff.

Not saying that's the case necessarily, but not mentioning that assumption makes for a worthless study/conclusion. (Also, if Google says they return x results, often when you go to the last page of their results listing you'll notice their total went down, and it's more like x - 10%.)

-Josh

Interesting but... by kf6auf · 2005-08-15 06:25 · Score: 2, Insightful

While it is true that more results could mean worse filtering, that is a separate test entirely.

I tend to think that ordering is more important than filtering down to a small number of results, since having lots of results returned doesn't hurt if the search engine can order well so that what you want is most likely to be in the top 10-25. This is especially true when there will be at most a couple of results where I'd rather have the search engine try at the ordering and have me do most of the filtering because no search engine is as good as a person at really figuring out what people want, yet.

Re:The difference is by Anonymous Coward · 2005-08-15 06:26 · Score: 0

How many more times are you going to whore your site in your comments? Out of 9 comments you've made on /. 5 of them have included a link to your site in the body of the comment. If it's relevant, fine, if not then stick it in your sig or profile.

Methodology by enjo13 · 2005-08-15 06:26 · Score: 5, Insightful

The very methodology used in this case seems rather incorrect to me.

The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as google, searches should return twice as many entries.

That assumption is flat out incorrect. There are actually multiple problems.

First, the scope of the search (based on index terms) is really up to the search engine itself. Since each search engine does not return the entire database as search results, it is very much up to the individual search algorithm to determine the depth of entries considered to 'match' a set of terms. That's what is really being reflected in these results.. it is not the overall size of the index, but simply how aggressive the search algorithm is in matching terms to entries.

Even if the algorithms were identical (same algorithm being run across both indexes), the nature of search does not scale in that way. If Yahoo has, for instance, become more aggressive in indexing message board and forum content, then only searches that play to those subjects should return more results than Google. Since searches are by definition narrowing on a data set, a methodology needs to be developed that more effectively tests the BREADTH of the results rather than simply testing the depth.

--
Turn s60 photos into awesome videos with mScrapbook for all S60 3rd edition phones!

Re:Methodology by Anonymous Coward · 2005-08-15 08:08 · Score: 0

That's odd, because (roughly) this method is more or less the industry norm(*) for estimating index sizes. Of course, the search engine companies do a lot more to measure effects that can influence the results, but as a rough estimate, the method is indeed scientifically valid.
One way it has been verified is by having search engine companies perform these estimates on their own indices to see if indeed index size increases are measurable.
(*) In the search engine industry.
Re: Methodology by gidds · 2005-08-17 00:08 · Score: 1

The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as google, searches should return twice as many entries.

Hey, I've got a brilliant idea for a new 'search engine'! I'll collect a list of 10,000 sites -- a tiny fraction of what the others have, so it must be easy -- and return every one of them, no matter what people are searching for. According to their methodology, that must mean that my engine is abso-flippin'-lutely fantastic!!!

--
Ceterum censeo subscriptionem esse delendam.

Quality fo Quantity? by imstanny · 2005-08-15 06:26 · Score: 1

I'd be much more interested to see a test of the quality of results. Considering that most of the results that I end up activating are on the first page, quantity of results is less relevant to me in determining a good search engine.

Re:What a surprise by Eric604 · 2005-08-15 06:27 · Score: 0, Troll

OK, mod me flamebait now.

I'll take some of the heat off you. Let's burn some karma. Here we go. MODERATORS ARE STUPID FUCKERS.

International Listings by Dominatus · 2005-08-15 06:27 · Score: 4, Insightful

The study only checked English words. Is it possible that the increase came from Yahoo expanding into more international website markets?

Just a thought

Re:International Listings by Dusabre · 2005-08-15 21:24 · Score: 1

And google doesn't index non-English sites?

Think about it. English is the best representative language but they could have used Turkish, Polish or Spanish, any language with a large enough web-presence.

Note to moderators, this isn't insightful.
Re:International Listings by Dominatus · 2005-08-17 06:17 · Score: 1

You didn't read what I said:

If Yahoo's additions were primarily non-English they wouldn't be reflected in an English test. If Yahoo added 10 billion non-English sites, then this test wouldn't have picked up any of it.
Re:International Listings by Qazimov · 2005-08-22 05:33 · Score: 1

the interweb is written in english. Thank you, that is all.

can we trust the methodology by GabrielF · 2005-08-15 06:27 · Score: 1

Basically NCSA's method assumes that if a search engine indexes twice the number of pages, then it will return twice the number of results for a given search. However, in order for this to be the case, the 10 billion+ more pages that Yahoo indexes would have to be roughly equivalent to the pages that Google indexes. If Yahoo is indexing 20 billion pages, but ten billion of those are in Mandarin, then searching for random combinations of English words (which NCSA is doing) won't tell us which search engine indexes more pages. In order to trust NCSA's methodology we would have to know exactly WHAT the billions of pages are that Yahoo knows about but Google does not. Surely the web didn't double in size overnight; Yahoo must be searching somewhere Google doesn't search if their claims mean anything (which they may not).
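A toy sketch of that argument, assuming a trivial inverted index with AND-style matching. The pages and the placeholder non-English tokens are invented for illustration; nothing here reflects how either engine actually works. The second index is twice the size of the first, yet an English two-word query returns the same count from both.

```python
def build_index(pages):
    """Map each word to the set of page ids containing it."""
    index = {}
    for page_id, (_lang, words) in enumerate(pages):
        for word in words:
            index.setdefault(word, set()).add(page_id)
    return index

def count_hits(index, query_words):
    """Number of pages containing every query word (AND semantics)."""
    postings = [index.get(word, set()) for word in query_words]
    return len(set.intersection(*postings)) if postings else 0

# Hypothetical pages: (language, words on the page).
english = [
    ("en", {"faience", "urbanity"}),
    ("en", {"faience"}),
    ("en", {"urbanity", "perl"}),
]
# Placeholder tokens standing in for non-English text.
other = [
    ("zh", {"zh_word_1", "zh_word_2"}),
    ("zh", {"zh_word_1"}),
    ("zh", {"zh_word_2", "zh_word_3"}),
]

small = build_index(english)          # 3 pages
big = build_index(english + other)    # 6 pages, twice the index size

query = ["faience", "urbanity"]
print(count_hits(small, query), count_hits(big, query))
```

So result counts for English word pairs can only speak to the English-language part of each index.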

Re:can we trust the methodology by NoOneInParticular · 2005-08-15 09:09 · Score: 1

Basically NCSA's method assumes that if a search engine indexes twice the number of pages, then it will return twice the number of results for a given search.
Not exactly true. Please note that when using the two search engines for regular (common) queries, Yahoo explicitly claims that it finds twice the matches of Google. Try it. This seems to rule out the Mandarin option.
I don't think NCSA's methodology is conclusive, but I do think the conclusion is true that Yahoo's advertisement on its search front page is suspicious.

Finally! by Cheirdal · 2005-08-15 06:27 · Score: 1

It's good to see that slashdot is FINALLY posting an article about Google.

Not really by Mr.+Underbridge · 2005-08-15 06:27 · Score: 1

In fact, all results that match a query are returned; it's the ranking that matters. Google is also more rigorous about excluding apparent duplicate results, and doesn't count those in the stats.

This is what passes for CS research nowadays? by adrizk · 2005-08-15 06:28 · Score: 5, Insightful

Seriously. 'We wrote a script and here are the results'? This would take an average PERL programmer what -- 30 minutes of work? Has academic research in computing really sunk to this level?

Maybe it's not even worth pointing out how badly flawed (and lazy) the underlying assumption of 'twice the results = twice the index size' probably is, as I'm sure we're going to see a few dozen posts to that effect (unless PageRank really means nothing), but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.

Re:This is what passes for CS research nowadays? by DogDude · 2005-08-15 06:38 · Score: 1

Has academic research in computing really sunk to this level?

Considering that most people call a "fact" something that they found on Wikipedia or via Google, I'd have to say that the answer to your questions is "yes". The Net is a vast source of incorrect, incomplete, and otherwise bad data. There may be a lot of information out there, but the vast majority is wrong. This "cheapening" of information has and probably will lead to more of this crap "research".

--
I don't respond to AC's.
Re:This is what passes for CS research nowadays? by 99BottlesOfBeerInMyF · 2005-08-15 06:59 · Score: 1

but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.

Did you RTFA? Their conclusion is based upon their results which was the best they could do without access to the systems and with limited resources. And what is the conclusion that you complain about the spin on? The conclusion is that yahoo's claim is suspicious. I'd say that is a pretty solid claim. Yahoo's assertions are suspicious and while they could be true, are worth questioning in light of a sampling the results of both engines.

no change for me by rotagivan · 2005-08-15 06:28 · Score: 1

Do you really even have to RTA? My search engine is still the same as before and works fine, no need to change now. Aww, is Yahoo jealous?

Interesting study... by dracken · 2005-08-15 06:28 · Score: 1

...though flawed in many respects. The raw number of pages returned may not indicate the size of indices. Google is famous because it returns *relevant* pages but not necessarily *more* pages. A search engine that returns its entire index with each search isn't all that useful.

Secondly, results for all keywords may not increase with the size of the index. The pages which were indexed might correspond to popular searches (that return more than 1000 results, which were not considered if you RTFA) - so considering only those words that return less than 1000 results is flawed.

Though some competition is good, the "DO YOU WANT MY 20 BILLION BIG INDEX ???!!" claim by Yahoo reminds me of certain Yahoo chat rooms :p

yahoo failed it by Anonymous Coward · 2005-08-15 06:28 · Score: 0

yahoo forgot to index all /. dupes.

methodology by abde · 2005-08-15 06:29 · Score: 1

The assumption seems to be that search results are randomly distributed. But by the very nature of search (a targeted and subjective request for information) that is clearly the wrong model. I don't see why the assumption that a 2x bigger index should return 2x more results would hold for any query with N < 1000 results.

A better test would be to see how much overlap there is between the two engines' results for the same queries. Do the top 50 returns on queries (of any size, not just limited to those with N < 1000 returns) match? To within what percentage?
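One way such an overlap test could be scored is a Jaccard-style overlap of the top-k results. This is a minimal sketch only: the function name and the example URLs are made up, and real result lists would need URL normalization first.

```python
def top_k_overlap(results_a, results_b, k=50):
    """Jaccard overlap of the top-k results from two ranked lists (0.0 to 1.0)."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    union = top_a | top_b
    # Two empty result lists are treated as identical.
    return len(top_a & top_b) / len(union) if union else 1.0

# Hypothetical ranked results for the same query on two engines.
engine_a = ["u1", "u2", "u3", "u4"]
engine_b = ["u2", "u3", "u5", "u6"]

print(top_k_overlap(engine_a, engine_b, k=4))
```

Here two of six distinct URLs are shared, so the overlap is one third. Averaged over many queries, this would say how similar the engines' top results are, independent of the total counts they report.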

--
Don't blame me - I voted for Howard Dean. http://dean2004.blogspot.com

Google parses plurals differently. by WoTG · 2005-08-15 06:30 · Score: 3, Interesting

Google started treating plurals as the same search about a year ago. Yahoo doesn't. So, if you google for "inkjet printers" and "inkjet printer" you will get the same result set; however, on Yahoo, you will get different results.

The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
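A toy illustration of the effect, assuming a crude "strip one trailing s" rule as a stand-in for plural handling. That rule is my simplification, not how Google or Yahoo actually do it, and the pages are invented. The same three pages yield more hits once plurals are folded together.

```python
def normalize(word, fold_plurals):
    """Crude stand-in for plural folding: strip one trailing 's' from longer words."""
    if fold_plurals and word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def count_hits(pages, query, fold_plurals):
    """Count pages that contain every query word, optionally folding plurals."""
    wanted = {normalize(w, fold_plurals) for w in query.split()}
    hits = 0
    for text in pages:
        words = {normalize(w, fold_plurals) for w in text.split()}
        if wanted <= words:
            hits += 1
    return hits

# Hypothetical index of three pages, identical for both "engines".
PAGES = [
    "cheap inkjet printer reviews",
    "inkjet printers on sale",
    "laser printers compared",
]

for query in ("inkjet printer", "inkjet printers"):
    print(query, count_hits(PAGES, query, False), count_hits(PAGES, query, True))
```

Without folding, each query matches one page; with folding, both queries match the same two pages.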

Who cares about... by Ignignokt · 2005-08-15 06:30 · Score: 2, Insightful

the number of results anyways? Who makes it to page 5000 when doing a search?

Questionable methodology by lpangelrob · 2005-08-15 06:30 · Score: 1

I agree that it's hard to determine how many items that exist for a subject XYZ, but I'm not sure this is the way to go about it.

They presumed that for random phrases that return less than 1,000 matches, one can determine between the ratio of matches that Google returns and matches that Yahoo returns, which engine has indexed more documents. This also presumes that the Internet is an infinite source of information about XYZ, and that there is always an indeterminate number of sources that remain unindexed on both engines. I don't think this is the case at all.

Say I write a page about Jabberwocky. I get together with people that write more pages about Jabberwocky, and all of us have on three domains information about Jabberwocky that exists nowhere else, except maybe Wikipedia under the Jabberwocky entry. If both sites index Wikipedia and those three domains (that link to each other), that's 100% coverage... barring horrible algorithms, you can't get less than this, or you get nothing at all.

Also, when you're looking around for such unique information, I have to imagine that it's not representative of other sources in more general searches.

--
-Rob

Biblical fiscal responsibility

More results == better search engine? by RunzWithScissors · 2005-08-15 06:30 · Score: 3, Insightful

So in the conclusion, the author writes that since Google displayed more results, based on their random test data, it was the superior search engine? That seems so wrong somehow...

Wouldn't a better search engine return less, but more appropriate results? I mean, how many of us have found the information we were actually looking for on page ten or twelve of a search. And, isn't less more, but better? %insert Linux geek laughs here%

One would think that volume of results would not a better search engine make, although it may indicate a larger engine index size; an explicit statement to that effect seems to be missing from the NCSA report.

-Runz

Re:More results == better search engine? by Anonymous Coward · 2005-08-15 06:55 · Score: 0

RTFA. They never said Google was better. They said that Yahoo's claims to have twice the index of Google were suspicious.
Re:More results == better search engine? by enosys · 2005-08-15 07:53 · Score: 1

The only web pages that should be excluded from the results are worthless spammy pages that are just used to generate traffic. Other than that returning less results is a bad thing. When searching for something that returns few results having more results obviously helps a lot. It might seem to be a problem when something returns a lot of results but sorting the results according to quality (like Google tries to do with PageRank) is better than just dropping some.
Re:More results == better search engine? by phriedom · 2005-08-15 09:16 · Score: 1

Nowhere did it conclude that anything was better.

The conclusion is that it seems highly unlikely that Yahoo's index is twice the size of Google's index because in 10,012 dictionary word combinations that returned less than 1000 results Google returned more results 96.6% of the time.

--
Don't moderate flamebait as Troll. Know the difference or you will be Meta-moderated.
Re:More results == better search engine? by Talennor · 2005-08-15 09:17 · Score: 1

I'll write another perl script they can check their script against. You enter a search string, and I return 999 links to websites, which means I obviously have an almost infinite database of webpages indexed. And the compression is really amazing too, it all fits into this perl script and CSV of 999 links.

--

//TODO: signature
Re:More results == better search engine? by RunzWithScissors · 2005-08-15 10:51 · Score: 1

I did RTFA. Fine, the concluding statement is that Google consistently provides more results, so Yahoo!'s claim of more indexed pages is suspicious. But my point was one of search engine design. Would not a better search engine, even though it may have more indexed pages to search, actually provide fewer, more appropriate matches? I am just providing an alternative explanation for the data that was collected and analyzed by NCSA. Perhaps Yahoo! does have more indexed pages, but through a different search algorithm generated fewer, but better results.

If we were to apply the same logic used in this analysis to another field, say automobiles, one would conclude that a hand-crafted Shelby Cobra is an inferior auto when compared to a Ford Focus simply because there were fewer of them manufactured.

The beauty of statistics is that they have a variety of possible interpretations. I don't remember who it was that I heard say this, but I believe it was a political lobbyist: "Tell me what you want to prove, and I'll find statistics to back it up!" This statement further illustrates that statistical data, while interesting, is given a positive or negative connotation based on the interpretation of said statistics.

BTW, I understand that 9 out of 10 /. readers think anonymous posters are pussies. Or maybe they thought anonymous posters were pussy cats, damn you statistics!

-Runz

Quality Quantity by hagrin · 2005-08-15 06:31 · Score: 2, Insightful

This is just another example in the age old argument of which is better. IMO, the quality of the search results is what matters more than the sheer quantity of information. One relevant find is more valuable than 100 inaccurate results. A test of accuracy might be more valuable and one that would be difficult to engineer. For instance, if I type in a word that has a direct correlating .com domain, that should be the first result (assuming no other words in the title - i.e. "hagrin" brings me my home page as the first result). I am sure a test of accuracy could be further derived from such logic.

The other side of the argument probably relates back to something my fiancee once told me - "Size doesn't matter, but it's the great equalizer when it comes to two guys not knowing what they are doing". Yahoo!, especially since the researchers couldn't perform queries on topics returning more than 1,000 results, may be indexing and crawling deeper into sites or it has a "double dipping" problem.

Either way, I don't see Yahoo! falsely reporting their numbers - I would tend to think that this "study" is highly flawed due to its exclusion of larger result topics, etc.

--
Hagrin.com

Problems with the research by iceco2 · 2005-08-15 06:31 · Score: 1

The research has several problems:
a. It measured the number of results for a certain query. Even if we assume identical algorithms for checking whether a page matches a query, the two search engines are likely to use different relevancy thresholds.
b. The search pretty much limited itself to the English language.
c. As they admit themselves, they measured only obscure queries. Actually, most of my queries are not obscure at all, and it takes me more than 2 words (which fit together) to chop down the search results group.
d. Finally, the entire research has very little to do with the really interesting question: which search engine is more likely to give me the results I need on the first page?

Me.

Wait, wait, wait by antifoidulus · 2005-08-15 06:31 · Score: 1

What's this? A concise and well written summary with a link directly to the well written article? No twisting/breaking of the truth in order to incite /. groupthink comments? No pointless plugs for unrelated topics? No ADS?!?!

Jesus, the editors keep that up they might actually have a worthwhile site going....never fear, I'm sure the next dupe and/or an article comparing spooning to unmanned space travel will surface before the day's end.

--
Monstar L

WMD flamebait? by mi · 2005-08-15 06:31 · Score: 1

Or off-topic? Or troll?

The NCSA's test neither confirms nor disproves Yahoo's earlier claims. Their lesser average results may just indicate higher quality threshold -- Google's results beyond the second page are never useful either.

I'd say, it is kind'a early to claim "pants down, egg on face"...

--
In Soviet Washington the swamp drains you.

But was the study by Approaching.sanity · 2005-08-15 06:32 · Score: 1

funded by Microsoft?

--
RTFA again for the best results.

Not only does Google do More, it does Better by Ralph+Spoilsport · 2005-08-15 06:32 · Score: 2, Informative

In regards to a similar article last week, I posted my own personal results on what I found when I did a search on Kyzyl, the capital of Tuva.

Google not only gave MORE results, it gave BETTER results. The only bad results were some hairsplitting (if largely well meant) from fellow /.ers... (I mentioned Tuva as a suburb of Mongolia, and while it IS a part of the Russian Federation, it is Much More Mongolian than Russian. And if the rising tide of neoNazi scum in Russia get their way, Tuva could easily be cut adrift into the Mongolian/Chinese orbit...but I digress...)

The essential point is: Which Does the Job Better For Me? Google. Therefore, I use Google. Assuming the Copernican position that I am not atypical, I would therefore extrapolate that this is very true for most other people as well. Which means that Yahoo has a LONG way to go and A LOT more work to do.

RS

--
Shoes for Industry. Shoes for the Dead.

Re:Not only does Google do More, it does Better by emcmanus · 2005-08-15 07:20 · Score: 1

Thank you captain obvious!
Re:Not only does Google do More, it does Better by Krach42 · 2005-08-15 07:26 · Score: 1

Assuming the Copernican position that I am not atypical, I would therefore extrapolate that this is very true for most other people as well.

You're stretching your results too far. I'd stick with "it's true for many other people" rather than most.

Of course, I fall into your same category, so I suppose the number would be 2 * "many other people"...

Please ignore my babblings...........

--

I am unamerican, and proud of it!
Re:Not only does Google do More, it does Better by snarkh · 2005-08-15 09:39 · Score: 1

Assuming the Copernican position that I am not atypical

Copernicus thought you were not atypical? How interesting.

I don't see how it can be accurate by jerryodom · 2005-08-15 06:32 · Score: 1

There is a big difference between the size of an index and the number or quality of search results returned. Yahoo may simply not return as many results, or may limit the number of results returned for speed considerations. Just because their particular test favored Google's system doesn't make it accurate. I'm sure we could sit here and think of hundreds of different reasons or considerations not taken into account.

With each having billions upon billions of documents available and indexing more every day, who really cares?

--
For some reason I refuse to use either spell check or the spacebar properly.

disregarded results by Metex · 2005-08-15 06:33 · Score: 1

Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google. Any search result found to have more than 1,000 returned results on either search engine was disregarded from our sample.

My question is which search engine required them to disregard the most samples. Did Google hit the limit the most, or was it Yahoo?

By the way, I love Google, but I do think Yahoo indexes more pages. It indexes personal pages more than Google does. So when I am searching for items which I know other people would point to, I hit up Google. But if I am searching for something that no one has a reason to link to (home page of your gf), I hit up Yahoo.

--
Never could figure out why my girl liked my bitch tits, then I found out she was a lesbian.

Not Convincing by FreshFunk510 · 2005-08-15 06:35 · Score: 1

Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google.

In order to create a large number of queries that returned less than 1,000 results, we took the commonly available English Ispell Wordlist.. and wrote a PERL script to randomly select two words at a time from that list.

Is it just me or does this study not sound convincing enough? There are too many holes in the way the study was conducted, IMHO. First of all, they restricted themselves to queries that return less than 1000 results? They've already limited the sort of queries they're executing by choosing those that return significantly fewer results than other "popular" queries.

Secondly, they chose random words to create a query. This doesn't give me confidence that these belong to the same space of queries that people execute on average. It would've been great if they had sampled their queries from those that people actually execute instead of just crawling the English dictionary.
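
For reference, the query generation described in the quote (pick two words at random from a wordlist) can be sketched in a few lines. This is a minimal Python sketch, not the NCSA's Perl script: the function name and the tiny inline wordlist are made up for illustration, and the real study used the full Ispell list.

```python
import random

def random_two_word_queries(words, count, seed=None):
    """Return `count` queries, each made of two distinct words drawn at random."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, 2)) for _ in range(count)]

# Hypothetical stand-in for the Ispell wordlist mentioned in the study.
SAMPLE_WORDS = ["carbolization", "clambers", "apprizer", "expense", "hydrocephalic"]

for query in random_two_word_queries(SAMPLE_WORDS, 3, seed=1):
    print(query)
```

Sampling from a log of real user queries would only change where `words` (or whole queries) come from, which is exactly the part of the method being questioned here.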

Nevertheless, bigger is not always better. The reason why Google became such a phenomenon was because of the quality of their search results. Duh.

--

"Injustice anywhere is a threat to justice everywhere." - Martin Luther King, Jr.

Louis Waweru - youngbonzi@earthlink.net by Anonymous Coward · 2005-08-15 06:36 · Score: 0

a picture tells a thousand words

Mcdonalds obviously isn't hiring

MODERATORS, look here!!! by Junior+J.+Junior+III · 2005-08-15 06:37 · Score: 1

Mod parent up.

--
You see? You see? Your stupid minds! Stupid! Stupid!

Offtopic but Adsense needs work by Anonymous Coward · 2005-08-15 06:37 · Score: 0

I was on a page reading about Windows Longhorn and Google showed me ads about cattle in Texas I could buy... With all the 1337 hax0rz and ub3r geeks they have at Google, Inc., can they not fix the "context"?

Re:Offtopic but Adsense needs work by 51mon · 2005-08-15 13:57 · Score: 1

In this case Google got the context, it was just exercising its sarcasm module ;)

On the other hand, possibly there aren't that many "Longhorn" related adverts yet in Adsense. Besides, I assume it is the cattle breeder who chose "longhorn" as a keyword; Google may not know what he is selling. Probably he is selling the same thing as Microsoft, except by the sack load.

Automated querying is Illegal by pooya · 2005-08-15 06:37 · Score: 1

Isn't it illegal that their crawler is running automated queries? From what I see in Google's Terms of Service:

You may not send automated queries of any sort to Google's system without express permission in advance from Google.

Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.

I'm wondering how they prevented the machines running robots from getting banned after querying that much.

--
pooyak.com

Re:Automated querying is Illegal by WillAffleckUW · 2005-08-15 06:52 · Score: 1

I'm wondering how they prevented the machines running robots from getting banned after querying that much.

If you follow the story link, you'll find the NCSA posted the source Perl code, so you can answer it yourself.

--
-- Tigger warning: This post may contain tiggers! --
Re:Automated querying is Illegal by greenash · 2005-08-15 06:57 · Score: 1

First of all, it's not "illegal", but merely against Google's Terms of Service. Breaking a contract isn't "illegal", and sometimes it's the right thing to do.

Secondly, they might have used Google's search API-- but I can't be bothered to check the Perl code.
Re:Automated querying is Illegal by Dachannien · 2005-08-15 07:00 · Score: 1

And in this case, there's no contract to break. But Google can still IP ban you if they want.
Re:Automated querying is Illegal by pooya · 2005-08-15 07:03 · Score: 1

Well, there's nothing special there other than that they query Google using wget with a faked IE user agent and then parse the HTML result. Hmm, maybe you mean the sleep between queries. Yes, that may help. But it is still against the terms of service, if not illegal, and I guess that means Google does not allow them to use its services.

--
pooyak.com
Re:Automated querying is Illegal by WillAffleckUW · 2005-08-15 07:06 · Score: 1

I meant the sleep - when looking for bots, you tend to look for regular intervals, or have a threshold between requests - to avoid being treated as a bot, use one or two of those methods.

--
-- Tigger warning: This post may contain tiggers! --

Nice and objective by Gumber · 2005-08-15 06:37 · Score: 1

Nice to take an anti-yahoo submission from a Google employee. I guess I should be happy they at least disclosed the conflict. It's more than you can say for someone like Bob "rove-puppet" Novak.

Clever idea - I think I will patent it by Anonymous Coward · 2005-08-15 06:38 · Score: 0

before someone else does.

Results of my own study... by Locke2005 · 2005-08-15 06:38 · Score: 4, Funny

Google only reports "about 4,820,000" entries for Britney Spears, while Yahoo reports "about 67,100,000" entries! This makes Yahoo more than 12 times better than google! Yeah, my methodology is completely fucked up... but then, so is the NCSA's!

--
I've abandoned my search for truth; now I'm just looking for some useful delusions.

Re:Results of my own study... by WillAffleckUW · 2005-08-15 06:42 · Score: 2, Funny

Google only reports "about 4,820,000" entries for Britney Spears, while Yahoo reports "about 67,100,000" entries! This makes Yahoo more than 12 times better than google! Yeah, my methodology is completely fucked up... but then, so is the NCSA's!

But that's because both Yahoo and Google cap results at 1000, so if you have more than that, it won't count for either engine.

--
-- Tigger warning: This post may contain tiggers! --
Re:Results of my own study... by Locke2005 · 2005-08-15 07:07 · Score: 1

As pointed out in the article, yes, both Google and Yahoo! are lying about the total number of entries. However, Yahoo is exaggerating much more than Google is... short of getting full access to do a complete audit on their databases, I can't think of a way to validate either company's claims.

--
I've abandoned my search for truth; now I'm just looking for some useful delusions.

Exactly by mopslik · 2005-08-15 06:38 · Score: 1

If $SEARCH_ENGINE returns 1,000,000 results, and assuming I can sift through each result at an astonishing rate of 1 per second, it will take me 1,000,000/(60*60) = 278 hours, or 11 1/2 days to wade through the junk.

The number of results is largely irrelevant. Give me quality filtering instead. Fortunately, Google does that for the most part.
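
The arithmetic above is easy to check. A small Python sketch, where the function name and the one-second default are just illustrative assumptions taken from the example:

```python
def review_time(results, seconds_per_result=1):
    """Return (hours, days) needed to read every result one at a time."""
    total_seconds = results * seconds_per_result
    hours = total_seconds / 3600
    return hours, hours / 24

hours, days = review_time(1_000_000)
print(f"{hours:.0f} hours, {days:.1f} days")  # prints "278 hours, 11.6 days"
```

That is non-stop reading with no sleep, so the real figure would be considerably worse.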

Proper name samples by jkauzlar · 2005-08-15 06:39 · Score: 5, Interesting

Let's try a few samples of proper names:

Search: Valerie Plame
Google: 908,000
Yahoo: 2,580,000

Search: "Boulder, Colorado"
Google: 1,600,000
Yahoo: 5,880,000

Search: "Linus Torvalds"
Google: 2,560,000
Yahoo: 5,870,000

I assume it goes on like this. Of course these exceed the 1000 maximum hit limit given in the study.

Re:Proper name samples by jkauzlar · 2005-08-15 06:52 · Score: 2, Interesting

Okay, here are some unlikely proper names which stay well within the 1000 maximum hit limit:
Search: "Dirk Bradford"
Google: 11
Yahoo: 15
Search: "Ronald Hendrickson"
Google: 170
Yahoo: 418
Search: "centerville baptist church" iowa
Google: 43
Yahoo: 37
Well that's less certain. It's hard finding words that return over zero but less than a thousand results...
Re:Proper name samples by squoozer · 2005-08-15 06:59 · Score: 1

I love Google. Search for squoozer: Google: 197 (estimated and a suggestion I change my name to squeezer.) Yahoo: 37

I didn't realize that I had posted that much on /.. I must get out more.

--
I used to have a better sig but it broke.
Re:Proper name samples by Zapdos · 2005-08-15 07:03 · Score: 2, Informative

From the Article:
However, in the case of Yahoo! the actual number of search results returned is only one-fifth the estimated total.

--
Get a free ipod.
Re:Proper name samples by jistanidiot · 2005-08-15 07:35 · Score: 1

At first I thought you were kidding as the study said neither search engine will return more than 1000 results. However I tried it and got similar numbers.

I'm just shocked such an obvious lie could be published here on /. The authors of the study and the /. editors should both be ashamed. I hope Yahoo! takes action.
Re:Proper name samples by Krach42 · 2005-08-15 07:40 · Score: 1

You need to watch for transliterations also though:

Michail Gorbatschow:
Google: 124,000
Yahoo: 229,000

Mikhail Gorbachev:
Google: 538,000
Yahoo: 1,760,000

Hm... it does appear that the pattern holds regardless, though.

--

I am unamerican, and proud of it!
Re:Proper name samples by Jonny_eh · 2005-08-15 08:05 · Score: 1

Who makes it past the first few pages of results anyways? After a certain point, the resulting web pages are meaningless. So does this mean that Yahoo returns more results, and consequently, more meaningless results than Google?

I'd be personally satisfied with a limit of 1 result, as long as it's exactly what I'm searching for. You get my point.
Re:Proper name samples by chickenrob · 2005-08-15 09:44 · Score: 1

try the word "fish"

I got yahoo:237,000,000. Google:10!?

I'm not even kidding! try it!

what the hell? am I missing something?

Google images even finds millions of images of fish, why wouldn't at least all of these come up for that word? "fishing" also returns many millions of results.

I don't get it.

--
People say my sig is the best thing about me.
Re:Proper name samples by Anonymous Coward · 2005-08-15 10:02 · Score: 0

Google must have hiccuped.. I got 61 million..
Re:Proper name samples by KermodeBear · 2005-08-15 10:31 · Score: 1

I tried searching for "Boulder, Colorado" and Yahoo gave me 6,150,000 matches. I was determined to see the very last link. So, I went as far as Yahoo would let me... And stopped at link 1000. I tried a few others ("Cleveland, Ohio", "Madison, Wisconsin") and it always stopped at 1000. They also omitted results with a little link to search again including the omissions. So, it's really quite difficult to say how many pages match exactly. This is probably why they were discarding record sets returning over 1000 matches. To be fair, Google limited the search results to 750 when searching for Boulder, Colorado, but went to 992 when I included "omitted results". It would be very interesting if someone were to be given access to Google's and Yahoo's machines to run queries that matched as many pages as possible, just to see how accurate those estimations are.

--
Love sees no species.
Re:Proper name samples by SETIGuy · 2005-08-15 13:57 · Score: 1

I did a search on my name on both Google and Yahoo. Yahoo claims 47,300. Google claims 38,700. Do I care which returns more? Not in the least, since nobody is going to look past the first few pages.
The big difference is that on Google, my home page is #1. On Yahoo, #1 is an innodb bug report I filed in January. A page about memory mapped files several links below my home page is #3, another of assorted links I like is #11, and my home page isn't in the first 100.
Which search engine do you think generates more useful results?

--
Support SETI@home

Eh? That won't work by squoozer · 2005-08-15 06:40 · Score: 1

While it would be interesting to know how many pages the big search engines index this isn't a way to measure the size of them. I am ready to be proved wrong but as far as I can tell this is totally flawed.

The number of results given isn't a measure of the size of the search set unless you also know the algorithm being used. If both search engines use an algorithm designed to just find pages with the given word and return all of them, then this will work. However, that isn't necessarily the case. I imagine both Google and Yahoo will return a smaller set of pages at times of heavy load, or possibly if you screw about and do 10000 queries from the same IP address in 5 minutes. To prove that this experiment doesn't work, why don't they come and test my super whizzy search engine? It will give them 999 results for every query.

--
I used to have a better sig but it broke.

Don't even contain the search term by Midnight+Thunder · 2005-08-15 06:42 · Score: 2, Insightful

The interesting thing is that the top three results make no reference to the word failure. Of course it is probably based on pages linking to these three, but I wonder if they should even be included for the lack of the search term?

--
Jumpstart the tartan drive.

Teoma is better than google or yahoo by JeffSh · 2005-08-15 06:47 · Score: 0, Flamebait

Teoma is better than google or yahoo, so i think the point is moot.

http://www.teoma.com/

MOD PARENT DOWN by Anonymous Coward · 2005-08-15 06:47 · Score: 0

Hyperlink spam

not so fast by betsywetsy · 2005-08-15 06:48 · Score: 2, Interesting

Looking at the first item in their result log, I'm unimpressed.
Yahoo returns 0 results, and Google returns... 4 different links to the ispell dictionary (or variants thereof).
('carbolization clambers')

Re:not so fast by betsywetsy · 2005-08-15 07:18 · Score: 2, Insightful

Testing further, so far I've found dictionary files in G's results in all of the edge cases in which neither engine returns significant results, and a couple of times in Y's results.

centerable's heterolecithal
or's depigmentation
apprizer's expense
inabilities hydrocephalic
unobservable Oistrakh
apparentness nucleophile ...

At this point, I think that while the conclusion that you'll get more results on Google arguably stands, the methodology of the test and the idea that anything can be concluded about the relative index sizes are clearly discredited.

(Thanks, Dr. K!)

Re:Queries with 1,000 results by Nos. · 2005-08-15 06:49 · Score: 1

True, but how many times, when searching, do you look past the first 1000 results? Heck, I rarely get past the first 20 or 30 before refining my search. I don't believe the usefulness of results past even the first 100 or 200 results should be considered when comparing search engines. An interesting survey would be how many pages deep a person will look when using search engines.

Re:Queries with 1,000 results by Whafro · 2005-08-15 06:51 · Score: 1

That's irrelevant in this case, certainly. This wasn't a judgment of what the best search engine is, but instead which search engine had more results. This was strictly quantity, and not quality.

That wasn't the point of the study by Anonymous Coward · 2005-08-15 06:54 · Score: 0

They were testing Yahoo's claims to be indexing more pages than Google. They found the claim to be false. The quality of the searches wasn't the subject of Yahoo's statements or NCSA's testing of those statements.

That's not to say it isn't an interesting question, but that it really wasn't relevant to the article.

Re:That wasn't the point of the study by ngunton · 2005-08-15 07:08 · Score: 1

Actually, the number of results returned has nothing to do (necessarily) with the size of the index. That was my point, which is relevant to the article. Besides, this is a discussion, and discussions have a way of, well, talking about things related to the article. Maybe Yahoo! just gives more relevant results, who knows. But just going by the number of results returned isn't a useful metric, in my book.

Quality vs. Quantity by Sigh+Phi · 2005-08-15 06:57 · Score: 1

The study only addresses the issue of size of the indices and returned results. Understandable, and it certainly debunks Yahoo's claims, or at least, makes them irrelevant -- what good is a 19 billion-page index if you don't actually get any more search results?

But the real utility of a search engine is the relevance of those search results. Google has been successful because its search results are relevant to a large portion of its users. The real question when comparing search engines is, can one help you find what you're looking for faster than another?

Yahoo may have a huge index, Google may return more results, but neither metric alone will tell you which one you actually want to use for general internet searching.

Re:Quality vs. Quantity by furry_marmot · 2005-08-15 07:36 · Score: 1

Agreed. I've seen plenty of instances on Google where it will return several hundred results on a very specific search and if you don't quote the phrase ("red bull" vs. "red" and "bull"), you'll start getting results that only have one of the words after the first page or two. Without some qualification of the relevancy of the results, the study is meaningless.
Come to think of it, about 10 years ago I was asked to evaluate a search engine for an intranet. The engine was hyped as being able to distinguish relationships between data (the example they gave us was an indexing of Civil War data involving family members, from which you could ask it "Who was X's father?" and it could tell you). It would have cost $10,000, plus its own server, since it brought the Sun Sparc Server the intranet sat on to its knees with each query.
Up to that point, we had been using a local search we got from some folks we knew at Excite.com (remember them?) that consisted of a binary and a Perl script. When we compared results, the fancy engine often gave us only 2 or 3 results, finding the exact page we were looking for. In this case, fewer results were considered better. The little Excite local engine would give us a couple of dozen results, but the correct page was always in the first 5 links.
For the cost and hassle, the "better" engine wasn't worth it and we passed. But I really think you have to define what "better" means before doing a study like this. Google's "more results" could just be more noise, not more information.

Re:Accurate results? Bad example by SirSlud · 2005-08-15 06:58 · Score: 1

Terrible example. Search for "http" ... MUCH more interesting. They don't even strip "http://" off the URLs when they do their scoring!

--
"Old man yells at systemd"

Re:Queries with 1,000 results by barawn · 2005-08-15 07:04 · Score: 2, Insightful

That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.

Well, there's a worse bias. They're grabbing words from an Ispell word list.

There are websites which contain the Ispell word list. There appear to be more of those returned in Google as results than in Yahoo. (here is one returned in Google for "apprizers expense", but which is not returned in Yahoo.)

This basically contributes a pedestal to their result - they'll never get zero results, because they'll always get the Ispell lists back, and because those results always return the same number (about 8 Google to 1 or 2 Yahoo), you'll bias the results of the entire set to that result.

They needed to remove results which are returned in common to multiple searches, as that's essentially double counting.
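
A rough sketch of that correction in Python, assuming the result URLs for each query have already been collected. The function name, the `max_queries` threshold and the example URLs are all made up for illustration; nothing like this appears in the study's code as far as I know.

```python
from collections import Counter

def drop_shared_results(results_by_query, max_queries=1):
    """Remove URLs that appear under more than `max_queries` different queries."""
    seen_in = Counter()
    for urls in results_by_query.values():
        seen_in.update(set(urls))  # count each URL once per query
    return {
        query: [u for u in urls if seen_in[u] <= max_queries]
        for query, urls in results_by_query.items()
    }

# Made-up result lists; words.txt stands in for an indexed copy of the Ispell list.
RESULTS = {
    "apprizers expense": ["http://example.com/words.txt", "http://example.com/a"],
    "carbolization clambers": ["http://example.com/words.txt"],
}
cleaned = drop_shared_results(RESULTS)
print({q: len(urls) for q, urls in cleaned.items()})
# prints {'apprizers expense': 1, 'carbolization clambers': 0}
```

With the wordlist page filtered out, the second query drops to zero results, which is the pedestal disappearing.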

Or the other way around by jetkust · 2005-08-15 07:06 · Score: 1

Maybe Yahoo indexes more useless pages than google does.

Re:Or the other way around by emcmanus · 2005-08-15 07:16 · Score: 1

There's no use in trying to discern a useful page from one which isn't, when indexing. It's a separate matter when it comes to returning results, but honestly this all goes to show the futility of lending a sense of importance to index size above a certain threshold in the first place.

you are WRONG by alarch · 2005-08-15 07:08 · Score: 2, Insightful

Try it. For example, search for "swans": you get 1 510 000 results, and the first one is the SWANS rock band site. Then search for "swan": 8 550 000 results, the first is some SWAN social network, and the rockers are not on the first page at all.

--
Deliriant isti Americani.

Re:you are WRONG by Anonymous Coward · 2005-08-15 07:12 · Score: 0

Even the example he supplied ("inkjet printer" vs. "inkjet printers") returns a different top result.

People, please mod him down. This might be +5 Interesting if it were actually correct!
Re:you are WRONG by WoTG · 2005-08-15 07:50 · Score: 1

OK. I admit it. I was wrong. And, no I didn't test the inkjet printer/s example either. Whoops.

But, you'll have to believe me, there was a time when Google was doing the plural thing -- I distinctly remember trying it several months ago when it raised a bit of interest in the forums at sitepointforums.com. If you google for "google stemming" you can find a few similar threads on various forums.

And, according to this page: http://www.google.com/help/basics.html (search for stemming), stemming is used to return results. So, the # of results returned for obscure searches will still be higher on Google than on a search engine that does not use stemming.
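
To illustrate why stemming raises result counts, here is a toy Python sketch. The stemmer is a deliberately crude one that only strips a trailing "s", which is an assumption made for the example and certainly not what Google actually does; the page texts are invented.

```python
def naive_stem(word):
    """Very crude stemmer: lowercase and strip one trailing 's' (illustration only)."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def count_matches(query, pages, stem=False):
    """Count pages containing the query word, optionally comparing stems."""
    norm = naive_stem if stem else str.lower
    target = norm(query)
    return sum(any(norm(w) == target for w in page.split()) for page in pages)

PAGES = ["cheap inkjet printers", "my inkjet printer broke", "swans on the lake"]

print(count_matches("printer", PAGES))             # exact match only: 1
print(count_matches("printer", PAGES, stem=True))  # "printers" now matches too: 2
```

The index is the same size in both calls; only the matching rule changed, and the count doubled.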
Re:you are WRONG by adpowers · 2005-08-15 08:28 · Score: 2, Interesting

Google does use stemming, I see it all the time. The results are still different, though, because I'm sure they weight the main query higher than the stems.

Also, you can see something similar to stemming when you search for certain acronyms. Try searching for [lotr] or [ada]. It also performs searches for the full version of the acronym, as you can see by the bold query in the snippets and title.

Where's the Correlation? by emcmanus · 2005-08-15 07:10 · Score: 1

I'm not exactly sure why Slashdot would choose to publish such a poorly conducted study as this.

The entire experiment is founded on the idea that there is a strong, if not direct, correlation between returned results and index size, which is absolutely ridiculous. Given that each engine's search algorithms are so closely guarded, there is no way to tell what sort of correlation there is between the number of results for random queries and the searchable index size. Without addressing this issue, this article looks to be nothing more than part of the typical Google fanboy fare posted here, and it's frustrating to say the least.

The Ladies Always Tell Me .... by Anonymous Coward · 2005-08-15 07:11 · Score: 0

It's not the size of the index .. it's how well you use it! ... course a large well used index is even better. har har.

Try viewing all Google's results by Anonymous Coward · 2005-08-15 07:13 · Score: 0

Have you ever tried to view the 10,000 pages that google returns? It's impossible. What's the point of saying that thousands if not millions of pages are found if only the first 400 can be viewed?

Do the simple test of searching for failure in google. Next, go to the very last page. Google claims there are 80,100,000 pages with failure but I could only view 899 pages while showing the omitted results. What and where are the other 80+ million?

Are more search results "better"? by Vellmont · 2005-08-15 07:18 · Score: 2, Interesting

There's an inherent assumption in the Yahoo claim that more==better. Do I really care if a search returns 1 million results vs 6 million results?

What I care about is actually getting the information I went out to find. There's only a certain amount of hits I'm willing to explore. That's probably on the order of 100-200 or so if I _really_ need the information. The implication by Yahoo is that more hits == better top ranked hits. Is that true? Really what should be done is just compare the top few hundred hits between the two search engines and see how they differ. Those are the only ones that matter anyway.

Where more results might prove useful is obscure searches with less than 100-200 hits. But if this study is true, Yahoo does a worse job on obscure searches than Google.

The problem of course is the type of obscure searches that this study performed. Two random words out of a dictionary just isn't what your typical person conducting a search engine query is looking for.

--
AccountKiller

Re:Are more search results "better"? by Supergibbs · 2005-08-15 08:13 · Score: 1

Well sure, I agree that I rarely look past the first page or two of search results. But it does mean something when one produces more, more data, which is valuable. Say you tweak your query to limit it more, then you are searching within the subset of the original results. The real test is, are all of the original results relevant? I could write an engine that very loosely returned results but they wouldn't be helpful.

--
First post! (just in case I am...)

Grammar by ryanov · 2005-08-15 07:24 · Score: 1

Should be "fewer" results, methinks. Doesn't NCSA have editors? Thought they were kinda up there as professionalism goes.

Re:Grammar by Anonymous Coward · 2005-08-15 07:45 · Score: 0

I second that. I found it very distracting.
Re:Grammar by deanj · 2005-08-15 10:32 · Score: 1

Does NCSA have editors? No.

Frankly, I'd be shocked if the higher ups there even knew this "study" existed.

I don't know about your results by jerryodom · 2005-08-15 07:26 · Score: 1

I don't know about 12 times. You've got to be realistic and do something that's up to date, as Britney is soooo last season. Jessica Simpson on the other hand occurs 3.78 million times in Google as opposed to 21 million in Yahoo. Google is gaining ground as Yahoo is now only about six times better than Google.

--
For some reason I refuse to use either spell check or the spacebar properly.

Flaws in methodology by brokeninside · 2005-08-15 07:28 · Score: 2, Interesting

1. Assumes that Yahoo's expansion is random. If the increase in Yahoo's pages is not random, then the results may be skewed. For example, Yahoo's expansion may have been mostly, or even entirely, in pages built of common words that all receive more than 1000 hits upon searching.

2. Assumes, as many people have stated, that Yahoo's expansion has been in English, since the study uses an English dictionary for its seeds. If Yahoo has expanded its database with non-English pages with few words that overlap into English, those pages will not show up in the study.

This study essentially determines that Google has a larger database of random, obscure English language words. Consequently, they demonstrate that Google is the superior search engine for finding obscure, random English words.

One additional check that they could have thrown in would be how many of the pages in the links presently deliver 404 errors. That would have been far more interesting to me than how well the search engines do at finding obscure and random English words.
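
Once the HTTP status of each returned link has been recorded, that check is a one-liner. A hedged Python sketch that takes the statuses as input rather than fetching anything; the function name and the sample data are invented for illustration.

```python
def dead_link_rate(status_by_url, dead_statuses=(404,)):
    """Return the fraction of URLs whose recorded HTTP status counts as dead."""
    if not status_by_url:
        return 0.0
    dead = sum(1 for status in status_by_url.values() if status in dead_statuses)
    return dead / len(status_by_url)

# Made-up statuses for four result links from one engine.
STATUSES = {
    "http://example.com/a": 200,
    "http://example.com/b": 404,
    "http://example.com/c": 200,
    "http://example.com/d": 404,
}
print(dead_link_rate(STATUSES))  # prints 0.5
```

Running this per engine over the study's logged result URLs would give a stale-index figure to set beside the raw counts.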

Of course they should be included. by douglips · 2005-08-15 07:28 · Score: 1

If you search for "hnc software", the first hit is Fair Isaac. The Fair Isaac web page has no mention of HNC. And yet, this is appropriate because HNC Software no longer exists since Fair Isaac bought them.

Google does the right thing, it's googlebombers who are messing with your head.

--
My amazing wife - Artist, Author, Philosopher - Laurie M

Conspiracy to evade the real comparison! by convex_mirror · 2005-08-15 07:33 · Score: 1

Why is the NCSA cowering from the Google vs. Infoseek comparison?

(yes, yes, I know it uses the Inktomi engine too - would you have preferred a prodigy reference?)

Those are estimates by mcc · 2005-08-15 07:35 · Score: 4, Insightful

Of course the study also demonstrates that on the searched terms, Yahoo's estimate numbers vastly overestimated the number of available results they actually found. So if the pages from the study are even close to representative in that regard then this would make the numbers you quote utterly meaningless.

Which is the entire reason, of course, why they kept the limits under 1,000 in the first place-- that for any number over 1,000, if the search engine says, say, "I found 2.5 million results for 'Valerie Plame'", you have no way to tell whether it's telling the truth or not.

--
Irritable, left-wing and possibly humorous bumper stickers and t-shirts

Re:Those are estimates by Krach42 · 2005-08-15 07:48 · Score: 1

It's not overestimating... it's rounding graciously...

--

I am unamerican, and proud of it!
Re:Those are estimates by Anonymous Coward · 2005-08-15 08:53 · Score: 0

you have no way to tell whether it's telling the truth or not.

Sure you do: use something like clustering to compare the results amongst themselves and then you'll know if you got a whole lot of duplicates or not. You could even double-check that every returned document indeed contained those words and, with a bit more analysis, determine if these pages were spam or actually talked about what the search engine said they did.

Just because this is computationally- and bandwidth-intensive, doesn't mean it is impossible!
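
A much simpler stand-in for the clustering idea, sketched in Python: keep a returned page only if it contains every query term and is not an exact duplicate of a page already seen. This is not real clustering or spam detection, it assumes the page texts have already been downloaded, and the function name and sample documents are hypothetical.

```python
def verify_results(query, documents):
    """Split returned documents into verified hits and rejects.

    `documents` maps URL -> page text. A hit must contain every query
    term as a whole word; exact duplicate texts are counted only once.
    """
    terms = query.lower().split()
    seen_texts = set()
    verified, rejected = [], []
    for url, text in documents.items():
        normalized = " ".join(text.lower().split())  # fold case and whitespace
        words = set(normalized.split())
        if normalized in seen_texts or not all(t in words for t in terms):
            rejected.append(url)
        else:
            seen_texts.add(normalized)
            verified.append(url)
    return verified, rejected

DOCS = {
    "u1": "Valerie Plame news",
    "u2": "valerie   plame NEWS",  # same text as u1 after normalization
    "u3": "Boulder weather",       # does not contain the query terms
}
print(verify_results("Valerie Plame", DOCS))
# prints (['u1'], ['u2', 'u3'])
```

Words are split on whitespace only, so punctuation stuck to a term would cause a false reject; a real check would need proper tokenization and near-duplicate detection.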

Figures never lie... by Anonymous Coward · 2005-08-15 07:36 · Score: 0

Figures never lie, but liars often figure.... Now where's the (*&% calculator?

Confused... by i.of.the.storm · 2005-08-15 07:39 · Score: 1

Well anyway, here's a site that does a comparison by putting Google and Yahoo in frames and doing identical queries. http://www.googleguy.de/google-yahoo/ But I thought Yahoo also uses Google's results in their thing. Although that might have changed. I remember a couple of years back Yahoo had a little thing on their search results that said powered by Google. So of course Yahoo will have more than Google since they are using Google's results along with their own. http://www.langreiter.com/exec/yahoo-vs-google.htm this site also does comparisons, but it shows a nice little graphical thing. I think it shows how Google and Yahoo results overlap. Oh and on a side note, I'm new here and I was wondering whether the term "slashdotted" means that a site was overwhelmed with traffic from Slashdot.

--
All your base are belong to Wii.

Size doesn't matter by gillbates · 2005-08-15 07:46 · Score: 1

Specificity does. A good search engine will find all relevant pages. A great search engine will list the page you're looking for in the first ten results.

Often times, the only thing a bigger index does is make the user scan more results before finding the page they want.

--
The society for a thought-free internet welcomes you.

Re:Queries with 1,000 results by Anonymous Coward · 2005-08-15 07:48 · Score: 0

It is about verifiability: they want to be sure they can compare the result sets independent of how each engine ranks them. This is an important point.

Personally I'd design an experiment where I also look at queries that return more results than can be checked, check the first 1000 and see if the ratios you end up with still seem to hold true. If they do, then the 1000 hit limit is an unnecessary constraint.

Your argument that the result is skewed by differences in "depth" of crawling etc. probably has little merit. For it to have any merit this would imply that global term frequencies in the search engine dictionaries would be skewed by "going deeper". There is no indication that this happens in any recent research I've seen.

Chris DiBona by ptarjan · 2005-08-15 07:50 · Score: 1

Just to let everyone know, Chris DiBona (the poster) is in charge of the Open Source arm of Google.

This means he is the one that pays me for the Summer of Code, so be nice!

Another Search Engine by Anonymous Coward · 2005-08-15 07:55 · Score: 0

Just an FYI for those who want to try something
other than Yahoo or Google. Look at
Teoma (teoma.com). I've been using it for a
while. Seems to work pretty well though I'm
not sure if it's quite as good as Google.
Here's a little blurb from their website:

Teoma's History

Teoma was founded in 2000 in Piscataway, New
Jersey by a team of scientists from Rutgers
University. Teoma means "expert" in Gaelic. Ask
Jeeves, Inc. acquired Teoma in September 2001.

And, no, I don't work for Teoma.

Clear evidence of this by Anonymous Coward · 2005-08-15 07:55 · Score: 0

Search for:
+the * *

Yahoo returns more results.

clusty rocks by Anonymous Coward · 2005-08-15 07:58 · Score: 0

I like http://www.clusty.com/. This meta search engine clusters results according to relevancy.

This test proves nothing and here's why by Anonymous Coward · 2005-08-15 07:59 · Score: 0

This research does not hold up, for what it's worth. First of all, Yahoo indexing X pages does not mean it will display a number of results proportional to those X indexed pages.

Using the Googlewhack search for "passalong louse",
Google returns 647 results for passalong + louse, but look at the results it returned. It has "passa over", "passa-long", "passalong", etc.,
while Yahoo only returns 3 results, all containing only "passalong".

So accuracy-wise, Yahoo wins on this.

the meaning is in the words by SecularG · 2005-08-15 08:00 · Score: 1

Yahoo has some 20 billion items and Google has some 8 billion pages. pages != items. I say that we query Yahoo! for how many pages they really have. Not items.

Re:the meaning is in the words by tommers · 2005-08-15 08:32 · Score: 1

I think the term items was used because Yahoo's figure included pages, images, and multimedia. They made a claim about pages as well as items. The 20.8 billion figure was a combination of pages (19.2b), images (1.7b) and audio/video (50m) http://www.detnews.com/2005/technology/0508/12/0tech-274198.htm

What this study fails to take into account ... by mshmgi · 2005-08-15 08:01 · Score: 1

This study assumes that both Yahoo! & Google rank pages the same way. WRONG!

Google's methods for determining a page's relevancy to a search term vary widely from Yahoo's methods.

In order for this type of study to have any validity, identical ranking methods would need to be employed over both indexes.

With Google pages do not have to have all words by trelony · 2005-08-15 08:13 · Score: 2, Insightful

With Google, a page can be found even if it does not itself contain the requested words; it is enough that other pages referencing it contain them.

Re:Queries with 1,000 results by TheRaven64 · 2005-08-15 08:13 · Score: 1

One of the lecturers in my department used to include a copy of the ispell (or maybe aspell) word list on his site, in a random order for the data structures module. The coursework in this module consisted of putting the word list into various data structures and searching / sorting it. One year, he got a visit from Interpol. Apparently they found a particular sentence in the middle (by using a search engine) which appeared to be related to some form of organised crime. Now he zips the wordlist...

--
I am TheRaven on Soylent News

Accurate Numbers by Anonymous Coward · 2005-08-15 08:20 · Score: 0

Many of the points made in the comments so far and what it would take to get accurate numbers for comparative purposes are mentioned in the Search Engine Watch Blog post from last Thursday.

http://blog.searchenginewatch.com/blog/050811-231448

What I want to know... by Anonymous Coward · 2005-08-15 08:22 · Score: 0

When I google for "jesus site:holy-bible.us", Google returns no results.

I could have sworn that Jesus is mentioned in the bible somewhere...

Much of the contents are not even real by Anonymous Coward · 2005-08-15 08:31 · Score: 0

My guess is that yahoo just indexed a lot of data from searching other search engines with bot spiders.
After searching for myself in Yahoo I found a site I had about 8 years ago, and that is already dead (was in geocities) for more than 4 years. Even more interesting... the contents of the site were the first version, not even the one in the site when it was closed.

It's not the size that counts.... by DaSpudMan · 2005-08-15 08:33 · Score: 1

Size doesn't count, it's how you use it. Or so I've been told. Er, um, .... of course *I've* never been told that, I'm just repeating what others say. Move along now.

--
> > >We don't need no steeekin'.....oh wait, my wife says we do.

Included but not treated the same by snowwrestler · 2005-08-15 08:34 · Score: 1

Google automatically includes stemming in searches, but not necessarily at the same ranking of the original search term. So while searches for "inkjet printer" and "inkjet printers" will not return the same results list, many of the results from each will be included somewhere in the results list of the other.

--
Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.

Dibona = kiss of death for company by Anonymous Coward · 2005-08-15 08:37 · Score: 0

Just don't take that payment in Google stock. Chris DiBona is the kiss of death for a corporation: he has NEVER worked for a profitable company and several have tanked while he was there. He is the kiss of death for a company.

Holy lack of IR statistics understanding, Batman! by freality · 2005-08-15 08:41 · Score: 4, Interesting

The most basic measure of performance in Information Retrieval is precision vs. recall.

Precision is how many of the results that you return are correct. e.g. If Google returns 100 results and 10 of them are correct, then the precision on that query is 10%.

Recall is how many of the correct results you return. e.g. If Yahoo returns 100 results out of a total 1000 correct matches, then the recall on that query is 10%.

Information retrieval systems such as search engines balance these two metrics -- which are fundamentally at odds with each other -- to give the "best balance" in the eyes of the system's designers.

The NCSA study basically misses the effect this decision would have on perceived size of index.

A simple demonstration shows how it works.

First let's say both search engines have the same index size: 10B pages. Second, let's say both search engines have exactly the same a priori capability for precision and recall, but can tune for a preferred performance. Yahoo decides it wants to favor more precise results over more results recalled, at a 2:1 relative ratio compared to Google.

In that case, any given query will show half the hits from Yahoo as compared to Google. Concluding Yahoo's index to be half the size of Google's, given this result, would be incorrect.

Furthermore, without knowing the precision/recall performance of either system, they can only demonstrate a lower-bound on index size, and that certainly doesn't predict average or max index size.
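A toy sketch of that demonstration, assuming a made-up 100-document index with integer match scores and a hypothetical ground-truth set of relevant documents. Nothing here reflects how either engine actually scores pages; it only shows that two cutoffs over the same index give different hit counts:

```python
def precision(returned, relevant):
    """Fraction of returned results that are relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned, relevant):
    """Fraction of all relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0


def search(index, threshold):
    """Return ids of documents whose match score meets the cutoff."""
    return [doc_id for doc_id, score in index.items() if score >= threshold]


# One shared index: doc id -> made-up match score (0..99).
index = {doc_id: doc_id for doc_id in range(100)}
# Hypothetical ground truth: the 30 highest-scoring documents are relevant.
relevant = [doc_id for doc_id in index if doc_id >= 70]

loose = search(index, threshold=50)   # tuned toward recall
strict = search(index, threshold=75)  # tuned toward precision

for name, hits in (("loose", loose), ("strict", strict)):
    print(name, len(hits), precision(hits, relevant), recall(hits, relevant))
```

The strict engine reports half as many hits as the loose one even though both search the identical index, so the hit count alone says nothing about index size.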

Scientific method anyone? by cenobyte40k · 2005-08-15 08:52 · Score: 1

This really does not work out well. The idea behind doing any kind of research is to eliminate all the variables except the one that you are testing. That way you are not trying to compare apples to oranges. Unfortunately this 'study' failed to take into account the system for returning results, the system for indexing the pages, and the system for applying weight to something. Google (my favorite search page) tends to return results because something that links to that page or is linked from that page matches your search. Yahoo however does not seem to do that even 1/10th as often. This could account for lots and lots more results. In fact I am sure that I could build a search engine that could index less than a million pages and turn out more results than Google every time if I make my search engine open enough in the way it returns results. Personally I am more upset about this than the stuff that is obviously opinion. When it's opinion only fools (and there are plenty of them on this site) mistake it for science and fact, but this is like watching a Michael Moore movie.

No difference by Incontinent · 2005-08-15 08:56 · Score: 1

Whenever I am searching I rarely notice a difference between Google and Yahoo's results, at least on the first couple of pages. Yahoo may claim to have twice as many indexed pages; however, I have yet to see any sign of it in the results of my queries.

Results of this study are not accurate by Anonymous Coward · 2005-08-15 08:57 · Score: 1, Insightful

I did a few spot searches myself and one thing that makes a huge difference is google does "smart" searching.. if you type in a phrase and google suggests that you meant something else, it will search for that as well and combine the results. This would give google a larger result set. Therefore it is impossible to determine whose index is bigger because the way they build their search results is inherently different.

quality not quantity by azbot · 2005-08-15 08:59 · Score: 1

If I'm searching for something, I want to find it. I don't want to have to search through extra data. This article seems to point out that it is "estimating" the amount of results returned. Which I think is unimportant. What I think is important is the validity of the results to the query I type. I don't see how these figures show "Quality" any more than "Quantity".

Why not try this... by HacTar · 2005-08-15 09:00 · Score: 1

I agree that searching random words can't be considered a real test. For a real world test (with not too many results) I searched my name and surname on both Yahoo and Google and found that the number of results was quite similar, but curiously some websites were listed only on yahoo and some only on google.

It could be interesting to do statistics on that for a large number of people. Anyway, better two search engines than one!

Their finding is inaccurate by fani · 2005-08-15 09:03 · Score: 0

.... or Google stopped publishing certain results which are unlikely

Consider this search --
Terms: centerable's heterolecithal
Google totals:
Duplicates Omitted Estimate: 3
Duplicates Omitted Total: 3
Duplicates Included Estimate: 3
Duplicates Included Total: 3

Yahoo totals:
Duplicates Omitted Estimate: 0
Duplicates Omitted Total: 0
Duplicates Included Estimate: 0
Duplicates Included Total: 0

I typed the term "centerable's heterolecithal" into both Google and Yahoo and they both return 0 results. The author claims 3 from Google ????

Hope he hasn't doctored any results to show in favor of Google.

Anyways this whole survey was a pointless waste of time and to add to it my time too.
Please mail me $50 for my time wasted.

Re:Their finding is inaccurate by MushMouth · 2005-08-15 10:30 · Score: 1

I actually get three pages of straight up crap from Google. Maybe the truth is that Yahoo! is better at filtering SEO pages.

I am not interested in numbers by houghi · 2005-08-15 09:08 · Score: 1

I am interested in results. I understand that more pages indexed normally means a better result. However I just want to get to the information I can use for whatever it is.

http://vivisimo.com/ is an engine I like using, because there you can get to things that you want rather quickly without the need of looking through pages and pages of non-relevant pages.

--
Don't fight for your country, if your country does not fight for you.

Results for search by Anonymous Coward · 2005-08-15 09:10 · Score: 0

The point (IMO) is how many of the results are useful for my search.
How many of the results (the comparison by NCSA says A gives xx.x% more than B and such bs) are *really* what I was searching for?
Rhetorical answer: very few.
Even worse, many of them today are search engines that point to search engines that point to search engines and so on, in a meaningless loop.
The main richness of a (let's say) Yahoo or Google or such is no longer in the number of indexed objects (hey, we're still talking about billions of items; one more or one less doesn't change the matter), but in the correctness of the answers.

In a related study at a local sports bar... by snowwrestler · 2005-08-15 09:14 · Score: 1

Queries for all-night love machines with dicks over 12" in length returned several hundred results!!

Actual mileage may vary!

--
Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.

Let's not forget... by sootman · 2005-08-15 09:27 · Score: 1

...if you do a search for something on Google and it comes back with a small number of results, and you get to the last page, it often says "In order to show you the most relevant results, we have omitted some entries very similar to the 8 already displayed. If you like, you can _repeat the search with the omitted results included_." So the dupes are there to be had if you want'em.

--
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.

Googlebomb by medgooroo · 2005-08-15 09:34 · Score: 0

Do people even yahoo bomb? The price of being the biggest.

--
Brain(s): 0.0% user, 1.3% system, 0.1% nice, 98.6% idle

Yahoo's estimates are bogus. by kavau · 2005-08-15 09:46 · Score: 1

Yeah, except that Yahoo's estimates are bogus. See for yourself: search for 'arabesque hard disk screws' or some other rather obscure term. Yahoo reports about 602 results for this example. Now click through the result pages, click "repeat with the omitted results included", again click through the result pages. Where do we end up? A lousy 120 results! 602 results my @$$! And this is just one random example, I tried many!
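For scale, a small sketch of the arithmetic behind that complaint, using only the two numbers quoted in this post. The function name is made up for illustration:

```python
def overestimate_factor(reported, retrievable):
    """How many times larger the reported estimate is than what can be listed."""
    if retrievable <= 0:
        raise ValueError("need at least one retrievable result")
    return reported / retrievable


# 602 results reported, 120 actually reachable by clicking through.
factor = overestimate_factor(reported=602, retrievable=120)
print(f"estimate is about {factor:.1f}x the results actually shown")
```

A single query proves little; the same ratio would need to be collected over many queries before calling it a pattern.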

Bogus "estimated" values. by douglips · 2005-08-15 10:29 · Score: 1

Yahoo says it will give you 418 "Ronald Hendrickson" pages, but only gives you 110.

Yahoo gives you the full 175.

--
My amazing wife - Artist, Author, Philosopher - Laurie M

Re:Bogus "estimated" values. by Murasaki+Skies · 2005-08-15 21:00 · Score: 1

Yahoo has a split personality?

--
Waiiii!!!!!! I have bad karma!
Re:Bogus "estimated" values. by douglips · 2005-08-16 16:21 · Score: 1

Yeah, "Google gives you the full 175". Bastard slashdot - I kept trying to follow up, but it was "Slow down cowboy!" for a good long time, so I said "Screw it, nobody reads at +2 anyway."

--
My amazing wife - Artist, Author, Philosopher - Laurie M
Re:Bogus "estimated" values. by Murasaki+Skies · 2005-08-16 20:35 · Score: 1

Slow down cowboy! I read at +2 sometimes (for unpopular articles).

--
Waiiii!!!!!! I have bad karma!

conspiracy theory by kaoshin · 2005-08-15 10:31 · Score: 1

Perhaps this is biased? An American Civil Liberties Union supporter with a personal interest in "race relations" must certainly have a nitpick with Yahoo after the lawsuits of the organization against Yahoo, and what with nazi memorabilia being posted and so forth. Perhaps that was the motive behind the slant in this research?

Let's settle this by Criffer · 2005-08-15 10:52 · Score: 1

Let's settle this once and for all - which is the better search engine - in the only way possible:

GoogleFight!

New study... by vwjeff · 2005-08-15 10:57 · Score: 1

Clearly the study is Google biased. Why look at these results from Google. Then, look at these results from Yahoo. Clearly, Yahoo is the winner with more results.

Oh wait. Um, I guess that's not a good example. In this case I would take Google due to fewer results. No, no, I would take neither.

If you're wondering, I'm watching Family Guy right now. Yeah, that's my excuse. What's yours you pervert?

There is a difference between... by Anonymous Coward · 2005-08-15 11:31 · Score: 0

... the number of pages that are "indexed" and those that are actually IN the "index"... For example, using a weblog processing tool I wrote I discovered that search engines will frequently access the same page over and over again... So Yahoo may be calling their "index" size the number of items that have been indexed... rather than the number of items actually in the index... For some this is a semantic difference, for others "truth economics"... Also, when you check the estimated pages for any particular site for example: http://www.google.com.au/search?hl=en&safe=off&q=site%3Aslashdot.org&btnG=Search&meta= and: http://search.yahoo.com/search?p=site%3Aslashdot.org&prssweb=Search&ei=UTF-8&fr=FP-tab-web-t&fl=0&x=wrt you can see that they tend to vastly overestimate the number of pages that a site has... A more useful way of estimating the index sizes would be to use the "site:xyz.com" searches for both... Of course the robots.txt file for each site would need to be considered however in case the webmaster(s) have decided to lock out a particular engine...

Google Dictionary results?! by fprog · 2005-08-15 11:46 · Score: 1

While the Perl script is both nice and readable: http://vburton.ncsa.uiuc.edu/compare.txt

The log results are shown here: http://vburton.ncsa.uiuc.edu/searchresultlog.txt

For instance, the following queries were supposed to give 5+ results on Google and no results on Yahoo, so let's see if that actually works...

Sometimes you get no results on Google "on the first tries" go figure... "that server is down/busy?!"

If you get any results, they are the same ispell dictionary word list repeated over and over!

I don't know about you but that's pretty useless...

Also, the fact that both search engines limit output to the first 1000 results is pretty useless. How can we know for sure there are 100000+ results for apple if, after page XYZ, results are truncated?

Here are some queries:

http://www.google.com/search?hl=en&q=carbolization+clambers
http://www.google.com/search?hl=en&q=anecdote%27s+displosion
http://www.google.com/search?hl=en&q=centerable%27s+heterolecithal
http://www.google.com/search?hl=en&q=unobservable+Oistrak
http://www.google.com/search?hl=en&q=misanthropizes+multiplications
http://www.google.com/search?hl=en&q=buttonmould+gradated
http://www.google.com/search?hl=en&q=myocardiograph+overheard
http://www.google.com/search?hl=en&q=pinions+platitudinize
http://www.google.com/search?hl=en&q=sloppiness+coeducationalizes

I dont know what this article is talking about ... by ASUSanator · 2005-08-15 11:56 · Score: 0

I don't know what this article is crapping on about but in my (admittedly limited) test i got far more results using yahoo than google.

Study is flawed and a real world analogy by GatorBait · 2005-08-15 12:18 · Score: 1

This is not a valid study. The biggest issue is that it selectively chooses search terms which yield fewer than 1000 results. For instance the query ipod would not count. But perhaps "my monkey broke my ipod in my camry" would count. The point is that the study proves Google has more results for very obscure queries that yield very few results (under 1000). Give it to the NCSA: they did the best they could to do a study with whatever information was available. It got people to at least think about it.

WTFlaw? by millennial · 2005-08-15 12:26 · Score: 1

Unfortunately, both the Yahoo! and Google search engines truncate results returned to the user after 1,000 results. Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google. Any search result found to have more than 1,000 returned results on either search engine was disregarded from our sample. [3]

So. Let's say that Yahoo! and Google didn't restrict the results to 1000. Let's say that some search returns 1,000,000 results MORE on Yahoo! than on Google. Let's say this happens many, many times.
This entire study would be invalidated.
Since you can't know how many results there actually were for each over-1000-result search, how can you tell which engine has more pages indexed?
If Yahoo! had 20,000,000,000 pages indexed and Google only 9,000,000,000, and one term appeared in every single page, you would get a result of "1000" for each engine, even though the real difference is 11 billion!
Flawed? I think so.
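A minimal sketch of that truncation argument, using the hypothetical index sizes from this post. The cap constant and function name are made up for illustration, and the model assumes the only thing an observer can count is what the engine actually lists:

```python
DISPLAY_CAP = 1000  # both engines stop listing results after this many


def visible_results(true_matches, cap=DISPLAY_CAP):
    """Number of results a user can actually page through."""
    return min(true_matches, cap)


# A term that appears on every indexed page (hypothetical index sizes).
yahoo_true = 20_000_000_000
google_true = 9_000_000_000

visible = (visible_results(yahoo_true), visible_results(google_true))
hidden_difference = yahoo_true - google_true

print(visible)            # what an outside observer can count
print(hidden_difference)  # what the cap hides
```

This is why the study only kept queries under the cap: above it, the countable numbers are identical no matter how different the indexes are.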

--
I am scientifically inaccurate.

Yahoo is CHEATING! by simos · 2005-08-15 13:02 · Score: 1

Yahoo appears to report more hits than it has.
For example, try out "dorani". I do not know what it means, but it's a good choice as it shows ~20000 hits. Clicking on the next result page, we can see the full results.

Yahoo(dorani), 1st page: 17,000
Yahoo(dorani), 6th page: 16,800
Yahoo(dorani), 90th page: 4,220

Google(dorani), 4,330

I have seen this pattern consistently for terms bringing between 15,000 to 100,000 results in Yahoo.
When Yahoo is asked to show the results, they diminish.

Google's search results may be inherently bigger by Anonymous Coward · 2005-08-15 13:12 · Score: 0

As mentioned in other posts here, Google's search results include pages linked from pages containing the search term[s]. That is, the documents in the search results may not themselves contain the search terms. If Yahoo's search results do not include such pages, then we can expect a systematic bias in favor of Google when counting the number of pages indexed based on the number of search results.

The point is, the study does not check whether the pages referenced in the search results do indeed contain the search terms. This extra check should be fairly easy to do with a small random sample.
The results of the sample can then be used to calculate a bias coefficient (scaling factor) for each of the search engines.
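A rough sketch of how such a bias coefficient might be computed, assuming the sampled result pages are already available as text. The function names, the sample data, and the reported count of 500 are all hypothetical, and a real check would also need a confidence interval for the sampled fraction:

```python
import random


def bias_coefficient(results, contains_terms, sample_size, seed=0):
    """Estimate the fraction of results that really contain the search terms.

    results: list of page texts returned for one query.
    contains_terms: predicate deciding whether a page contains the terms.
    """
    if not results:
        return 0.0
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(results, min(sample_size, len(results)))
    hits = sum(1 for page in sample if contains_terms(page))
    return hits / len(sample)


def corrected_count(reported_count, coefficient):
    """Scale a reported result count by the measured coefficient."""
    return reported_count * coefficient


# Hypothetical result set: 60 pages contain the term, 40 only link to it.
results = ["... aleut handshaking ..."] * 60 + ["... unrelated text ..."] * 40
coefficient = bias_coefficient(
    results,
    contains_terms=lambda page: "aleut" in page and "handshaking" in page,
    sample_size=len(results),  # sampling everything keeps the demo exact
)
print(coefficient, corrected_count(500, coefficient))
```

Running the same measurement against each engine would give one scaling factor per engine, which could then be applied to the raw counts before comparing them.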

Why is NCSA getting involved??? by Anonymous Coward · 2005-08-15 13:27 · Score: 0

I checked out the NCSA home page and I didn't find anything about advertising assistance for google vs. yahoo. This is a waste of money. Who pays for these people?

What's next? Is NIH going to do a taste test between coke and pepsi???

Re:And when I search for "Linux" via Google... by cduffy · 2005-08-15 13:32 · Score: 1

So what was your point again?

Obviously, that Yahoo! is sufficiently interested in advertising income as to compromise the quality of their results by returning sponsored links which aren't clearly and obviously marked as such whereas Google (while still seeking such income) has the decency to avoid permitting their ads to disrupt the user's attempt to read for non-sponsored results.

Yahoo! puts the interest of their advertisers above the interest of their users. Google serves their users first -- and, by doing so, attracts the eyeballs with which to gather advertisers even without using dirty tricks to get their ads viewed.

Re:Quality Quantity by 51mon · 2005-08-15 13:46 · Score: 1

Okay try some large results searches.

"http" 2.1 billion Yahoo claims, 2.36 billion Google claims, and Google was back in half the time.

But Yahoo gives you a page titled "Hypertext Transfer Protocol Overview www.w3.org/Protocols", Google gives you the Microsoft website (huh?) in first place.

Hmm, after trying a few others I've decided I need a simpler ranking system, as the results are far too evenly balanced to call.

Yahoo gives me higher ranking for my own name than Google, so clearly it has a better algorithm, anyone who disagrees will have to fight my ego.

Mutual search results by jawahar · 2005-08-15 15:30 · Score: 1

We see interesting results by searching for "yahoo search" in google and "google search" in yahoo.

By the way this is only the first step for building great search engines as outlined in http://slashdot.org/comments.pl?sid=154275&cid=12948223

--
Slashdot = Sarcasm

SEO Spammers by crucini · 2005-08-15 16:13 · Score: 1

Simple exercise: search "aleut handshaking" on both engines - no quotes. Google gets 114 hits, Yahoo gets 32. Yay Google. Now take a closer look at those hits.

Yahoo is better than Google at blocking out these "Search Engine Optimizers" aka spammers.

Re:SEO Spammers by dtietze · 2005-08-15 20:29 · Score: 1

That's almost exactly what I was thinking, as well. They use the ispell dictionary for randomly generated searches. OK. Search-Engine Optimization SPAM and link farms use publicly available word lists (such as the ispell dictionary) to generate bogus pages containing combinations of these words.
So - this study is absolutely NOT scientific. All they've proven is that spammers manage to pollute the Google search index quite effectively. More effectively than the Yahoo! index.
Dan.

NEVER ASSUME, TRY BOTH -- Re:Accurate results? by dysonlu · 2005-08-15 17:15 · Score: 1

Never assume Google always returns the best (most accurate) set of results. Example from real life: Just 10 minutes ago, I wanted to know the weight of the Head Ti Radical tennis racquet that Andre Agassi formerly used. My query was "agassi head ti radical weight". After going through four pages of results from Google, you couldn't locate the info. Tried Yahoo! and its very first result was SPOT ON!

Re:And when I search for "Linux" via Google... by Anonymous Coward · 2005-08-15 17:17 · Score: 0

You're stupid. The links are clearly marked as sponsored links.

Relevance by billybob · 2005-08-15 18:33 · Score: 1

I'm sure that the relevance math involves some complicated formulas, but simply determining a "match" is simple. Does the document contain the search terms? Yes or no. Google finds more results. That leads me to conclude they have more pages indexed.

--
Joseph?

Re:Relevance by Anonymous Coward · 2005-08-16 06:44 · Score: 0

Quoting from bigwavejas in one of the first posts:
Try searching for the word, "failure" in Google and check the results.
Now Ctrl+f to find "failure" on the first result.
Surprise!!

The obvious solution... by Anonymous Coward · 2005-08-15 20:53 · Score: 0

...to this question is simply for Yahoo and Google to print a copy of their respective caches. Then, assuming that the same font/size is used, it should be easy to identify which is larger.

Honestly, sometimes it's real easy to overcomplicate matters.

Re:And when I search for "Linux" via Google... by cduffy · 2005-08-15 21:18 · Score: 1

Marked, yes. Clearly, no. I'm simply a casual reader in this context -- I don't notice things which aren't in the way of my eyeball, and the way Yahoo! formats their notice of those results as sponsored, said notice isn't.

Error in reasoning: Assumes same algorithm! by Theovon · 2005-08-16 04:58 · Score: 1

The one mistake these researchers are implicitly making is that they assume Yahoo and Google are both using the same search algorithm.

Perhaps Google is just better at matching query strings to results, because it finds more relevant results. Or perhaps Yahoo is better because it excludes more irrelevant results.

Either way, this says nothing about the size of the database. In information retrieval, there are these terms, "precision" and "recall". I forget exactly what they mean, but they have something to do with how many results you get that are correct compared to how many results you get and something to do with how many correct results you get compared to how many correct results exist in the whole index. Something like that. Anyhow, surely, Yahoo and Google will differ, and THAT is what we're measuring here.

Google Fights by paulsnx2 · 2005-08-16 06:05 · Score: 1

www.googlefight.com

Yahoo wins, 295,000,000 to 270,000,000

Oddly, a manual Yahoo search yields:

Yahoo wins, 866,000,000 to 473,000,000

One wonders if the methods of this paper err in assumptions about the types of content being indexed. If the increase in pages indexed by Yahoo is due to formal, published content, or non-English content, or (pick an option), then it might not translate into more hits given obscure word combinations. That is because the additional content isn't a random selection of possible web pages.

Just a thought.

Slashdot Mirror

395 comments