NCSA Compares Google and Yahoo Index Numbers
chrisd (former Slashdot editor and now Google employee) writes "Recently, Yahoo claimed an increase of index size to "over 20 billion items", compared to Google's 8.16 billion pages. Now, researchers at NCSA have done their own, independent, comparison of the two engines. "
Honestly, when I first heard the news over the weekend I thought "rubbish, they must be ignoring requests for spiders to go no further or something." I guess NCSA can either 1) Expect no gifts from Yahoo OR 2) Report significantly different results after a sizable gift to NCSA.
75% less truth than other leading brand
A feeling of having made the same mistake before: Deja Foobar
Try searching for the word, "failure" in Google and check the results.
This brings into question *accurate* results. In this case it appears that's left to interpretation.
"Simplify, simplify, simplify!" Thoreau
It will take a while.
"Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results. It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Googles index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google. "
I was wondering how accurate were the results that the companies themselves reported. Or are they accurate, but they just spidered sites that don't matter to anyone?
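For anyone who wants to sanity-check the percentages in the quoted conclusion, here is a minimal sketch that recomputes them from the raw counts given in the quote. The function name `share` is made up for illustration; only the four counts come from the study text.

```python
# Counts taken from the quoted study conclusion above.
TOTAL_CASES = 10_012
YAHOO_MORE = 307
GOOGLE_MORE = 9_676
TIES = 29


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# The three buckets should account for every test case.
assert YAHOO_MORE + GOOGLE_MORE + TIES == TOTAL_CASES

print("Yahoo returned more: ", share(YAHOO_MORE, TOTAL_CASES), "%")
print("Google returned more:", share(GOOGLE_MORE, TOTAL_CASES), "%")
print("Same count:          ", share(TIES, TOTAL_CASES), "%")
```

The counts add up to the stated total, and the shares come out close to the "3%", "96.6%" and "less than 1%" figures in the quote.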
Send email from the afterlife! Write your e-will at Dead Man's Switch.
but they can't sift through it nearly as well as Google, so what does it matter? Even if you have a bigger dictionary, if you can't speak English at all it won't do you much good.
I never spellcheck and I freely admit it. Save your karma for more worthwhile "lol erorrs" replies
Just use both...Then you'll be certain to have a nice unbiased search result. ;)
I'm working on a good joke about your mom being
All Google does is index the web. In this case, it seems like there are more web pages/more highly linked pages about GW being a failure than anyone else.
Is this that hard to believe? What would you rather it return for such a query? A dictionary definition? If you want a dictionary definition, use the define: operator.
Trust me - GW will not be on the top of the failure list forever. In another few years we will have a new most-hated person. This is the nature of a real web index, because it is the nature of the web, and of society itself - it is fickle.
Sorry, but if Google consistently returns more results, it could just as easily mean that the filtering isn't as good.
I still prefer Google though.
in short: truth is that size does not matter.. the hype behind bigger the better is *false*, just like its for penises :)
In other words, they believe Google indexes more items based on their own tests of searching.
Tech, life, family, faith: Give me a visit
They only used words from the English Ispell word list. Besides the english-language bias, this is probably limited in other ways. News websites use a limited vocabulary, but a lot of proper names -- so if one engine indexed these better, they wouldn't necessarily get a better rating. News sites are also very dynamic and have a large number of webpages, so they would be influential in the count.
HIV Crosses Species Barrier... into Muppets
Yahoo returns a lot of dupes.
They may have more unique information simply further down the result list, but since the search engines terminate the results at not quite 1k (1,000), the researchers have no way of testing that out.
All they can really show is that google returns more unique results per 1000 (which usually means that more items are indexed, but could be from Google's Pagerank also)...
There is always a frontier where there is an open and willing mind
Why wget instead of LWP?
(B) + (D) + (B) + (D) = (K) + (&)
TFA notes that queries with greater than 1,000 results were dropped from the survey, because Google and Yahoo both truncate their results to 1,000.
That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.
Those could be the larger sites, where Yahoo is perhaps digging deeper, requesting data from forms, ignoring robots.txt, etc. It could be where they're getting those big claimed numbers of indexed documents.
This boils down to the real numbers that matter. It doesn't really matter if your index is "bigger" or not, it is about the results that are returned. The other thing that matters (and can't really be measured in a scientific manner) is relevance. It's easy to return results for a set of words, it is hard to return relevant results for a set of words. My personal experience is that Google returns more relevant and better ordered results than Yahoo!.
- AMW
To me, the test is googling myself and seeing what comes back. Google seems to favor mailing lists high in its results so all the stupid things I've said over the years are right up there on front. Of course, I think Google is more accurate because things actually attributed to me show up higher in the results, but is that actually correct? I don't know.
The researchers in this article took as close to a scientific method as one can get for something like this. This just tells us exactly what has been known for a while: yahoo just plain sucks at giving good results.
The big flaw in this test, IMO, is that it assumes quantity of results is as good as quality of results. I couldn't care less if a search results in 10,000 hits or 100,000 hits. All I really care about is did it return the 1 or 2 hits that actually have the information I'm looking for and are they high up in the results?
"Number of documents indexed" is a worthless pissing match as far as I'm concerned.
Surely it's the quality of the results that counts, rather than the quantity? Who needs 1,000,000 matches anyway, when most people don't go past the first page or two of the results? The article doesn't talk at all about how relevant the matches were. I'm not saying that it invalidates their study, but I would say that any search engine that returns millions of hits for any query is simply showing off. Give me a search engine that shows me fewer matches, but the best hits anyday. Lately, Google has increasingly been giving me a bunch of useless links when I search for stuff. For example, looking for reviews on various bits of hardware just gives you a bunch of websites that are selling the products, and *seem* to have reviews, but then you go to the page and it says something like "no reviews have been posted". Lots of ghost towns out there on the web these days. Anyway, the point holds: Give me relevant results and allow me to screen out the marketing junk and link farms. Beyond that I don't really care how many pages they have in the index.
In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results.
...
I don't understand what would make someone want to compete against Google anymore. Sure, if you've got technology in place like yahoo, keep it going, but still
Google is synonymous with searching the internet.
Google is a verb
*DrugCheese rants*
It seems to me that when Slashdot publishes an article that is favourable to Google, that was submitted by a Google staff member, one might question whether someone involved has a conflict of interest. It's not astroturfing, because his employment at Google was clearly mentioned. It might be an ad (or more correctly, a press release) masquerading as news. I wonder if the article would have been published had it been submitted anonymously...
Why is the NCSA cowering from comparing Google and Yahoo to infoseek? The wool has been pulled over your eyes people!
No one gives a fuck whether it is 8.16 billion or 20 billion. No matter what, it is 99.9999% useless shit. Is the largest catalog of useless shit really something to aspire to?
I don't know about the study but that is the most readable perl code I have seen in a long time.
It's a nice test but I fail to see how they can extrapolate this to be true for all searches.
Don't forget that a lot of queries also get hand-tuned at google/yahoo to give the proper result set.
Also keep in mind that size doesn't matter but relevancy does!
And they both cheat at that as well: they just give back the highly ranked pages for those words. Works ok for a lot of people but hardly relevant.
This is a great article! I wish there were more like it on slashdot. It's scientific instead of an opinion piece, it has references, it's repeatable. It's also short and very readable, unlike a lot of science papers.
OK, it is yet another Google piece, but it's not "some junior analyst predicts Google will buy Apple and release OSX86box 720".
I quit!
The study noted that although Yahoo says they have ~twice as many pages indexed as google, when they queried each engine with two arbitrary words from the dictionary, they got fewer responses from Yahoo.
From this they concluded yahoo's claim of twice as many pages is suspicious.
What's suspicious is that these people consider themselves scientific. What if, for example, Yahoo just returns meaningful results, whereas google returns anything with those words in? For example, what if you search for "faience" and "urbanity": maybe google has more results, but maybe they are less pertinent. In other words, maybe Yahoo not only has more pages indexed, but also has an algorithm that returns only the most relevant stuff.
Not saying that's the case necessarily, but not mentioning that assumption makes for a worthless study/conclusion. (also if google says they return x results, often when you go to the last page of their results listing you'll notice their total went down, and its more like x - 10%)
-Josh
While it is true that more results could mean worse filtering, that is a separate test entirely.
I tend to think that ordering is more important than filtering down to a small number of results, since having lots of results returned doesn't hurt if the search engine can order well so that what you want is most likely to be in the top 10-25. This is especially true when there will be at most a couple of results where I'd rather have the search engine try at the ordering and have me do most of the filtering because no search engine is as good as a person at really figuring out what people want, yet.
How many more times are you going to whore your site in your comments? Out of 9 comments you've made on /. 5 of them have included a link to your site in the body of the comment. If it's relevant, fine, if not then stick it in your sig or profile.
The very methodology used in this case seems rather incorrect to me.
The assumption (as stated in the paper): Since Yahoo claims to have indexed twice as much as google, searches should return twice as many entries.
That assumption is flat out incorrect. There are actually multiple problems.
First, the scope of the search (based on index terms) is really up to the search engine itself. Since each search engine does not return the entire database as search results, it is very much up to the individual search algorithm to determine the depth of entries considered to 'match' a set of terms. That's what is really being reflected in these results.. it is not the overall size of the index, but simply how aggressive the search algorithm is in matching terms to entries.
Even if the algorithms were identical (same algorithm being run across both indexes), the nature of search does not scale in that way. If Yahoo has, for instance, become more aggressive in indexing message board and forum content, then only searches that play to those subjects should return more results than Google. Since searches are by definition narrowing on a data set, a methodology needs to be developed that more effectively tests the BREADTH of the results more than simply testing the depth.
Turn s60 photos into awesome videos with mScrapbook for all S60 3rd edition phones!
I'd be much more interested to see a test of the quality of results. Considering that most of the results that I end up activating are on the first page, quantity of results is less relevant to me in determining a good search engine.
I'll take some of the heat off you. Let's burn some karma. Here we go. MODERATORS ARE STUPID FUCKERS.
The study only checked English words. Is it possible that the increase came from Yahoo expanding into more international website markets?
Just a thought
Basically NCSA's method assumes that if a search engine indexes twice the number of pages, then it will return twice the number of results for a given search. However, in order for this to be the case, the 10 billion+ more pages that yahoo indexes would have to be roughly equivalent to the pages that google indexes. If Yahoo is indexing 20 billion pages, but ten billion of those are in mandarin, then searching for random combinations of english words (which NCSA is doing) won't tell us which search engine indexes more pages. In order to trust NCSA's methodology we would have to know exactly WHAT the billions of pages that Yahoo knows about but Google does not are. Surely the web didn't double in size overnight, so Yahoo must be searching somewhere Google doesn't search if their claims mean anything (which they may not).
It's good to see that slashdot is FINALLY posting an article about Google.
In fact, all results that match a query are returned; it's the ranking that matters. Google is also more rigorous about excluding apparent duplicate results, and doesn't count those in the stats.
Seriously. 'We wrote a script and here are the results'? This would take an average PERL programmer what -- 30 minutes of work? Has academic research in computing really sunk to this level?
Maybe it's not even worth pointing out how badly flawed (and lazy) the underlying assumption of 'twice the results = twice the index size' probably is, as I'm sure we're going to see a few dozen posts to that effect (unless PageRank really means nothing), but at least I can complain about the slant they put on this, and how strong a conclusion they seem to derive.
Do you really even have to RTA? My search engine is still the same as before and works fine, no need to change now. Awe, is Yahoo jealous?
...though flawed in many respects. The raw number of pages returned may not indicate the size of indices. Google is famous because it returns *relevant* pages but not necessarily *more* pages. A search engine that returns its entire index with each search isn't all that useful.
:p
Secondly, results for all keywords may not increase with the size of the index. The pages which were indexed might correspond to popular searches (that return more than 1000 results, which were not considered if you RTFA) - so considering only those words that return less than 1000 results is flawed.
Though some competition is good, the "DO YOU WANT MY 20 BILLION BIG INDEX ???!!" claim by yahoo reminds me of certain yahoo chat rooms
yahoo forgot to index all /. dupes.
The assumption seems to be that search results are randomly distributed. But by the very nature of search (a targeted and subjective request for information) that is clearly the wrong model. I don't see why a 2x bigger index should be assumed to return 2x more results for any query with fewer than 1,000 results.
A better test would be to see how much overlap there was between queries. Do the top 50 returns on queries (of any size, not just limited to those with fewer than 1,000 returns) match? To within what percentage?
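A rough sketch of what such an overlap test could look like, assuming you already have each engine's ranked result URLs as plain lists. The function name `top_n_overlap` and the list format are made up for illustration; this is not anything the NCSA study did.

```python
def top_n_overlap(results_a, results_b, n=50):
    """Share of distinct URLs common to the top-n of both ranked lists.

    The denominator is the larger of the two top-n sets, so a list that
    comes back shorter than n does not inflate the score.
    """
    top_a = set(results_a[:n])
    top_b = set(results_b[:n])
    larger = max(len(top_a), len(top_b))
    if larger == 0:
        return 0.0  # neither engine returned anything
    return len(top_a & top_b) / larger


# Hypothetical result lists for one query, best hit first.
engine_a = ["u1", "u2", "u3", "u4"]
engine_b = ["u3", "u4", "u5", "u6"]
print(top_n_overlap(engine_a, engine_b, n=4))
```

This only measures whether the same pages show up, not whether they are ranked in the same order; a rank-aware comparison would need a different measure.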
Don't blame me - I voted for Howard Dean. http://dean2004.blogspot.com
Google started treating plurals as the same search about a year ago. Yahoo doesn't. So, if you google for "inkjet printers" and "inkjet printer" you will get the same result set; however, on Yahoo, you will get different results.
The net result is that for the same index size, Google will return more results. (And, IMHO, more meaningful ones.)
the number of results anyways? Who makes it to page 5000 when doing a search?
They presumed that for random phrases that return less than 1,000 matches, one can determine from the ratio of matches that Google returns to matches that Yahoo returns which engine has indexed more documents. This also presumes that the Internet is an infinite source of information about XYZ, and that there is always an indeterminate number of sources that remain unindexed on both engines. I don't think this is the case at all.
Say I write a page about Jabberwocky. I get together with people that write more pages about Jabberwocky, and all of us have on three domains information about Jabberwocky that exists nowhere else, except maybe Wikipedia under the Jabberwocky entry. If both sites index Wikipedia and those three domains (that link to each other), that's 100% coverage... barring horrible algorithms, you can't get less than this, or you get nothing at all.
Also, when you're looking around for such unique information, I have to imagine that it's not representative of other sources in more general searches.
-Rob
Biblical fiscal responsibility
So in the conclusion, the author writes that since Google displayed more results, based on their random test data, it was the superior search engine? That seems so wrong somehow...
Wouldn't a better search engine return less, but more appropriate results? I mean, how many of us have found the information we were actually looking for on page ten or twelve of a search. And, isn't less more, but better? %insert Linux geek laughs here%
One would think that volume of results would not a better search engine make, although it may indicate a larger engine index size; an explicit statement to that effect seems to be missing from the NCSA report.
-Runz
This is just another example in the age old argument of which is better. IMO, the quality of the search results is what matters more than the sheer quantity of information. One relevant find is more valuable than 100 inaccurate results. A test of accuracy might be more valuable and one that would be difficult to engineer. For instance, if I type in a word that has a direct correlating .com domain, that should be the first result (assuming no other words in the title - i.e. "hagrin" brings me my home page as the first result). I am sure a test of accuracy could be further derived from such logic.
The other side of the argument probably relates back to something my fiancee once told me - "Size doesn't matter, but it's the great equalizer when it comes to two guys not knowing what they are doing". Yahoo!, especially since the researchers couldn't perform queries on topics returning more than 1,000 results, may be indexing and crawling deeper into sites or it has a "double dipping" problem.
Either way, I don't see Yahoo! falsely reporting their numbers - I would tend to think that this "study" is highly flawed due to its exclusion of larger result topics, etc.
Hagrin.com
The research has several problems:
a. It measured the number of results for a certain query. Even if we assumed identical algorithms for checking whether a page matches a query, the two search engines are likely to use different relevancy thresholds.
b. The search pretty much limited itself to the English language.
c. As they admit themselves, they measured only obscure queries. Actually most of my queries are not obscure at all, and it takes me more than 2 words (which fit together) in order to chop down the search results group.
d. Finally, the entire research has very little to do with the really interesting question, which is: which search engine is more likely to give me the results I need on the first page?
Me.
What's this? A concise and well written summary with a link directly to the well written article? No twisting/breaking of the truth in order to incite /. groupthink comments? No pointless plugs for unrelated topics? No ADS?!?!
Jesus, the editors keep that up they might actually have a worthwhile site going....never fear, I'm sure the next dupe and/or an article comparing spooning to unmanned space travel will surface before the day's end.
Monstar L
The NCSA's test neither confirms nor disproves Yahoo's earlier claims. Their lesser average results may just indicate higher quality threshold -- Google's results beyond the second page are never useful either.
I'd say, it is kind'a early to claim "pants down, egg on face"...
In Soviet Washington the swamp drains you.
funded by Microsoft?
RTFA again for the best results.
Google not only gave MORE results, it gave BETTER results. The only bad results were some hairsplitting (if largely well meant) from fellow /.ers... (I mentioned Tuva as a suburb of Mongolia, and while it IS a part of the Russian Federation, it is Much More Mongolian than Russian. And if the rising tide of neoNazi scum in Russia get their way, Tuva could easily be cut adrift into the Mongolian/Chinese orbit...but I digress...)
The essential point is: Which Does the Job Better For Me? Google. Therefore, I use Google. Assuming the Copernican position that I am not atypical, I would therefore extrapolate that this is very true for most other people as well. Which means that Yahoo has a LONG way to go and A LOT more work to do.
RS
Shoes for Industry. Shoes for the Dead.
With each having billions upon billions of documents available and indexing more everyday who really cares?
For some reason I refuse to use either spell check or the spacebar properly.
Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google. Any search result found to have more than 1,000 returned results on either search engine was disregarded from our sample.
my question is which search engine required them to disregard their sample the most. Did google hit the limit the most or was it yahoo?
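If the raw per-query hit counts were available, that question could be answered with a simple tally. Here is a minimal sketch; the tuple format, the function name `split_sample`, and treating exactly 1,000 as over the cap are all assumptions, not details from the study.

```python
RESULT_CAP = 1000  # truncation limit described in the quoted study text


def split_sample(counts):
    """Split (query, google_hits, yahoo_hits) tuples into kept and dropped.

    A query is kept only if both engines report fewer than RESULT_CAP hits.
    Dropped queries are tallied by which engine hit the cap.
    """
    kept = []
    dropped = {"google_only": 0, "yahoo_only": 0, "both": 0}
    for query, google_hits, yahoo_hits in counts:
        google_over = google_hits >= RESULT_CAP
        yahoo_over = yahoo_hits >= RESULT_CAP
        if google_over and yahoo_over:
            dropped["both"] += 1
        elif google_over:
            dropped["google_only"] += 1
        elif yahoo_over:
            dropped["yahoo_only"] += 1
        else:
            kept.append((query, google_hits, yahoo_hits))
    return kept, dropped


# Made-up counts, purely to show the shape of the input.
sample = [("a b", 10, 5), ("c d", 1500, 20), ("e f", 30, 2000)]
kept, dropped = split_sample(sample)
print(len(kept), dropped)
```

If "google_only" came out much larger than "yahoo_only", the cap would be cutting into Google's side of the comparison more often, and vice versa.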
By the way I love google but I do think yahoo indexes more pages. It indexes personal pages more so than google does. So when I am searching for items which I know other people would point to I hit up google. But if I am searching for something that no one has a reason to link to (home page of your gf) I hit up yahoo.
Never could figure out why my girl liked my bitch tits, then I found out she was a lesbian.
Is it just me or does this study not sound convincing enough? There are too many holes in the way the study was conducted, IMHO. First of all, they restricted themselves to queries that return less than 1000 results? They've already limited the sort of queries they're executing by choosing those that return significantly fewer results than other "popular" queries.
Secondly, they chose random words to create a query. This doesn't give me the confidence that this belongs to the same space of queries that people execute on the average. It would've been great if they sampled their queries from those that people actually execute instead of just crawling the english dictionary.
Nevertheless, bigger is not always better. The reason why Google became such a phenomenon was because of the quality of their search results. Duh.
"Injustice anywhere is a threat to justice everywhere." - Martin Luther King, Jr.
a picture tells a thousand words
Mcdonalds obviously isn't hiring
Mod parent up.
You see? You see? Your stupid minds! Stupid! Stupid!
I was on a page reading about Windows Longhorn and Google showed me ads about Cattle in Texas I could buy... with all the 1337 hax0rz and ub3r geeks they have at Google, Inc, can they not fix the "context"
pooyak.com
Nice to take an anti-yahoo submission from a Google employee. I guess I should be happy they at least disclosed the conflict. It's more than you can say for someone like Bob "rove-puppet" Novak.
before someone else does.
Google only reports "about 4,820,000" entries for Britney Spears, while Yahoo reports "about 67,100,000" entries! This makes Yahoo more than 12 times better than google! Yeah, my methodology is completely fucked up... but then, so is the NCSA's!
I've abandoned my search for truth; now I'm just looking for some useful delusions.
If $SEARCH_ENGINE returns 1,000,000 results, and assuming I can sift through each result at an astonishing rate of 1 per second, it will take me 1,000,000/(60*60) = 278 hours, or 11 1/2 days to wade through the junk.
The number of results is largely irrelevant. Give me quality filtering instead. Fortunately, Google does that for the most part.
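The arithmetic above is easy to reproduce. A small sketch, with the function name `review_time` made up for illustration and the one-second-per-result rate taken from the comment:

```python
def review_time(results, seconds_per_result=1.0):
    """Return (hours, days) needed to look at every result once."""
    seconds = results * seconds_per_result
    hours = seconds / (60 * 60)
    days = hours / 24
    return hours, days


hours, days = review_time(1_000_000)
print(round(hours), "hours, or about", round(days, 1), "days")
```

That gives roughly 278 hours, a bit over eleven and a half days of non-stop reading.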
Search: Valerie Plame
Google: 908,000
Yahoo: 2,580,000
Search: "Boulder, Colorado"
Google: 1,600,000
Yahoo: 5,880,000
Search: "Linus Torvalds"
Google: 2,560,000
Yahoo: 5,870,000
I assume it goes on like this. Of course these exceed the 1000 maximum hit limit given in the study.
While it would be interesting to know how many pages the big search engines index this isn't a way to measure the size of them. I am ready to be proved wrong but as far as I can tell this is totally flawed.
The number of results given isn't a measure of the size of the search set unless you also know the algorithm being used. If both search engines use an algorithm designed to just find pages with the given word and return all pages, then this will work. However, that isn't necessarily the case. I imagine both google and yahoo will return a smaller set of pages at times of heavy load, or possibly if you screw about and do 10000 queries from the same IP address in 5 minutes. To prove the fact this experiment doesn't work, why don't they come and test my super wizzy search engine? It will give them 999 results for every query.
I used to have a better sig but it broke.
The interesting thing is that the top three results make no reference to the word failure. Of course it is probably based on pages linking to these three, but I wonder if they should even be included for the lack of the search term?
Jumpstart the tartan drive.
Teoma is better than google or yahoo, so i think the point is moot.
http://www.teoma.com/
Hyperlink spam
Looking at the first item in their result log, I'm unimpressed.
Yahoo returns 0 results, and Google returns... 4 different links to the ispell dictionary (or variants thereof).
('carbolization clambers')
True, but how many times, when searching, do you look past the first 1000 results? Heck, I rarely get past the first 20 or 30 before refining my search. I don't believe the usefulness of results past even the first 100 or 200 results should be considered when comparing search engines. An interesting survey would be how many pages deep a person will look when using search engines.
That's irrelevant in this case, certainly. This wasn't a judgment of what the best search engine is, but instead which search engine had more results. This was strictly quantity, and not quality.
They were testing Yahoo's claims to be indexing more pages than Google. They found the claim to be false. The quality of the searches wasn't the subject of Yahoo's statements or NCSA's testing of those statements.
That's not to say it isn't an interesting question, but that it really wasn't relevant to the article.
The study only addresses the issue of size of the indices and returned results. Understandable, and it certainly debunks Yahoo's claims, or at least, makes them irrelevant -- what good is a 19 billion-page index if you don't actually get any more search results?
But the real utility of a search engine is the relevance of those search results. Google has been successful because its search results are relevant to a large portion of its users. The real question when comparing search engines is, can one help you find what you're looking for faster than another?
Yahoo may have a huge index, Google may return more results, but neither metric alone will tell you which one you actually want to use for general internet searching.
Terrible example. Search for "http" ... MUCH more interesting. They don't even strip "http://" off the URLs when they do their scoring!
"Old man yells at systemd"
That makes sense, but it does stand to reason (or, at least, to my reason) that these queries that garner large numbers of results could have had a significant impact on the bottom line of the survey.
Well, there's a worse bias. They're grabbing words from an Ispell word list.
There are websites which contain the Ispell word list. There appear to be more of those returned in Google as results than in Yahoo. (here is one returned in Google for "apprizers expense", but which is not returned in Yahoo.)
This basically contributes a pedestal to their result - they'll never get zero results, because they'll always get the Ispell lists back, and because those results always return the same number (about 8 Google to 1 or 2 Yahoo), you'll bias the results of the entire set to that result.
They needed to remove results which are returned in common to multiple searches, as that's essentially double counting.
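A sketch of that kind of cleanup, assuming the result URLs for each query had been saved as lists. The dictionary format, the function name `drop_repeated_urls`, and the example URLs are hypothetical; the study's published counts may not include the URLs needed to do this.

```python
from collections import Counter


def drop_repeated_urls(results_by_query, max_queries=1):
    """Remove URLs that show up in the results of more than max_queries queries.

    results_by_query maps each query string to its list of result URLs.
    A page such as a copy of the word list, which matches many random
    word pairs, is dropped from every query it appears in.
    """
    seen = Counter()
    for urls in results_by_query.values():
        seen.update(set(urls))  # count each URL once per query
    return {
        query: [url for url in urls if seen[url] <= max_queries]
        for query, urls in results_by_query.items()
    }


# Made-up example: one word-list page matches both queries.
raw = {
    "apprizers expense": ["wordlist.example/ispell", "site-a.example/page"],
    "carbolization clambers": ["wordlist.example/ispell"],
}
cleaned = drop_repeated_urls(raw)
print({query: len(urls) for query, urls in cleaned.items()})
```

After the cleanup the second query is left with no results at all, which is the zero that the word-list pages were hiding.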
Maybe Yahoo indexes more useless pages than google does.
try it. for example search for "swans" : you got 1 510 000 results, the first one is the SWANS rock band site. search for "swan" then - 8 550 000 results, the first is some SWAN social network - the rockers are not on the first page at all
Deliriant isti Americani.
I'm not exactly sure why Slashdot would choose to publish such a poorly conducted study as this.
The entire experiment is founded on the idea that there is a strong, if not direct, correlation between returned results and index size, which is absolutely ridiculous. Given that each engine's search algorithms are so closely guarded, there is no way to tell what sort of correlation there is between the number of results for random queries and the searchable index size. Without addressing this issue, this article looks to be nothing more than part of the typical google fanboy fare posted here, and it's frustrating to say the least.
It's not the size of the index .. it's how well you use it! ... course a large well used index is even better. har har.
Have you ever tried to view the 10,000 pages that google returns? It's impossible. What's the point of saying that thousands if not millions of pages are found if only the first 400 can be viewed?
Do the simple test of searching for failure in google. Next, go to the very last page. Google claims there are 80,100,000 pages with failure but I could only view 899 pages while showing the omitted results. What and where are the other 80+ million?
There's an inherent assumption in the Yahoo claim that more==better. Do I really care if a search returns 1 million results vs 6 million results?
What I care about is actually getting the information I went out to find. There's only a certain amount of hits I'm willing to explore. That's probably on the order of 100-200 or so if I _really_ need the information. The implication by Yahoo is that more hits == better top ranked hits. Is that true? Really what should be done is just compare the top few hundred hits between the two search engines and see how they differ. Those are the only ones that matter anyway.
Where more results might prove useful is obscure searches with less than 100-200 hits. But if this study is true, Yahoo does a worse job on obscure searches than google.
The problem of course is the type of obscure searches that this study performed. Two random words out of a dictionary just isn't what your typical person conducting a search engine query is looking for.
AccountKiller
Should be "fewer" results, methinks. Doesn't NCSA have editors? Thought they were kinda up there as professionalism goes.
I don't know about 12 times. You've got to be realistic and do something that's up to date, as Britney is soooo last season. Jessica Simpson on the other hand occurs 3.78 million times in Google as opposed to 21 million in Yahoo. Google is gaining ground as Yahoo is now only about six times better than Google.
For some reason I refuse to use either spell check or the spacebar properly.
1. Assumes that Yahoo's expansion is random. If the increase in Yahoo's pages is not random, then the results may be skewed. For example, Yahoo's expansion may have been mostly, or even entirely, in pages built of common words that all receive more than 1000 hits upon searching.
2. Assumes, as many people have stated, that Yahoo's expansion has been in English, since the study uses an English dictionary for its seeds. If Yahoo has expanded its database with non-English pages that have few words overlapping with English, those pages will not show up in the study.
This study essentially determines that Google has a larger database of random, obscure English language words. Consequently, they demonstrate that Google is the superior search engine for finding obscure, random English words.
One additional check that they could have thrown in would be how many of the pages in the links presently deliver 404 errors. That would have been far more interesting to me than how well the search engines do at finding obscure and random English words.
If you search for "hnc software", the first hit is Fair Isaac. The Fair Isaac web page has no mention of HNC. And yet, this is appropriate because HNC Software no longer exists since Fair Isaac bought them.
Google does the right thing, it's googlebombers who are messing with your head.
My amazing wife - Artist, Author, Philosopher - Laurie M
Why is the NCSA cowering from the Google vs. Infoseek comparison?
(yes, yes, I know it uses the Inktomi engine too - would you have preferred a prodigy reference?)
Of course the study also demonstrates that on the searched terms, Yahoo's estimate numbers vastly overestimated the number of available results they actually found. So if the pages from the study are even close to representative in that regard then this would make the numbers you quote utterly meaningless.
Which is the entire reason, of course, why they kept the limits under 1,000 in the first place-- for any number over 1,000, if the search engine says, say, "I found 2.5 million results for 'Valerie Plame'", you have no way to tell whether it's telling the truth or not.
Irritable, left-wing and possibly humorous bumper stickers and t-shirts
Figures never lie, but liars often figure.... Now where's the (*&% calculator?
Well, anyway, here's a site that does a comparison by putting Google and Yahoo in frames and doing identical queries: http://www.googleguy.de/google-yahoo/ But I thought Yahoo also uses Google's results in their thing, although that might have changed. I remember a couple of years back Yahoo had a little thing on their search results that said "powered by Google". So of course Yahoo will have more than Google, since they are using Google's results along with their own. http://www.langreiter.com/exec/yahoo-vs-google.htm
This site also does comparisons, but it shows a nice little graphical thing. I think it shows how Google and Yahoo results overlap. Oh, and on a side note, I'm new here and I was wondering whether the term "slashdotted" means that a site was overwhelmed with traffic from Slashdot.
All your base are belong to Wii.
Specificity does. A good search engine will find all relevant pages. A great search engine will list the page you're looking for in the first ten results.
Often times, the only thing a bigger index does is make the user scan more results before finding the page they want.
The society for a thought-free internet welcomes you.
Personally I'd design an experiment where I also look at queries that return more results that can be checked, check the first 1000 and see if the ratios you end up with still seem to hold true. If they do, then the 1000 hit limit is an unnecessary constraint.
Your argument that the result is skewed by differences in "depth" of crawling etc. probably has little merit. For it to have any merit this would imply that global term frequencies in the search engine dictionaries would be skewed by "going deeper". There is no indication that this happens in any recent research I've seen.
Just to let everyone know, Chris DiBona (the poster) is in charge of the Open Source arm of Google.
This means he is the one that pays me for the Summer of Code, so be nice!
Just an FYI for those who want to try something other than Yahoo or Google. Look at Teoma (teoma.com). I've been using it for a while. Seems to work pretty well though I'm not sure if it's quite as good as Google.
Here's a little blurb from their website:
Teoma's History
Teoma was founded in 2000 in Piscataway, New Jersey by a team of scientists from Rutgers University. Teoma means "expert" in Gaelic. Ask Jeeves, Inc. acquired Teoma in September 2001.
And, no, I don't work for Teoma.
Search for:
+the * *
Yahoo returns more results.
I like http://www.clusty.com/. This meta search engine clusters results according to relevancy.
This research does not even hold up, for what it's worth. First of all, Yahoo may index X pages, but that does not mean it will display X results based on the X pages indexed.
Using the Googlewhack search for "passalong louse",
Google returns 647 results for passalong + louse, but look at the results it returned. It has "passa over", "passa-long", "passalong", etc.,
while Yahoo returns only 3 results, all containing only "passalong".
So accuracy-wise, Yahoo wins on this.
Yahoo has some 20 billion items and Google has some 8 billion pages. pages != items. I say that we query Yahoo! for how many pages they really have. Not items.
This study assumes that both Yahoo! & Google rank pages the same way. WRONG!
Google's methods for determining a page's relevancy to a search term vary widely from Yahoo's methods.
In order for this type of study to have any validity, identical ranking methods would need to be employed over both indexes.
With Google for a page to be found, other pages that reference the page may contain the requested words, but not the returned page itself.
One of the lecturers in my department used to include a copy of the ispell (or maybe aspell) word list on his site, in a random order for the data structures module. The coursework in this module consisted of putting the word list into various data structures and searching / sorting it. One year, he got a visit from Interpol. Apparently they found a particular sentence in the middle (by using a search engine) which appeared to be related to some form of organised crime. Now he zips the wordlist...
I am TheRaven on Soylent News
Many of the points made in the comments so far and what it would take to get accurate numbers for comparative purposes are mentioned in the Search Engine Watch Blog post from last Thursday.
http://blog.searchenginewatch.com/blog/050811-231448
When I google for "jesus site:holy-bible.us", Google returns no results.
I could have sworn that Jesus is mentioned in the bible somewhere...
My guess is that yahoo just indexed a lot of data from searching other search engines with bot spiders.
After searching for myself in Yahoo I found a site I had about 8 years ago, and that is already dead (was in geocities) for more than 4 years. Even more interesting... the contents of the site were the first version, not even the one in the site when it was closed.
Size doesn't count, it's how you use it. Or so I've been told. Er, um, .... of course *I've* never been told that, I'm just repeating what others say.
Move along now.
> > >We don't need no steeekin'.....oh wait, my wife says we do.
Google automatically includes stemming in searches, but not necessarily at the same ranking of the original search term. So while searches for "inkjet printer" and "inkjet printers" will not return the same results list, many of the results from each will be included somewhere in the results list of the other.
Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.
Just don't take that payment in Google stock. Chris DiBona is the kiss of death for a corporation: he has NEVER worked for a profitable company and several have tanked while he was there. He is the kiss of death for a company.
The most basic measure of performance in Information Retrieval is precision vs. recall.
Precision is how many of the results that you return are correct. e.g. If Google returns 100 results and 10 of them are correct, then the precision on that query is 10%.
Recall is how many of the correct results you return. e.g. If Yahoo returns 100 results out of a total 1000 correct matches, then the recall on that query is 10%.
Information retrieval systems such as search engines balance these two metrics -- which are fundamentally at odds with each other -- to give the "best balance" in the eyes of the system's designers.
The NCSA study basically misses the effect this decision would have on perceived size of index.
A simple demonstration shows how it works.
First, let's say both search engines have the same index size: 10B pages. Second, let's say both search engines have exactly the same a priori capability for precision and recall, but can tune for a preferred performance. Yahoo decides it wants to favor more precise results over more results recalled, at a 2:1 relative ratio compared to Google.
In that case, any given query will show half the hits from Yahoo as compared to Google. Concluding Yahoo's index to be half the size of Google's, given this result, would be incorrect.
Furthermore, without knowing the precision/recall performance of either system, they can only demonstrate a lower-bound on index size, and that certainly doesn't predict average or max index size.
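To put rough numbers on that argument, here is a minimal Python sketch. The helper names (precision, recall, hits_shown), the 900 matching pages, and the 0.8 / 0.4 tuning values are all made up for illustration; nothing here reflects how either engine actually works.

```python
def precision(relevant_returned: int, total_returned: int) -> float:
    """Fraction of the returned results that are correct."""
    return relevant_returned / total_returned


def recall(relevant_returned: int, total_relevant: int) -> float:
    """Fraction of all correct results that were actually returned."""
    return relevant_returned / total_relevant


def hits_shown(matching_pages: int, recall_setting: float) -> int:
    """Hypothetical model: an engine shows only a tuned fraction of the
    pages in its index that match the query."""
    return round(matching_pages * recall_setting)


# The two worked examples from the comment above.
google_precision = precision(10, 100)   # 10 correct out of 100 returned
yahoo_recall = recall(100, 1000)        # 100 returned out of 1000 correct

# Assumption: both indexes hold the SAME 900 matching pages for some query,
# but Yahoo is tuned to show half as many of them as Google (the 2:1 ratio).
matching_pages = 900
google_hits = hits_shown(matching_pages, 0.8)
yahoo_hits = hits_shown(matching_pages, 0.4)

# Reading index size off hit counts would wrongly suggest a 2x larger index.
naive_index_ratio = google_hits / yahoo_hits

print(f"precision={google_precision:.0%}, recall={yahoo_recall:.0%}")
print(f"Google shows {google_hits}, Yahoo shows {yahoo_hits}, "
      f"naive index ratio {naive_index_ratio:.1f}")
```

Same index, same matching pages, and the hit counts still differ by a factor of two, which is all the study would see.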
This really does not work out well. The idea behind doing any kind of research is to eliminate all the variables except the one that you are testing. That way you are not trying to compare apples to oranges. Unfortunately this 'study' failed to take into account the system for returning results, the system for indexing the pages, and the system for applying weight to something. Google (my favorite search page) tends to return results because something that links to that page or is linked from that page matches your search. Yahoo however does not seem to do that even 1/10th as often. This could account for lots and lots more results. In fact I am sure that I could build a search engine that could index less than a million pages and turn out more results than Google every time if I make my search engine open enough in the way it returns results. Personally I am more upset about this than the stuff that is obviously opinion. When it's opinion only fools (and there are plenty of them on this site) mistake it for science and fact, but this is like watching a Michael Moore movie.
Whenever I am searching I rarely notice a difference between Google and Yahoo's results, at least on the first couple of pages. Yahoo may claim to have twice as many indexed pages however I have yet to see any results in my queries.
I did a few spot searches myself and one thing that makes a huge difference is that Google does "smart" searching: if you type in a phrase and Google suggests that you meant something else, it will search for that as well and combine the results. This would give Google a larger result set. Therefore it is impossible to determine whose index is bigger, because the way they build their search results is inherently different.
If I'm searching for something, I want to find it. I don't want to have to search through extra data. This article seems to point out that it is "estimating" the amount of results returned. Which I think is un-important. What I think is important is the validity of the results to the query I type. I don't see how these figures show "Quality" any more than "Quantity".
I agree that searching random words can't be considered a real test. For a real-world test (with not too many results) I searched for my first name and surname on both Yahoo and Google and found that the number of results was quite similar, but curiously some websites were listed only on Yahoo and some only on Google.
It could be interesting to gather statistics on that for a large number of people. Anyway, better two search engines than one!
.... or Google stopped publishing certain results which are unlikely
Consider this search --
Terms: centerable's heterolecithal
Google totals:
Duplicates Omitted Estimate: 3
Duplicates Omitted Total: 3
Duplicates Included Estimate: 3
Duplicates Included Total: 3
Yahoo totals:
Duplicates Omitted Estimate: 0
Duplicates Omitted Total: 0
Duplicates Included Estimate: 0
Duplicates Included Total: 0
I typed the term "centerable's heterolecithal" into both Google and Yahoo and they both return 0 results. The author claims 3 from Google ????
Hope he hasn't doctored any results to show in favor of Google.
Anyways this whole survey was a pointless waste of time and to add to it my time too.
Please mail me $50 for my time wasted.
I am interested in results. I understand that more pages indexed normally means a better result. However, I just want to get to the information I search for, whatever it is.
http://vivisimo.com/ is an engine I like using, because there you can get to the things that you want rather quickly, without needing to look through pages and pages of non-relevant results.
Don't fight for your country, if your country does not fight for you.
The point (IMO) is how many of the results are useful for my search.
How many of the results (the comparison by NCSA says A gives xx.x% more than B and such bs) are *really* what i was searching for?
Rhetorical answer: very few.
Even worse, many of them today are search engines that point to search engines that point to search engines and so on, in a meaningless loop.
The main richness of a (let's say) Yahoo or Google or such, is no more in the number of indexed objects (hey, we're still talking about billions of items; one more or one less doesn't change the matter), but in correctness of the answers.
Queries for all-night love machines with dicks over 12" in length returned several hundred results!!
Actual mileage may vary!
Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.
...if you do a search for something on Google and it comes back with a small number of results, and you get to the last page, it often says "In order to show you the most relevant results, we have omitted some entries very similar to the 8 already displayed. If you like, you can _repeat the search with the omitted results included_." So the dupes are there to be had if you want'em.
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
Do people even yahoo bomb? The price of being the biggest.
Brain(s): 0.0% user, 1.3% system, 0.1% nice, 98.6% idle
Yeah, except that Yahoo's estimates are bogus. See for yourself: search for 'arabesque hard disk screws' or some other rather obscure term. Yahoo reports about 602 results for this example. Now click through the result pages, click "repeat with the omitted results included", and again click through the result pages. Where do we end up? A lousy 120 results! 602 results my @$$! And this is just one random example, I tried many!
Yahoo says it will give you 418 "Ronald Hendrickson" pages, but only gives you 110.
Yahoo gives you the full 175.
My amazing wife - Artist, Author, Philosopher - Laurie M
Perhaps this is biased? An American Civil Liberties Union supporter with a personal interest in "race relations" must certainly have a nitpick with Yahoo after the lawsuits of the organization against Yahoo, and what with Nazi memorabilia being posted and so forth. Perhaps that was the motive behind the slant in this research?
Let's settle this once and for all - which is the better search engine - in the only way possible:
GoogleFight!
Clearly the study is Google biased. Why look at these results from Google. Then, look at these results from Yahoo. Clearly, Yahoo is the winner with more results.
Oh wait. Um, I guess that's not a good example. In this case I would take Google due to fewer results. No, no, I would take neither.
If you're wondering, I'm watching Family Guy right now. Yeah, that's my excuse. What's yours you pervert?
... the number of pages that are "indexed" and those that are actually IN the "index"... For example, using a weblog processing tool I wrote I discovered that search engines will frequently access the same page over and over again... So Yahoo may be calling their "index" size the number of items that have been indexed... rather than the number of items actually in the index... For some this is a semantic difference, for others "truth economics"... Also, when you check the estimated pages for any particular site for example: http://www.google.com.au/search?hl=en&safe=off&q=site%3Aslashdot.org&btnG=Search&meta=
and:
http://search.yahoo.com/search?p=site%3Aslashdot.org&prssweb=Search&ei=UTF-8&fr=FP-tab-web-t&fl=0&x=wrt
you can see that they tend to vastly overestimate the number of pages that a site has...
A more useful way of estimating the index sizes would be to use the "site:xyz.com" searches for both... Of course the robots.txt file for each site would need to be considered however in case the webmaster(s) have decided to lock out a particular engine...
While the Perl script is both nice and readable: http://vburton.ncsa.uiuc.edu/compare.txt
The log results are shown here: http://vburton.ncsa.uiuc.edu/searchresultlog.txt
For instance, the following queries were supposed to give 5+ results on Google and no results on Yahoo, so let's see if that actually works...
Sometimes you get no results on Google "on the first tries", go figure... "that server is down/busy?!"
If you get any results, they are the same thing repeating over and over: ispell dictionary word lists!
I don't know about you but that's pretty useless...
Also, the fact that both search engines limit you to the first 1000 results is pretty useless: how can we know for sure there are 100000+ results for apple if, after page XYZ, results are truncated?
Here are some queries:
http://www.google.com/search?hl=en&q=carbolization+clambers
http://www.google.com/search?hl=en&q=anecdote%27s+displosion
http://www.google.com/search?hl=en&q=centerable%27s+heterolecithal
http://www.google.com/search?hl=en&q=unobservable+Oistrak
http://www.google.com/search?hl=en&q=misanthropizes+multiplications
http://www.google.com/search?hl=en&q=buttonmould+gradated
http://www.google.com/search?hl=en&q=myocardiograph+overheard
http://www.google.com/search?hl=en&q=pinions+platitudinize
http://www.google.com/search?hl=en&q=sloppiness+coeducationalizes
I don't know what this article is crapping on about but in my (admittedly limited) test i got far more results using yahoo than google.
This is not a valid study. The biggest issue is that it selectively chooses search terms which yield fewer than 1000 results. For instance, the query ipod would not count. But perhaps "my monkey broke my ipod in my camry" would count. Point is, the study proves that Google has more results for very obscure queries that yield very few results (under 1000). Give it to the NCSA: they did the best they could to do a study with whatever information was available. It got people to at least think about it.
Unfortunately, both the Yahoo! and Google search engines truncate results returned to the user after 1,000 results. Thus, for the purposes of this study, we were forced to restrict our searches to those queries that returned less than 1,000 results on both Yahoo! and Google. Any search result found to have more than 1,000 returned results on either search engine was disregarded from our sample. [3]
So. Let's say that Yahoo! and Google didn't restrict the results to 1000. Let's say that some search returns 1,000,000 results MORE on Yahoo! than on Google. Let's say this happens many, many times.
This entire study would be invalidated.
Since you can't know how many results there actually were for each over-1000-result search, how can you tell which engine has more pages indexed?
If Yahoo! had 20,000,000,000 pages indexed and Google only 9,000,000,000, and one term appeared in every single page, you would get a result of "1000" for each engine, even though the real difference is 11 billion!
Flawed? I think so.
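A tiny Python sketch of that thought experiment, for anyone who wants to poke at it. The 1,000 cap is from the study's own description; the function name visible_results and the "one term on every page" scenario are just the hypothetical from this comment, not measured data.

```python
RESULT_CAP = 1000  # both engines stop listing results after this many


def visible_results(true_matches: int, cap: int = RESULT_CAP) -> int:
    """What a user can actually page through, given the truncation."""
    return min(true_matches, cap)


# Hypothetical: a term that appears on every indexed page.
yahoo_index = 20_000_000_000
google_index = 9_000_000_000

yahoo_visible = visible_results(yahoo_index)
google_visible = visible_results(google_index)

# The observable counts are identical; the real gap is invisible.
hidden_difference = yahoo_index - google_index

print(yahoo_visible, google_visible, hidden_difference)
```

Anything above the cap collapses to the same number, so those queries carry no information about which index is bigger, which is why the study had to throw them out.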
I am scientifically inaccurate.
Yahoo appears to report more hits than it has.
For example, try out "dorani". I do not know what it means, but it's a good choice as it shows ~20000 hits. Clicking on the next result page, we can see the full results.
Yahoo(dorani), 1st page: 17,000
Yahoo(dorani), 6th page: 16,800
Yahoo(dorani), 90th page: 4,220
Google(dorani), 4,330
I have seen this pattern consistently for terms bringing between 15,000 to 100,000 results in Yahoo.
When Yahoo is asked to show the results, they diminish.
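For what it's worth, the shrinkage can be worked out from the figures quoted above. This is only a back-of-the-envelope Python sketch; the dictionary layout and the helper name shrink_fraction are my own, and the numbers are just the four readings from this one "dorani" search.

```python
# Yahoo's reported estimate for "dorani", keyed by the result page viewed.
yahoo_estimates = {1: 17_000, 6: 16_800, 90: 4_220}
google_count = 4_330


def shrink_fraction(first_estimate: int, final_estimate: int) -> float:
    """How much of the initial estimate disappears by the last page viewed."""
    return 1 - final_estimate / first_estimate


first_page = min(yahoo_estimates)
last_page = max(yahoo_estimates)
shrink = shrink_fraction(yahoo_estimates[first_page], yahoo_estimates[last_page])

print(f"Yahoo's estimate dropped by {shrink:.0%} "
      f"between page {first_page} and page {last_page}")
print(f"Final Yahoo figure vs Google: {yahoo_estimates[last_page]} vs {google_count}")
```

One query proves nothing, of course, but running the same arithmetic over a batch of terms would show whether the pattern holds.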
As mentioned in other posts here, Google's search results include pages linked from pages containing the search term[s]. That is, the documents in the search results may not themselves contain the search terms. If Yahoo's search results do not include such pages, then we can expect a systematic bias in favor of Google when counting the number of pages indexed based on the number of search results.
The point is, the study does not check whether the pages referenced in the search results do indeed contain the search terms. This extra check should be fairly easy to do with a small random sample.
The results of the sample can then be used to calculate a bias coefficient (scaling factor) for each of the search engines.
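Roughly what I have in mind, as a Python sketch. The sample data below is invented purely to show the arithmetic (nobody has run this check as far as I know), and containment_rate / corrected_count are hypothetical helper names.

```python
def containment_rate(sample_contains_terms: list[bool]) -> float:
    """Fraction of sampled result pages that really contain the search terms."""
    return sum(sample_contains_terms) / len(sample_contains_terms)


def corrected_count(reported_hits: int, rate: float) -> float:
    """Scale a reported hit count by the bias coefficient from the sample."""
    return reported_hits * rate


# Invented samples of 20 result pages per engine, hand-checked for the terms.
google_sample = [True] * 14 + [False] * 6   # 14 of 20 contain the terms
yahoo_sample = [True] * 19 + [False] * 1    # 19 of 20 contain the terms

google_rate = containment_rate(google_sample)
yahoo_rate = containment_rate(yahoo_sample)

# Invented raw hit counts for one query, then the bias-corrected versions.
google_corrected = corrected_count(200, google_rate)
yahoo_corrected = corrected_count(100, yahoo_rate)

print(f"Google: rate {google_rate:.2f}, corrected {google_corrected:.0f}")
print(f"Yahoo:  rate {yahoo_rate:.2f}, corrected {yahoo_corrected:.0f}")
```

With made-up rates like these a raw 2:1 lead shrinks to about 1.5:1, so the correction could matter; the real coefficients would have to come from an actual sample.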
I checked out the NCSA home page and I didn't find anything about advertising assistance for Google vs. Yahoo. This is a waste of money. Who pays for these people?
What's next? Is NIH going to do a taste test between coke and pepsi???
Yahoo! puts the interest of their advertisers above the interest of their users. Google serves their users first -- and, by doing so, attracts the eyeballs with which to gather advertisers even without using dirty tricks to get their ads viewed.
Okay try some large results searches.
"http" 2.1 billion Yahoo claims, 2.36 billion Google claims, and Google was back in half the time.
But Yahoo gives you a page titled "Hypertext Transfer Protocol Overview www.w3.org/Protocols", Google gives you the Microsoft website (huh?) in first place.
Hmm, after trying a few others I've decided I need a simpler ranking system. As the results are far too evenly balanced to call.
Yahoo gives me a higher ranking for my own name than Google, so clearly it has a better algorithm; anyone who disagrees will have to fight my ego.
We see interesting results by searching for "yahoo search" in google and "google search" in yahoo.
By the way this is only the first step for building great search engines as outlined in http://slashdot.org/comments.pl?sid=154275&cid=12948223
Slashdot = Sarcasm
Simple exercise: search "aleut handshaking" on both engines - no quotes. Google gets 114 hits, Yahoo gets 32. Yay Google. Now take a closer look at those hits.
Yahoo is better than Google at blocking out these "Search Engine Optimizers" aka spammers.
Never assume Google always returns the best (most accurate) set of results. Example from real life: just 10 minutes ago, I wanted to know the weight of the Head Ti Radical tennis racquet that Andre Agassi formerly used. My query was "agassi head ti radical weight". After going through four pages of results from Google, I couldn't locate the info. Tried Yahoo! and its very first result was SPOT ON!
You're stupid. The links are clearly marked as sponsored links.
I'm sure that the relevance math involves some complicated formulas, but simply determining a "match" is simple. Does the document contain the search terms? Yes or no. Google finds more results. That leads me to conclude they have more pages indexed.
Joseph?
...to this question is simply for Yahoo and Google to print a copy of their respective caches. Then, assuming that the same font/size is used, it should be easy to identify which is larger.
Honestly, sometimes it's real easy to over complicate matters.
Marked, yes. Clearly, no. I'm simply a casual reader in this context -- I don't notice things which aren't in the way of my eyeball, and the way Yahoo! formats their notice of those results as sponsored, said notice isn't.
The one mistake these researchers are implicitly making is that they assume Yahoo and Google are both using the same search algorithm.
Perhaps Google is just better at matching query strings to results, because it finds more relevant results. Or perhaps Yahoo is better because it excludes more irrelevant results.
Either way, this says nothing about the size of the database. In information theory, there are these terms, "precision" and "recall". I forget exactly what they mean, but they have something to do with how many results you get that are correct compared to how many results you get and something to do with how many correct results you get compared to how many correct results exist in the whole index. Something like that. Anyhow, surely, Yahoo and Google will differ, and THAT is what we're measuring here.
www.googlefight.com
Yahoo wins, 295,000,000 to 270,000,000
Oddly, a manual Yahoo search yields:
Yahoo wins, 866,000,000 to 473,000,000
One wonders if the methods of this paper err in assumptions about the types of content being indexed. If the increase in pages indexed by Yahoo is due to formal, published content, or non-English content, or (pick an option), then it might not translate into more hits given obscure word combinations. That is because the additional content isn't a random selection of possible web pages.
Just a thought.