Yahoo Passes Google in Total Items Searched
tonyquan writes "Yahoo announced today that its search engine passed Google's for overall capacity, with 20 billion documents and images indexed versus 11.3 billion for Google. Observers had previously pegged Yahoo's index at just 8 billion items. The growth is due to a recent expansion effort. More info can be found on the Yahoo! Search blog and at CNet."
My google-fu isn't bad, but I sometimes have trouble finding relevant results. I figure adding 9 billion more possible results should complicate things quite nicely.
It's interesting to see that Yahoo! may have surpassed Google on this metric. Over the past decade, Yahoo! has beaten other "hares" to date, including AOL and Microsoft's MSN. They're doing some innovative stuff, but also have some areas to catch up on. More here: http://mp.blogs.com/mp/2005/08/on_the_merits_o.html
Now all Yahoo has to do is create a real search engine that can actually spew out relevant results amongst those 20 billion entries...
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
...now it'll be even harder to find anything on Yahoo! Google keeps and holds its users because searches *work*. When I search for something, Google has a very high chance of giving me what I want in 4 pages or so. Yahoo! isn't as good at getting me the information I want. The problem might even be made *worse* with all these pages. Yahoo! has never said, AFAIK, how it ranks pages, but Google does it better. With this wealth of data, the ranking system is going to be under much more scrutiny at picking the right pages.
That's not a bad thing. There are a lot of useless pages out there, and having twice as many pages in the index certainly does not mean twice as many useful pages.
I am glad to see the search engine wars are on and competitive.
Why isn't programmer efficiency measured in KLOCs? Because quality is more important than quantity when used as the only metric.
I don't believe that volume of pages is really a relevant metric to be used in the case of search results. With an infinite number of pages the real metric comes down to relevance.
Stay tuned for new sig...
Are those 20 billion documents the actual spam I've received at my Yahoo mail account since 1994?
- useless blogs and geocities "websites": 12 billion
- clipart, midi and hideous backgrounds for above websites: 6 billion
- links to outdated or expired user sessions: 1 billion
- real content: 1 billion, if lucky
The only thing I ever use Yahoo for is pinging yahoo.com when my internet connection seems slow or dead. It's just been a habit since the '90s.
I agree about Froogle. Usually over 90% of all items can't be sorted by price even though the engine was clearly able to determine what the price was. How is it being froogle if you can't easily figure out which is the cheapest?
The Yahoo! crawler (Slurp) is definitely more aggressive than the Googlebot. It comes knocking on my door several times a day, especially the blog pages. Google is more conservative and keeps things in a sandbox, too.
Results:
Google: "1-10 of about 3,120,000,000 .06 sec"
Yahoo: "1-10 of about 11,300,000,000 .08 sec"
Top Yahoo hit - some punk band. Top Google hit, apple.com.
Gee, who do you think will make more money with those results... ;-)
This issue is a bit more complicated than you think.
The increase can be explained by Yahoo adding Slashdot dupes to their index.
I've spent the last few days doing some very important searching - we're thinking about launching a new product in a rather arcane field, and I wanted to be absolutely certain who the potential competition might be - hence I decided to search both Google & Yahoo!.
Guess what? Yahoo! search beats Google search, hands down. Not even close.
Two thoughts:
Nonsense.
Search: Google's PageRank concept radically changed the way that search engines determined which results were relevant. While previous services were based on human rankings or on how many times a particular word was listed on the page, Google put out an automated system which was able to deliver more relevant results when confronted with normal sites and was, by its very design, much harder to exploit with SEO techniques. Further, Google continually tweaks the parameters of their search -- if you can go to one of Norvig's talks about the sorts of stuff they do, it's amazing.
Maps: That interface -- scrolling, markers, and all -- is done entirely in JavaScript. No plugins, no Flash, no helpers. Nobody thought that that sort of thing was even possible.
GMail: I don't use it, so I can't comment. But I do have around 1 GB of email on my primary account. When you use email for serious work, it can add up.
Google Groups: It's my group reader. I like it because it shows the discussions in thread format from the top and suppresses the quoting that can make USENET discussions turn into pages and pages of greater-than symbols.
As to your assertion that Google hasn't ushered in a new age, I disagree. Ten years ago, when someone wanted information they went to a library, an encyclopedia, or maybe a CD-ROM. Now, any time anyone wants to know anything, they go immediately to Google and chances are that the information will come up on the first page.
Lest you've forgotten, it was Napster and Winamp that popularized mp3's, not the iPod, and COBOL, not Oracle, that popularized the database. So I'd respond to you, "Stop the misinformation campaign."
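For anyone wondering what "ranking by links rather than by word counts" looks like in practice, here is a minimal sketch of the textbook PageRank power iteration. It is only an illustration of the published idea, not Google's production ranking; the toy link graph, the page names, and the damping value of 0.85 are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links: dict mapping a page name to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "keyword_spam" links out but nobody links to it.
toy_web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "keyword_spam": ["hub"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:14s} {score:.4f}")
```

In this toy graph the page nobody links to ends up at the bottom no matter how many times it repeats a keyword, which is the sense in which link-based ranking is harder to game than word counting.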
I used to read Caltizzle. I was a lot cooler than you.
... how come no one is?
Where else can I find the likes of Y! Calendar / Mail / Address book, all integrated, for free? Point me there and I might jump ship.
GMail is great for email, but its address book is a POS, and there is no calendaring whatsoever. Meanwhile, over at Y!, I have a calendar that not only shows me the weather forecast for the week embedded into it, but also issues me reminder notices via Y! IM for important dates.
Not to mention the vast usefulness of other Y! services like Launch! and Y! Photos.
Google may be leading the way as far as search, maps, and email go, but for other services, *they* are the ones playing catch-up. For example, see their "Customized" home page, which http://my.yahoo.com/ beat them to about 3 years ago.
In other news, some guy wrote a Slashdot post and linked to his own blog. Whoa, shocker. 9_9
I've noticed that Yahoo's crawl visits my site more frequently...but Google's crawl seems to be intelligent about how often it crawls.
If I update a lot, Google crawls more. Yahoo doesn't seem to care.
So all these folks talking about yahoo being better may be off the mark. Why crawl all the time when you can only crawl when necessary?
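The "crawl more when the site changes more" behaviour described above can be sketched as a simple adaptive recrawl interval. To be clear, this is a guess at the general idea, not how Googlebot or Slurp actually schedule visits; the function names, the halving/doubling rule, and the one-hour and one-week bounds are all assumptions for illustration.

```python
def next_crawl_interval(current_hours, page_changed,
                        min_hours=1.0, max_hours=168.0):
    """Return the wait (in hours) before the next visit to a page.

    Hypothetical rule: halve the interval if the page changed since the
    last visit, double it if it did not, and clamp to [min_hours, max_hours].
    """
    if page_changed:
        proposed = current_hours / 2.0
    else:
        proposed = current_hours * 2.0
    return max(min_hours, min(max_hours, proposed))


def simulate(change_history, start_hours=24.0):
    """Apply the rule to a sequence of 'did the page change?' observations."""
    interval = start_hours
    schedule = []
    for changed in change_history:
        interval = next_crawl_interval(interval, changed)
        schedule.append(interval)
    return schedule


# A frequently updated site gets visited more and more often...
busy_site = simulate([True, True, True])
# ...while a static site backs off until it hits the one-week cap.
quiet_site = simulate([False, False, False, False])
print(busy_site)
print(quiet_site)
```

A fixed-schedule crawler would simply ignore the `page_changed` signal, which matches the "doesn't seem to care" behaviour described for Yahoo.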
So.. Yahoo is mature and Google is not because Google's news service reprints many and varied websites-- but not some of the "blogs" you like-- and Yahoo's news service reprints Reuters? I'm not entirely sure what's going on here but it sounds like you are misinterpreting some kind of personal poor experience with Google's sales department as an actual problem.
Google and Yahoo news do not even offer remotely the same kind of service, nor are the services equal in importance. Yahoo News is almost closer to the core of Yahoo's service than even the search; Google News is more auxiliary from Google's perspective, and I don't think they're even getting much money off of them.
Anyway, frankly IMO "blogs" shouldn't be on google news anyway. Period. If I wanted a blog aggregator, I'd go to a blog aggregator. Google News is a news aggregator. The difference may mostly be only in terms of what the aggregated sites choose to identify themselves as, but that's enough of a difference for me.
As for AdSense, the categories based on which things can get classified as inappropriate for AdSense are extremely broad and if you're expecting close attention paid to border cases, I think you're expecting things of the service that the service never intended. And if the person your complaint here concerns is Michelle Malkin...? Well, from what I've read of her stuff, if you're trying to defend her against accusations of racism then some article about Nelson Mandela would be only the tiniest part of the problem.
Don't be surprised if, in a few more years of broadband development, Yahoo is able to position itself as an alternative to many cable TV providers.
Wait, wasn't this exact same prediction being batted around, like, five to seven years ago? And didn't it fail to work out then either? Hm, you are a blogger, aren't you.
Irritable, left-wing and possibly humorous bumper stickers and t-shirts
So, In Firefox tab A, I have Google and tab B is Yahoo. Both searched on Kyzyl.
Results (please pay attention because htmling this was a pain...):
Yahoo's first 5 entries:
* All Russia Hotels All Russian Hotels - We offer discount hotel reservation services online in Moscow, St. Petersburg, Kiev, Russia, Ukraine, CIS and Baltic. www.allrussiahotels.com
* Tuva Travel Kyzyl city is the capital of Tuva Republic (Russia) Kyzyl city is positioned right in the center of Asia, which is proudly claimed by a local monument specifically dedicated to this fact. www.sokoltours.com
WEB RESULTS
1. Wikipedia: Kyzyl
Wikipedia Free Encyclopedia's article on 'Kyzyl' en.wikipedia.org/wiki/Kyzyl
- More from this site - Save - Block
2. Weather Underground: Kyzyl, Russia Forecast ... Updated: 8:00 AM KRAST on August 02, 2005. Observed at Kyzyl, Russia (History) Elevation: 2064 ft / 629 m ... Coming soon: Flash Stickers. Kyzyl, 63 F / 17 C ...
Find the Weather for any City, State or ZIP Code, or Airport Code or Country. Email. Password. Maps. United States. International. Information. Refinance Rates. GoTo Meeting. Kyzyl Singles. Hosting Companies. Online deals! Vitamins. Internet Mall
www.wunderground.com/global/stations/36096.html
- 64k - Cached - More from this site - Save - Block
3. AllRefer.com - Kyzyl (CIS And Baltic Political Geography) - Encyclopedia
AllRefer.com reference and encyclopedia resource provides complete information on Kyzyl, CIS And Baltic Political Geography. Includes related research links. ... By Alphabet : Encyclopedia A-Z - K. Kyzyl, CIS And Baltic Political Geography ... Kyzyl or Kizil[both: kizil'] Pronunciation Key, city (1989 pop ...
reference.allrefer.com/encyclopedia/K/Kyzyl
- More from this site - Save - Block
Now, for the first five Google Results on Kyzyl:
Kyzyl'-administrative center of Republic of Tuva, Russia Kyzyl' Republic of Tuva, ... Republic Capital:, Kyzyl. Capital Population:, 91000( at 01/01/94) ...
|Central-Chernozemny|
members.tripod.com/~argun/kyzyl.htm
- 5k - Cached - Similar pages
Kyzyl on Encyclopedia.com ...
Kyzyl or Kizilboth: kzl, city (1989 pop. 85000), capital of Tuva Republic, S Siberian Russia, on the Yenisei River. It services motor transport and has
www.encyclopedia.com/html/K/Kyzyl.asp
- 47k - Cached - Similar pages
Kyzyl Travel Information. Photos, Stories and Diaries about Kyzyl
Sustainable Tourism for independent travellers (travelers) and backpackers. www.worldsurface.com/browse/location.asp?locationid=5654
- 59k - Cached - Similar pages
Kyzyl, Tuva, Russia current local time ...
Kyzyl, Tuva, Russia - before placing a telephone call or making travel plans for a flight or hotel, get the current local time provided by
www.worldtimeserver.com/current_time_in_RU-TY.aspx?city=Kyzyl
- 17k - C
Shoes for Industry. Shoes for the Dead.
For popular search terms (queries with millions of hits), index size doesn't matter much. Yahoo, Google, Ask, MSN, etc. all produce pretty similar results (that tend to favor established sites/pages). For rare terms or combinations, which make up the Long Tail of web search, index size is very important. Both Yahoo and Google report estimated (often inflated) hit counts for popular terms and exact numbers for rare terms, which still include dups. You need to go to the last result page to find the exact non-dup number, which can sometimes be smaller than the reported count by a factor of 10. Let's see how the new Yahoo fares against Google with a few queries I picked randomly:
Yahoo used to consistently underperform Google on rare terms; it seems they have indeed caught up. But it has NOT really exceeded Google in terms of useful size (Yahoo has more dups). Still, it's a worthy engineering effort. Congrats!
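To make the "reported hits vs. non-dup hits" distinction concrete, here is a rough sketch of counting results after collapsing obvious duplicates. It only normalizes URLs (scheme, `www.` prefix, trailing slash), which is an assumption made for the example; real engines presumably also compare page content, and the example.com URLs below are made up.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a host-plus-path key for crude duplicate detection.

    Ignores the scheme, a leading "www.", any trailing slash, and the
    query string. This is a deliberately simple, hypothetical rule.
    """
    # Allow bare "example.com/page" style entries by adding a scheme.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host + path


def dedup_count(urls):
    """Number of distinct results after URL normalization."""
    return len({normalize(url) for url in urls})


# Hypothetical result list: four reported hits, two distinct pages.
reported = [
    "http://www.example.com/page/",
    "http://example.com/page",
    "example.com/page",
    "http://example.com/other",
]
print(len(reported), "reported,", dedup_count(reported), "after de-dup")
```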
The problem is the difference between raw data and useful information.
When you look through a list of restaurants (or the list of anything in the yellow pages), you're looking at something put together based on _semantics_. Some human put that list together and made sure the _meaning_ is what you'd expect there: you can actually drive to one of those locations and order food.
Search engines, on the other hand, just look at the words and have no bloody clue of semantics.
If someone ever put together a list of restaurants, it would just be a list of all people who ever said the word "restaurant". Including everyone who ever said "I hate chinese restaurants" or "I took my gf to a restaurant" or "I went to see a new apartment, but it was above a restaurant" or whatever. Needless to say, driving to most of those locations would be a bloody useless exercise.
Adding another 20 million people to that kind of indexing would just raise the noise-to-signal ratio, not actually produce anything useful.
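The restaurant example above can be turned into a tiny experiment: match on the word alone, then see what fraction of the matches is something you could actually drive to. The three quoted sentences come from the comment; the one "real listing" entry and the extra noise sentence are invented for the illustration, and the usefulness labels are hand-assigned assumptions.

```python
# (text, is_a_usable_restaurant_listing) pairs. Labels are assigned by hand.
documents = [
    ("I hate chinese restaurants", False),
    ("I took my gf to a restaurant", False),
    ("I went to see a new apartment, but it was above a restaurant", False),
    # Hypothetical entry standing in for a real yellow-pages style listing.
    ("Luigi's Restaurant, 12 Main St, open 11am-10pm", True),
]


def keyword_hits(query, docs):
    """Return every document containing the query as a substring."""
    needle = query.lower()
    return [(text, useful) for text, useful in docs if needle in text.lower()]


def precision(hits):
    """Fraction of returned hits that are actually useful."""
    if not hits:
        return 0.0
    return sum(1 for _, useful in hits if useful) / len(hits)


hits = keyword_hits("restaurant", documents)
small_index = precision(hits)

# Grow the index with more pages that merely mention the word.
noise = [("That restaurant scene in the movie was great", False)] * 4
big_index = precision(keyword_hits("restaurant", documents + noise))

print(f"{len(hits)} hits, precision {small_index:.2f}")
print(f"after adding noise: precision {big_index:.3f}")
```

The bigger index returns more hits but a smaller useful fraction, which is the noise-versus-signal point being made.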
A polar bear is a cartesian bear after a coordinate transform.
Google refuses to index pages that aren't linked to by at least a gazillion other sites, submitted or not.
My site, for example, has been up and running for nearly two months, submitted a few times and actually linked to by a few pages that are indexed by Google, but it still doesn't appear *at all* in Google's index, not even far down at the bottom.
Even if you enter site:www.....com in the search bar directly, it just says it doesn't know it. At least Yahoo has got it in there, never mind whether it's highly ranked or not.
Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book