Google's Bigger Index

Their search has apparently improved as well ! by phoxix · 2004-02-17 04:27 · Score: 4, Informative

Search for any normal product name with google. What did you use to get? Billions of useless sites that cross-link to each other and have the same bloody reviews from amazon.com

That seems to have changed!

I just tried a search on television antennas and for once the results seem relevant.

Hooray!! Google is back!! :^)

Sunny Dubey

They said 6 billion items, not webpages. by LostCluster · 2004-02-17 04:28 · Score: 5, Informative

Notice that they claim that they search 6 billion items, but the home page only claims that they're "Searching 4,285,199,774 web pages".

To find the rest, we need to use Google's other services. The image search is claiming "Searching 880,000,000 images". Google Groups says it's "Searching 845,000,000 messages". Add those to the count and you get 6,010,199,774 items total.
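The sum is easy to check. A minimal sketch that uses only the three counts quoted in this comment (the variable names are illustrative):

```python
# Counts quoted from the three Google front pages mentioned above.
web_pages = 4_285_199_774
images = 880_000_000
usenet_messages = 845_000_000

total_items = web_pages + images + usenet_messages
print(f"{total_items:,}")  # prints 6,010,199,774
```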

Re:how many? by Anonymous Coward · 2004-02-17 04:28 · Score: 5, Informative

That sort of search result spamming is getting out of hand.

Maybe if more people used Google's Search Quality feedback form, it would help weed them out.

Re:how many? by Anonymous Coward · 2004-02-17 04:30 · Score: 1, Informative

Your google is broken. Mine gets me a PDF of a wave-shaper circuit layout see

Google Print by blorg · 2004-02-17 04:33 · Score: 5, Informative

"Google's collection of 6 billion items comprises 4.28 billion web pages, 880 million images, 845 million Usenet messages, and a growing collection of book-related information pages."

I was interested that they mentioned Google Print, which is Google's answer to Amazon's Search Inside feature, but hasn't got much press, and is pretty well hidden in Google itself.

You can check it out by limiting results to site print.google.com, e.g. searchterm site:print.google.com. (Not quite at Amazon-type numbers yet.)

Re:Still nok by happystink · 2004-02-17 04:33 · Score: 2, Informative

Just check the IPs googlebot comes from and ban those if they're not honoring your robots file. That works fine; they have a very set range they use, anything starting with 216.39 or something, I think.

--

sig:
See the "..for smart people" banners Wired runs here? Look elsewhere guys.

Is /. pro Google? by dark-br · 2004-02-17 04:39 · Score: 5, Informative

"Google currently does not allow outsiders to gain access to raw data because of privacy concerns. Searches are logged by time of day, originating I.P. address (information that can be used to link searches to a specific computer), and the sites on which the user clicked. People tell things to search engines that they would never talk about publicly -- Viagra, pregnancy scares, fraud, face lifts. What is interesting in the aggregate can seem an invasion of privacy if narrowed to an individual."

That's a quote from the NYtimes (free req. yada yada) also posted as is here

If any other site were to track the stuff Google does, /. would be up in arms protesting!

Please note, this isn't a troll, and I'm not wearing a tin-foil hat (maybe I should?). Imagine the following scenario: a bomb goes off in the US. By tracing searches for "anarchist cookbook" to zipcodes within the area of the bomb blast, the FBI could have access to information that makes TIA look like a better alternative.

Maybe this isn't such a good feature after all...

Re:What I want to know... by ctishman · 2004-02-17 04:40 · Score: 5, Informative

Use that "Dissatisfied with your search results? Help us improve." link at the bottom of the page. Voila.

Re:Good for Google...but: by BenjyD · 2004-02-17 04:42 · Score: 2, Informative

That's what directories like dmoz.org do. IIRC, google does use directory information, but it is far too hard a problem to automate topic finding without a lot of human editors.
I saw some research recently at a conference that used complex vocabulary matching algorithms to automatically extract topics and organise large numbers of documents into topic hierarchies and present summary reports, but I think that might be a bit too processor intensive and cutting edge, even for google.

Re:What I want to know... by Chris+Croome · 2004-02-17 04:43 · Score: 3, Informative

...is how to get rid of those pseudo-pages in Google. The ones with names like "thing_that_youre_searching_for.html", and all they are is either a page of dead links to crap on ebay, or a "Hey, we do great searches for your stuff".

+1

There are things that you just can't use Google for any more because these googlespam sites score so well... it's like being back in the days before google...

--
Check out MKDoc a mod_perl CMS

It's worth mentioning... by dark-br · 2004-02-17 04:45 · Score: 4, Informative

that not everything about Google is so visible.

One should have a look at Google-Watch (tinfoil? maybe...) but they have some good points:

According to DEA, Google is breaking the law

Google Evil cookie

We got your number!

And so on...

Not to troll but rather a thought. Mod as you wish.

Re:It's worth mentioning... by Comsn · 2004-02-17 07:35 · Score: 2, Informative

One should also have a look at Google-Watch-Watch

which states

Meet Daniel Brandt. He is a self-proclaimed public interest activist and the owner of Google-Watch.org Mr. Brandt founded Google-Watch.org after his own site, Namebase.org, did not get a good Google PageRank.

I doubt it by Aqua+OS+X · 2004-02-17 04:47 · Score: 1, Informative

I doubt it. Google may have more things indexed, but its web search still sucks when compared to Teoma's and its image search still sucks when compared to AllTheWeb's.

Google is most non triumphant.

--
"Things are more moderner than before- bigger, and yet smaller- it's computers-- San Dimas High School football RULES!"

Re:No Good... by glinden · 2004-02-17 04:48 · Score: 4, Informative

I want it to return more relevant searches.

Have you tried some of the Google alternatives? Vivisimo is particularly interesting with its clustering of search results. Teoma is also quite good.

big but far from complete. by selderrr · 2004-02-17 04:52 · Score: 4, Informative

I wrote a project for our univ and submitted the url to google about 3 months ago. It still doesn't show up.

--
When will I end this grieving ? When will my future begin ?

search indexing by stefanmi · 2004-02-17 05:07 · Score: 0, Informative

Also one of the main problems Google is currently having with their search results is that too many blogs are ending up in the top results, often ranking higher than the primary site that contains the information that the blogs refer to (due to many blog-users who heavily cross-link amongst themselves, which ups their rating). To combat this they've already discussed creating a separate category for blogs to help separate these. Good to see them taking a proactive stance: get enough people using your service and you've suddenly got a category of blogs already identified and indexed. I'm giving them the benefit of the doubt as they've always been quite responsible with ads, and while it's a potential revenue stream I don't think they'll ever be as intrusive as other free sites/services.

Mac users' image search by saddino · 2004-02-17 05:08 · Score: 4, Informative

"Google Image Search has been significantly updated," said Sergey Brin, Google co-founder and president of Technology. "We've doubled the index to more than 880 million images, enhanced search quality, and improved the user interface."

For Mac users, I recommend using Beholder to power your Google image search. Google's minimal UI changes notwithstanding.

(Mod +1 Self-Promotive)

Re:Still nok by bad-badtz-maru · 2004-02-17 05:18 · Score: 3, Informative

If googlebot crawls your site, then your robots.txt file is either wrong or in the wrong location. There is no doubt that googlebot follows the robots.txt standard.

It can take a very long time for a site to be spidered after it is submitted via the "add a url" form.
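One way to check whether a robots.txt says what you think it says is to feed it to a parser and ask. A minimal sketch using Python's standard urllib.robotparser; the rules and the example.com URLs are made up for illustration, and this only tests the file's contents, not whether it sits at the site root where crawlers look for it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; substitute your own file's text.
robots_txt = """\
User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def googlebot_allowed(url):
    """Return True if these rules let the Googlebot user agent fetch url."""
    return parser.can_fetch("Googlebot", url)


blocked = googlebot_allowed("http://www.example.com/private/report.html")
allowed = googlebot_allowed("http://www.example.com/index.html")
print(blocked, allowed)  # prints False True
```

If a page you meant to block comes back as allowed, the rules are the problem rather than the crawler.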

Google has a page about this... by SilentT · 2004-02-17 05:21 · Score: 2, Informative

Go here for instructions on removal from their index.

Sturgeon's Law by sarastro_us · 2004-02-17 05:28 · Score: 1, Informative

90% of everything is crap...

Google alternatives: Gigablast by MikeCapone · 2004-02-17 05:36 · Score: 2, Informative

My favourite right now is GigaBlast.

It's still smaller than most other search engines, but it's quite fast, has good relevance and it indexes stuff in real time.

Besides, if you don't find what you are looking for, you can do the same search with 5 other search engines just by clicking on links at the bottom of the results page.

But what I like with Gigablast is that it's always getting better and I feel like part of something that has potential.

--
Treehugger? Treehugger... Treehugger!

Re:How much space do they use for caching? by dildatron · 2004-02-17 05:47 · Score: 2, Informative

I'm a storage engineer, and, to the enterprise, 30TB is peanuts. On a busy day, I have provisioned 30TB in one day to various computers. A typical high-end array (an EMC/Hitachi/HP/etc) usually tops out at around 150TB, but you can have a bunch of them on the same storage area network.

The trick is how to back it all up in shortening backup windows. Things like truecopy work, but take twice the disk space.

--

If you had nuts on your chin, would they be chin nuts?

Re:Run out of indexing space? by dtfinch · 2004-02-17 06:18 · Score: 2, Informative

Since they said they have 4.28 billion searchable pages in the index, and 32 bit integers have a range of about 4.29 billion possible values, I'd say they're pretty close to having to make another upgrade, unless they decide there will never be more than 4.29 billion pages online that searchers would be interested in.
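To put a number on "pretty close": a minimal sketch that assumes, as this comment speculates, one unsigned 32-bit ID per indexed page. Nothing here is confirmed about Google's actual design; the page count is the figure from the home page.

```python
indexed_pages = 4_285_199_774   # "Searching 4,285,199,774 web pages"
id_space = 2 ** 32              # distinct values of an unsigned 32-bit integer

remaining = id_space - indexed_pages
print(f"{id_space:,} possible IDs, {remaining:,} still unused")
print(f"{remaining / id_space:.2%} of the ID space is left")
```

Under that assumption only 9,767,522 values would be left, well under 1% of the range.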

Re:Here's hoping by thestarz · 2004-02-17 06:36 · Score: 3, Informative

Yes, you are missing something. They have reached 6 billion items, only 4 billion of those are web pages, the rest are pictures, usenet messages, etc. RTFA!

--

c++; /* this makes c bigger but returns the old value */

sco fell in "litigious bastard" search. by morcheeba · 2004-02-17 07:12 · Score: 2, Informative

When you search for "litigious bastards", you now get a website promoting the googlebomb technique listed first. The sco group was listed first, but now it's ranked about 47. I'm not sure if they are reducing the relevance of the link-text, or if the ranking has been lowered because the sco group probably doesn't point back at any of the blogs that link to it.

--
HIV Crosses Species Barrier... into Muppets

That is suspiciously close by K-Man · 2004-02-17 07:55 · Score: 2, Informative

It's probably not a big deal to expand the capacity, but it certainly looks like it's pegged to 2^32 for this release.

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Re:Most press-release like post ever by Anonymous Coward · 2004-02-17 08:19 · Score: 1, Informative

"From 3.4bn to 6bn"

That number will likely exceed 10 billion in the near future. Some Google projects are resource constrained, which is astounding considering that the company's computational resources are actually *greater* than publicly disclosed. The scale of the operation is something that most people (in IT, or otherwise) can hardly imagine. Suffice to say that Google is unusual in that marketing people routinely *understate* the numbers that competitors would gleefully overstate.

It is disturbing that no one, not even Microsoft, may be able to catch up to Google for quite some time, simply because of the orchestrated efficiency of Google's processes and the scale of the deployed infrastructure (sorry, I cannot offer any more specifics, it is in my NDA). That is not good for competition, especially with pond-scum word-spammers and useless blog fluff posing a structural challenge to PageRank.

Society could do worse than having a Googleopoly on search. Google is run by good people (ask anyone who works for Google, or any of Google's vendors) and puts a lot of effort into doing the Right Things. Nonetheless, healthy competition is preferable to a comfortable stagnation.

For those of you who were wondering/complaining... by Afromelonhead · 2004-02-17 08:44 · Score: 3, Informative

According to Google's cache of Google, there used to be only 3,307,998,701 pages in their index, as opposed to the 4,285,199,774 (as of writing) in the index.

It's also interesting to note that both have a copyright date of 2004, which would imply that Google has added just under 1 billion pages to the index in a month and a half.

--
Procrastination sucks.

Even better way to report by delfstrom · 2004-02-17 09:54 · Score: 4, Informative

The "help us improve" link is okay, but a little general. Most of us slashdot readers know when a search result is truly bogus, and there's a more advanced form we can use for reporting abusers directly:

http://www.google.com/contact/spamreport.html

This will give you options of reporting cloaked pages, doorway pages, deceptive redirects, misleading or repeated words, hidden text, etc. You have to be more specific than the "help us improve" link at the bottom of search results. Using this form I've seen abusive sites disappear from Google's index in less than 12 hours.

Re:Run out of indexing space? by kindofblue · 2004-02-17 10:23 · Score: 2, Informative

Hypothetically, if web pages were identified with an 8-bit code of 0x01 along with a 32-bit identifier, then one could just assign another code to signify web pages, e.g. codes 0x00-0x7f could be web page codes, 0x80 for PDFs, 0x81 for GIFs, etc. Each code would be combined with a 32-bit int identifier that is unique relative to that code, giving a 40-bit identifier space.
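A minimal sketch of that hypothetical scheme: an 8-bit type code stacked on top of a 32-bit per-type ID, stored in 5 bytes. The code values are the ones suggested above; the function names are invented for illustration, and none of this describes a known Google format.

```python
# Type codes as suggested in the comment (hypothetical).
TYPE_WEB_PAGE = 0x01
TYPE_PDF = 0x80
TYPE_GIF = 0x81


def pack_id(type_code, local_id):
    """Combine an 8-bit type code and a 32-bit per-type ID into one 40-bit integer."""
    if not 0 <= type_code <= 0xFF:
        raise ValueError("type code must fit in 8 bits")
    if not 0 <= local_id <= 0xFFFFFFFF:
        raise ValueError("local id must fit in 32 bits")
    return (type_code << 32) | local_id


def unpack_id(tagged_id):
    """Split a 40-bit tagged ID back into (type_code, local_id)."""
    return tagged_id >> 32, tagged_id & 0xFFFFFFFF


def to_bytes(tagged_id):
    """Serialize a tagged ID as exactly 5 big-endian bytes."""
    return tagged_id.to_bytes(5, "big")


doc = pack_id(TYPE_PDF, 123_456)
print(unpack_id(doc), len(to_bytes(doc)))  # prints (128, 123456) 5
```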

As for the space required, they must have gone beyond 32 bits for on-disk identifiers. URLs and cached pages easily take a lot more space than a 5-byte to 8-byte (64-bit) identifier, so they've definitely got the storage. For archival purposes, 64 bits is ample space and small.

But a good reason to keep identifier sizes small is so that they don't take up much RAM space. That's why variable sized IDs would be useful. They are a simple fast form of compression. UTF-8 is a variable sized encoding that uses 8 bits to encode the vast majority of characters used in English (ASCII) and uses between 2 and 4 bytes for other, less common character codes (symbols and other language characters). This is done by using the leading bits of the first byte to indicate how large that variable-sized character is. (I don't remember the details, however.) The effect is that on average for English, most strings would consume slightly more than 8 bits per character.
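For reference, the UTF-8 details are short enough to show. In this sketch the helper name is mine; the bit patterns are the standard UTF-8 lead-byte prefixes, and the result is checked against what Python's own encoder produces for a 1-, 2-, 3- and 4-byte character.

```python
def sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence occupies, judged from its first byte."""
    if lead_byte >> 7 == 0b0:
        return 1          # 0xxxxxxx: plain ASCII
    if lead_byte >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4          # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")


# 'a', e-acute, euro sign, and an emoji: one character of each encoded size.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), format(encoded[0], "08b"),
          sequence_length(encoded[0]))
```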

The same principle would work for any variable sized identifier, e.g. useful for doc IDs or word/term IDs. The most common web pages (yahoo, hotmail, msn, nytimes, etc) would have very high page rank and could be given small IDs, e.g. 16 bits (2-bit code, 14-bit id). Same thing for frequent words, "whether", "while", "with", "over", or closed-class words. Compress them to small IDs.
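A minimal sketch of such a variable sized ID, assuming a 2-bit length code in the top bits of the first byte. Only the 16-bit case (2-bit code, 14-bit id) comes from this comment; the other three sizes and the function names are assumptions made to complete the example.

```python
# 2-bit length code -> total encoded size in bytes (assumed size table).
SIZES = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}


def encode_id(value):
    """Encode a non-negative ID in the smallest size whose payload can hold it."""
    if value < 0:
        raise ValueError("id must be non-negative")
    for code, size in SIZES.items():
        payload_bits = size * 8 - 2          # 2 bits are spent on the length code
        if value < (1 << payload_bits):
            return ((code << payload_bits) | value).to_bytes(size, "big")
    raise ValueError("id does not fit in 62 bits")


def decode_id(data):
    """Decode one ID from the start of data; return (value, bytes_consumed)."""
    size = SIZES[data[0] >> 6]
    raw = int.from_bytes(data[:size], "big")
    return raw & ((1 << (size * 8 - 2)) - 1), size


for doc_id in (5, 10_000, 5_000_000_000):
    encoded = encode_id(doc_id)
    print(doc_id, len(encoded), decode_id(encoded))
```

Frequently used IDs that land in the 1- or 2-byte forms cost far less than a fixed 4 bytes, while the 8-byte form still reaches well past the 32-bit limit.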

Anyway the point is that you could have an effective id space of much greater than 32 bits and yet use much less than 32-bits per identifier on average. Every search engine must have dispensed with the 32-bit barrier by their beta phase, unless they're run by idiots. Maybe that's Microsoft's problem.
