Google's Bigger Index

Their search has apparently improved as well! by phoxix · 2004-02-17 04:27 · Score: 4, Informative

Search for any normal product name with google. What did you use to get? Billions of useless sites that cross-link to each other and have the same bloody reviews from amazon.com.

That seems to have changed!

I just tried a search on television antennas and for once the results seem relevant.

Hooray!! Google is back!! :^)

Sunny Dubey

They said 6 billion items, not webpages. by LostCluster · 2004-02-17 04:28 · Score: 5, Informative

Notice that they claim that they search 6 billion items, but the home page only claims that they're "Searching 4,285,199,774 web pages".

To find the rest, we need to use Google's other services. The image search is claiming "Searching 880,000,000 images". Google Groups says it's "Searching 845,000,000 messages". Add those to the count and you get 6,010,199,774 items total.
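The tally is easy to check. A quick sketch in Python, using only the three counts quoted from the service front pages (the variable names are just for illustration):

```python
# Counts as quoted on each Google service's front page.
web_pages = 4_285_199_774
images = 880_000_000
usenet_messages = 845_000_000

total_items = web_pages + images + usenet_messages
print(f"{total_items:,}")  # 6,010,199,774
```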

Re:how many? by Anonymous Coward · 2004-02-17 04:28 · Score: 5, Informative

That sort of search result spamming is getting out of hand.

Maybe if more people used Google's Search Quality feedback form, it would help weed them out.

Google Print by blorg · 2004-02-17 04:33 · Score: 5, Informative

"Google's collection of 6 billion items comprises 4.28 billion web pages, 880 million images, 845 million Usenet messages, and a growing collection of book-related information pages."

I was interested that they mentioned Google Print, which is Google's answer to Amazon's Search Inside feature, but hasn't got much press, and is pretty well hidden in Google itself.

You can check it out by limiting results to site print.google.com, e.g. searchterm site:print.google.com. (Not quite at Amazon-type numbers yet.)

Re:Still nok by happystink · 2004-02-17 04:33 · Score: 2, Informative

Just check the IPs googlebot comes from and ban those if they're not honoring your robots.txt file. That works fine; they have a very set range they use, anything starting with 216.39 or something, I think.

--

sig:
See the "..for smart people" banners Wired runs here? Look elsewhere guys.

Is /. pro Google? by dark-br · 2004-02-17 04:39 · Score: 5, Informative

"Google currently does not allow outsiders to gain access to raw data because of privacy concerns. Searches are logged by time of day, originating I.P. address (information that can be used to link searches to a specific computer), and the sites on which the user clicked. People tell things to search engines that they would never talk about publicly -- Viagra, pregnancy scares, fraud, face lifts. What is interesting in the aggregate can seem an invasion of privacy if narrowed to an individual."

That's a quote from the NYtimes (free req. yada yada) also posted as is here

If any other site were to track the stuff Google does, /. would be up in arms protesting!

Please note, this isn't a troll, and I'm not wearing a tin-foil hat (maybe I should?). Imagine the following scenario: a bomb goes off in the US. By tracing searches for "anarchist cookbook" to zipcodes within the area of the bomb blast, the FBI could have access to information that makes TIA look like a better alternative.

Maybe this isn't such a good feature after all...

Re:What I want to know... by ctishman · 2004-02-17 04:40 · Score: 5, Informative

Use that "Dissatisfied with your search results? Help us improve." link at the bottom of the page. Voila.

Re:Good for Google...but: by BenjyD · 2004-02-17 04:42 · Score: 2, Informative

That's what directories like dmoz.org do. IIRC, google does use directory information, but it is far too hard a problem to automate topic finding without a lot of human editors.
I saw some research recently at a conference that used complex vocabulary matching algorithms to automatically extract topics, organise large numbers of documents into topic hierarchies and present summary reports, but I think that might be a bit too processor intensive and cutting edge, even for google.

Re:What I want to know... by Chris+Croome · 2004-02-17 04:43 · Score: 3, Informative

...is how to get rid of those pseudo-pages in Google. The ones with names like "thing_that_youre_searching_for.html", and all they are is either a page of dead links to crap on ebay, or a "Hey, we do great searches for your stuff".

+1

There are things that you just can't use Google for any more because these googlespam sites score so well... it's like being back in the days before google...

--
Check out MKDoc a mod_perl CMS

It's worth mentioning... by dark-br · 2004-02-17 04:45 · Score: 4, Informative

that not everything about Google is so visible.

One should have a look at Google-Watch (tinfoil? maybe...) but they have some good points:

According to DEA, Google is breaking the law

Google Evil cookie

We got your number!

And so on...

Not to troll but rather a thought. Mod as you wish.

Re:It's worth mentioning... by Comsn · 2004-02-17 07:35 · Score: 2, Informative

One should also have a look at Google-Watch-Watch

which states

Meet Daniel Brandt. He is a self-proclaimed public interest activist and the owner of Google-Watch.org. Mr. Brandt founded Google-Watch.org after his own site, Namebase.org, did not get a good Google PageRank.

Re:No Good... by glinden · 2004-02-17 04:48 · Score: 4, Informative

I want it to return more relevant searches.

Have you tried some of the Google alternatives? Vivisimo is particularly interesting with its clustering of search results. Teoma is also quite good.

big but far from complete. by selderrr · 2004-02-17 04:52 · Score: 4, Informative

I wrote a project for our univ and submitted the url to google about 3 months ago. It still doesn't show up.

--
When will I end this grieving ? When will my future begin ?

Mac users' image search by saddino · 2004-02-17 05:08 · Score: 4, Informative

"Google Image Search has been significantly updated," said Sergey Brin, Google co-founder and president of Technology. "We've doubled the index to more than 880 million images, enhanced search quality, and improved the user interface."

For Mac users, I recommend using Beholder to power your Google image search, Google's minimal UI changes notwithstanding.

(Mod +1 Self-Promotive)

Re:Still nok by bad-badtz-maru · 2004-02-17 05:18 · Score: 3, Informative

If googlebot crawls your site, then your robots.txt file is either wrong or in the wrong location. There is no doubt that googlebot follows the robots.txt standard.
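One way to sanity-check the rules themselves is to run them through a parser before blaming the crawler. A minimal sketch using Python's standard urllib.robotparser; the robots.txt contents, the example.com URLs and the is_allowed helper are all made up for illustration, and this says nothing about how googlebot itself evaluates the file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; substitute the rules your site actually serves.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot is blocked from /private/ but may fetch everything else.
print(is_allowed("Googlebot", "http://example.com/private/page.html"))  # False
print(is_allowed("Googlebot", "http://example.com/index.html"))         # True
```

Remember that the file also has to be served from the site root (/robots.txt), which this check does not verify.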

It can take a very long time for a site to be spidered after it is submitted via the "add a url" form.

Google has a page about this... by SilentT · 2004-02-17 05:21 · Score: 2, Informative

Go here for instructions on removal from their index.

Google alternatives: Gigablast by MikeCapone · 2004-02-17 05:36 · Score: 2, Informative

My favourite right now is GigaBlast.

It's still smaller than most other search engines, but it's quite fast, has good relevance and it indexes stuff in real time.

Besides, if you don't find what you are looking for, you can do the same search with 5 other search engines just by clicking on links at the bottom of the results page.

But what I like with Gigablast is that it's always getting better and I feel like part of something that has potential.

--
Treehugger? Treehugger... Treehugger!

Re:How much space do they use for caching? by dildatron · 2004-02-17 05:47 · Score: 2, Informative

I'm a storage engineer, and, to the enterprise, 30TB is peanuts. On a busy day, I have provisioned 30TB in one day to various computers. A typical high-end array (an EMC/Hitachi/HP/etc) usually tops out at around 150TB, but you can have a bunch of them on the same storage area network.

The trick is how to back it all up in shortening backup windows. Things like truecopy work, but take twice the disk space.

--

If you had nuts on your chin, would they be chin nuts?

Re:Run out of indexing space? by dtfinch · 2004-02-17 06:18 · Score: 2, Informative

Since they said they have 4.28 billion searchable pages in the index, and 32 bit integers have a range of about 4.29 billion possible values, I'd say they're pretty close to having to make another upgrade, unless they decide there will never be more than 4.29 billion pages online that searchers would be interested in.
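To put a number on "pretty close": a small sketch of the headroom, assuming (as this comment speculates, nothing Google has confirmed) that each page gets an unsigned 32-bit ID. The constant names are illustrative.

```python
INDEXED_PAGES = 4_285_199_774  # page count quoted from Google's home page
UINT32_VALUES = 2 ** 32        # distinct values of an unsigned 32-bit integer

headroom = UINT32_VALUES - INDEXED_PAGES
print(f"{UINT32_VALUES:,} possible IDs, {headroom:,} left")
# 4,294,967,296 possible IDs, 9,767,522 left

# Fraction of the hypothetical 32-bit ID space already consumed.
print(f"{INDEXED_PAGES / UINT32_VALUES:.2%} of the ID space used")
```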

Re:Here's hoping by thestarz · 2004-02-17 06:36 · Score: 3, Informative

Yes, you are missing something. They have reached 6 billion items, only 4 billion of those are web pages, the rest are pictures, usenet messages, etc. RTFA!

--

c++; /* this makes c bigger but returns the old value */

sco fell in "litigious bastard" search. by morcheeba · 2004-02-17 07:12 · Score: 2, Informative

When you search for "litigious bastards", you now get a website promoting the googlebomb technique listed first. The sco group was listed first, but now it's ranked about 47. I'm not sure if they are reducing the relevance of the link-text, or if the ranking has been lowered because the sco group probably doesn't point back at any of the blogs that link to it.

--
HIV Crosses Species Barrier... into Muppets

That is suspiciously close by K-Man · 2004-02-17 07:55 · Score: 2, Informative

It's probably not a big deal to expand the capacity, but it certainly looks like it's pegged to 2^32 for this release.

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

For those of you who were wondering/complaining... by Afromelonhead · 2004-02-17 08:44 · Score: 3, Informative

According to Google's cache of Google, there used to be only 3,307,998,701 pages in their index, as opposed to the 4,285,199,774 (as of writing) in the index.

It's also interesting to note that both have a copyright date of 2004, which would imply that Google has added just under 1 billion web pages in a month and a half.

--
Procrastination sucks.

Even better way to report by delfstrom · 2004-02-17 09:54 · Score: 4, Informative

The "help us improve" link is okay, but a little general. Most of us slashdot readers know when a search result is truly bogus, and there's a more advanced form we can use for reporting abusers directly:

http://www.google.com/contact/spamreport.html

This will give you options of reporting cloaked pages, doorway pages, deceptive redirects, misleading or repeated words, hidden text, etc. You have to be more specific than the "help us improve" link at the bottom of search results. Using this form I've seen abusive sites disappear from Google's index in less than 12 hours.

Re:Run out of indexing space? by kindofblue · 2004-02-17 10:23 · Score: 2, Informative

Hypothetically, if web pages were identified with an 8-bit code of 0x01 along with a 32-bit identifier, then one could just assign another code to signify web pages, e.g. codes 0x00-0x7f could be web page codes, 0x80 for PDFs, 0x81 for GIFs, etc. Each code would be combined with a 32-bit int identifier that is unique relative to that code, giving a 40-bit identifier space.
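A rough sketch of that scheme, purely hypothetical like the comment itself: the type codes are the comment's example values, and pack_doc_id / unpack_doc_id are invented names, not anything Google is known to use.

```python
from typing import Tuple

TYPE_BITS = 8                 # width of the document-type code
ID_BITS = 32                  # width of the per-type identifier
ID_MASK = (1 << ID_BITS) - 1

# Hypothetical type codes, following the comment's example values.
TYPE_WEB_PAGE = 0x01
TYPE_PDF = 0x80
TYPE_GIF = 0x81


def pack_doc_id(type_code: int, local_id: int) -> int:
    """Combine an 8-bit type code and a 32-bit local ID into one 40-bit ID."""
    if not 0 <= type_code < (1 << TYPE_BITS):
        raise ValueError("type code must fit in 8 bits")
    if not 0 <= local_id <= ID_MASK:
        raise ValueError("local ID must fit in 32 bits")
    return (type_code << ID_BITS) | local_id


def unpack_doc_id(doc_id: int) -> Tuple[int, int]:
    """Split a 40-bit ID back into (type_code, local_id)."""
    return doc_id >> ID_BITS, doc_id & ID_MASK


doc_id = pack_doc_id(TYPE_PDF, 5)
print(hex(doc_id))            # 0x8000000005
print(unpack_doc_id(doc_id))  # (128, 5)
```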

As for the space required, they must have gone beyond 32 bits for on-disk identifiers. URLs and cached pages easily take a lot more space than a 5-byte to 8-byte (64-bit) identifier, so they've definitely got the storage. For archival purposes, 64 bits is ample space and small.

But a good reason to keep identifier sizes small is so that they don't take up much RAM space. That's why variable sized IDs would be useful. They are a simple fast form of compression. UTF-8 is a variable sized encoding that uses 8 bits to encode the characters used in English (ASCII) and between 2 and 4 bytes for other less common character codes (symbols and other language characters). This is done by using the leading bits of the first byte to indicate how large that variable-sized character is. (I don't remember the details, however.) The effect is that on average for English, most strings would consume slightly more than 8 bits per character.
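For the record, UTF-8 signals the length with the run of leading 1 bits in the first byte: 0xxxxxxx is a 1-byte character, 110xxxxx starts a 2-byte one, 1110xxxx a 3-byte one and 11110xxx a 4-byte one. A small sketch that decodes the length from the lead byte and checks it against Python's own encoder (the function name is just for illustration):

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the total byte length of a UTF-8 sequence from its lead byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: ASCII
    if first_byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if first_byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if first_byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")


# 'a', e-acute, the euro sign and an emoji take 1, 2, 3 and 4 bytes respectively.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s)")
```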

The same principle would work for any variable sized identifier, e.g. useful for doc IDs or word/term IDs. The most common web pages (yahoo, hotmail, msn, nytimes, etc) would have very high page rank and could be given small IDs, e.g. 16 bits (2-bit code, 14-bit id). Same thing for frequent words, "whether", "while", "with", "over", or closed-class words. Compress them to small IDs.
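Here is one way such an ID could look, as a sketch only. The layout is an assumption made for this example: the top 2 bits of the first byte select a total size of 1, 2, 4 or 8 bytes, leaving 6, 14, 30 or 62 bits for the ID, so the "2-bit code, 14-bit id" case is the 2-byte form. encode_id and decode_id are invented names.

```python
from typing import Tuple

# length code -> (total bytes, payload bits); a hypothetical layout.
_LAYOUT = {0b00: (1, 6), 0b01: (2, 14), 0b10: (4, 30), 0b11: (8, 62)}


def encode_id(value: int) -> bytes:
    """Encode value in the smallest form whose payload can hold it."""
    if value < 0:
        raise ValueError("IDs must be non-negative")
    for code, (size, bits) in _LAYOUT.items():
        if value < (1 << bits):
            # Put the 2-bit length code above the payload bits.
            return ((code << bits) | value).to_bytes(size, "big")
    raise ValueError("ID does not fit in 62 bits")


def decode_id(data: bytes) -> Tuple[int, int]:
    """Return (value, bytes consumed) for the ID at the start of data."""
    size, bits = _LAYOUT[data[0] >> 6]
    if len(data) < size:
        raise ValueError("truncated ID")
    prefixed = int.from_bytes(data[:size], "big")
    return prefixed & ((1 << bits) - 1), size


# Small (popular) IDs are short; large ones still fit.
for value in (5, 16_383, 16_384, 2 ** 32):
    print(value, len(encode_id(value)))
```

Handing out the small values in rank order is what would make the popular pages cheap to store; anything above 2^30 falls through to the 8-byte form, so the effective ID space is well past 32 bits.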

Anyway the point is that you could have an effective id space of much greater than 32 bits and yet use much less than 32-bits per identifier on average. Every search engine must have dispensed with the 32-bit barrier by their beta phase, unless they're run by idiots. Maybe that's Microsoft's problem.
