Google's Bigger Index
WebGangsta writes "Google Inc. today announced it expanded the breadth of its web index to more than 6 billion items. This innovation represents a milestone for Internet users, enabling quick and easy access to the world's largest collection of online information."
Search for any normal product name with Google. What did you used to get? Billions of useless sites that cross-link to each other and have the same bloody reviews from amazon.com.
:^)
That seems to have changed!
I just tried a search on television antennas and for once the results seem relevant.
Hooray!! Google is back!!
Sunny Dubey
Notice that they claim that they search 6 billion items, but the home page only claims that they're "Searching 4,285,199,774 web pages".
To find the rest, we need to use Google's other services. The image search is claiming "Searching 880,000,000 images". Google Groups says it's "Searching 845,000,000 messages". Add those to the count and you get 6,010,199,774 items total.
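A quick check of that sum, as a sketch in Python. The three figures are the ones quoted from Google's pages above; the variable names are only for illustration.

```python
# Counts as quoted from Google's own pages in the comment above.
web_pages = 4_285_199_774
images = 880_000_000
group_messages = 845_000_000

total_items = web_pages + images + group_messages
print(f"{total_items:,}")  # 6,010,199,774
```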
Maybe if more people used Google's Search Quality feedback form, it would help weed them out.
Your google is broken. Mine gets me a PDF of a wave-shaper circuit layout see
I was interested that they mentioned Google Print, which is Google's answer to Amazon's Search Inside feature, but hasn't got much press, and is pretty well hidden in Google itself.
You can check it out by limiting results to site print.google.com, e.g. searchterm site:print.google.com. (Not quite at Amazon-type numbers yet.)
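For anyone who wants to script that kind of site-restricted query, here is a minimal sketch. It assumes the familiar https://www.google.com/search?q=... URL form, and `site_search_url` is a made-up helper name, not anything Google provides.

```python
from urllib.parse import urlencode

def site_search_url(term, site):
    """Build a search URL whose query is '<term> site:<site>'."""
    query = f"{term} site:{site}"
    # urlencode escapes the space and the colon so the URL stays valid.
    return "https://www.google.com/search?" + urlencode({"q": query})

print(site_search_url("searchterm", "print.google.com"))
```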
Just check the IPs Googlebot comes from and ban those if they're not honoring your robots.txt file. That works fine; they use a very set range, anything starting with 216.39 or something, I think.
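A rough sketch of that kind of prefix check, using Python's standard `ipaddress` module. The 216.39.0.0/16 range is only the half-remembered prefix from the comment above written as a network, not a verified list of Googlebot addresses, and `in_suspect_range` is a hypothetical helper.

```python
import ipaddress

# Assumption: the "216.39" prefix recalled in the comment, expressed as a /16.
# This is NOT a verified list of Googlebot addresses.
SUSPECT_RANGE = ipaddress.ip_network("216.39.0.0/16")

def in_suspect_range(addr):
    """Return True if the dotted-quad address falls inside the assumed range."""
    return ipaddress.ip_address(addr) in SUSPECT_RANGE

print(in_suspect_range("216.39.48.1"))  # inside the assumed range
print(in_suspect_range("216.40.0.1"))   # outside it
```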
sig:
See the "..for smart people" banners Wired runs here? Look elsewhere guys.
That's a quote from the NYtimes (free req. yada yada) also posted as is here
If any other site were to track the stuff Google does,
Please note, this isn't a troll, and I'm not wearing a tin-foil hat (maybe I should?). Imagine the following scenario: a bomb goes off in the US. By tracing searches for "anarchist cookbook" to zipcodes within the area of the bomb blast, the FBI could have access to information that makes TIA look like a better alternative.
Maybe this isn't such a good feature after all...
Use that "Dissatisfied with your search results? Help us improve." link at the bottom of the page. Voila.
That's what directories like dmoz.org do. IIRC, google does use directory information, but it is far too hard a problem to automate topic finding without a lot of human editors.
I saw some research recently at a conference that used complex vocabulary matching algorithms to automatically extract topics and organise large numbers of documents into topic hierarchies and present summary reports, but I think that might be a bit too processor intensive and cutting edge, even for Google.
+1
There are things that you just can't use Google for any more because these googlespam sites score so well... it's like being back in the days before Google...
Check out MKDoc a mod_perl CMS
One should have a look at Google-Watch (tinfoil? maybe...) but they have some good points:
According to DEA, Google is breaking the law
Google Evil cookie
We got your number!
And so on...
Not to troll but rather a thought. Mod as you wish.
I doubt it. Google may have more things indexed, but its web search still sucks when compared to Teoma's and its image search still sucks when compared to AllTheWeb's.
Google is most non triumphant.
"Things are more moderner than before- bigger, and yet smaller- it's computers-- San Dimas High School football RULES!"
- I want it to return more relevant searches.
Have you tried some of the Google alternatives? Vivisimo is particularly interesting with its clustering of search results. Teoma is also quite good.
I wrote a project for our univ and submitted the URL to Google about 3 months ago. It still doesn't show up.
When will I end this grieving ? When will my future begin ?
Also, one of the main problems Google is currently having with their search results is that too many blogs are ending up in the top results, often ranking higher than the primary site that contains the information the blogs refer to (many blog users cross-link heavily amongst themselves, which ups their rating). To combat this they've already discussed creating a separate category for blogs to help separate these. Good to see them taking a proactive stance -- get enough people using your service and you've suddenly got a category of blogs already identified and indexed. I'm giving them the benefit of the doubt as they've always been quite responsible with ads, and while it's a potential revenue stream I don't think they'll ever be as intrusive as other free sites/services.
"Google Image Search has been significantly updated," said Sergey Brin, Google co-founder and president of Technology. "We've doubled the index to more than 880 million images, enhanced search quality, and improved the user interface."
For Mac users, I recommend using Beholder to power your Google image search, Google's minimal UI changes notwithstanding.
(Mod +1 Self-Promotive)
If googlebot crawls your site, then your robots.txt file is either wrong or in the wrong location. There is no doubt that googlebot follows the robots.txt standard.
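One way to sanity-check what a robots.txt actually tells a crawler is Python's standard `urllib.robotparser`. This is just a sketch: the rules and the example.com URLs below are invented for illustration, and it only shows how a standards-following parser reads them, not what Googlebot itself does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; it must be served from the site root
# (e.g. http://example.com/robots.txt) for crawlers to find it.
rules = """\
User-agent: Googlebot
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

blocked = parser.can_fetch("Googlebot", "http://example.com/private/page.html")
allowed = parser.can_fetch("Googlebot", "http://example.com/index.html")
print(blocked, allowed)  # False True
```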
It can take a very long time for a site to be spidered after it is submitted via the "add a url" form.
Go here for instructions on removal from their index.
90% of everything is crap...
My favourite right now is GigaBlast.
It's still smaller than most other search engines, but it's quite fast, has good relevance and it indexes stuff in real time.
Besides, if you don't find what you are looking for, you can do the same search with 5 other search engines just by clicking on links at the bottom of the results page.
But what I like with Gigablast is that it's always getting better and I feel like part of something that has potential.
Treehugger? Treehugger... Treehugger!
I'm a storage engineer, and, to the enterprise, 30TB is peanuts. On a busy day, I have provisioned 30TB in one day to various computers. A typical high-end array (an EMC/Hitachi/HP/etc) usually tops out at around 150TB, but you can have a bunch of them on the same storage area network.
The trick is how to back it all up within shrinking backup windows. Things like TrueCopy work, but take twice the disk space.
If you had nuts on your chin, would they be chin nuts?
Since they said they have 4.28 billion searchable pages in the index, and 32 bit integers have a range of about 4.29 billion possible values, I'd say they're pretty close to having to make another upgrade, unless they decide there will never be more than 4.29 billion pages online that searchers would be interested in.
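To put numbers on that, here is a small sketch comparing the home-page count quoted earlier in the thread with the size of an unsigned 32-bit ID space. Whether Google actually uses 32-bit page IDs is this thread's speculation, not something confirmed here.

```python
indexed_pages = 4_285_199_774   # home-page count quoted earlier in the thread
id_space = 2 ** 32              # distinct values of an unsigned 32-bit integer

headroom = id_space - indexed_pages
print(f"{id_space:,}")   # 4,294,967,296
print(f"{headroom:,}")   # 9,767,522
```

So if page IDs were 32-bit, fewer than ten million would be left unused.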
Yes, you are missing something. They have reached 6 billion items, only 4 billion of those are web pages, the rest are pictures, usenet messages, etc. RTFA!
c++;
When you search for "litigious bastards", you now get a website promoting the googlebomb technique listed first. The SCO Group was listed first, but now it's ranked about 47. I'm not sure if they are reducing the relevance of the link text, or if the ranking has been lowered because the SCO Group probably doesn't point back at any of the blogs that link to it.
HIV Crosses Species Barrier... into Muppets
It's probably not a big deal to expand the capacity, but it certainly looks like it's pegged to 2^32 for this release.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
"From 3.4bn to 6bn"
That number will likely exceed 10 billion in the near future. Some Google projects are resource constrained, which is astounding considering that the company's computational resources are actually *greater* than publicly disclosed. The scale of the operation is something that most people (in IT, or otherwise) can hardly imagine. Suffice to say that Google is unusual in that marketing people routinely *understate* the numbers that competitors would gleefully overstate.
It is disturbing that no one, not even Microsoft, may be able to catch up to Google for quite some time, simply because of the orchestrated efficiency of Google's processes and the scale of the deployed infrastructure (sorry, I cannot offer any more specifics, it is in my NDA). That is not good for competition, especially with pond-scum word-spammers and useless blog fluff posing a structural challenge to PageRank.
Society could do worse than having a Googleopoly on search. Google is run by good people (ask anyone who works for Google, or any of Google's vendors) and puts a lot of effort into doing the Right Things. Nonetheless, healthy competition is preferable to a comfortable stagnation.
It's also interesting to note that both have a copyright date of 2004, which would imply that Google has found just under 1 billion websites in a month and a half, which seems like an interesting fact.
Procrastination sucks.
http://www.google.com/contact/spamreport.html
This will give you options of reporting cloaked pages, doorway pages, deceptive redirects, misleading or repeated words, hidden text, etc. You have to be more specific than the "help us improve" link at the bottom of search results. Using this form I've seen abusive sites disappear from Google's index in less than 12 hours.
As for the space required, they must have gone beyond 32 bits for on-disk identifiers. URLs and cached pages easily take a lot more space than a 5-byte to 8-byte (64-bit) identifier, so they've definitely got the storage. For archival purposes, 64 bits is ample space and small.
But a good reason to keep identifier sizes small is so that they don't take up much RAM space. That's why variable sized IDs would be useful. They are a simple fast form of compression. UTF-8 is a variable sized encoding that uses 8 bits to encode the vast majority of characters used in English (ASCII) and uses between 2 bytes and 4 bytes for other less common character codes (symbols and other language characters). This is done by using the leading bits of the first byte to indicate how many bytes that variable-sized character occupies. (I don't remember the details, however.) The effect is that on average for English, most strings would consume slightly more than 8 bits per character.
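For the curious, the UTF-8 details are easy to see for yourself. In UTF-8 the first byte starts with 0 for a 1-byte character, 110 for 2 bytes, 1110 for 3 and 11110 for 4. A small sketch (the sample characters are arbitrary picks):

```python
# One sample character per UTF-8 length: 'a', e-acute, euro sign, an emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    # Show the code point, byte count, and the bits of the first byte.
    print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  first byte {encoded[0]:08b}")

print(lengths)  # [1, 2, 3, 4]
```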
The same principle would work for any variable sized identifier, e.g. useful for doc IDs or word/term IDs. The most common web pages (yahoo, hotmail, msn, nytimes, etc) would have very high PageRank and could be given small IDs, e.g. 16 bits (2-bit code, 14-bit id). Same thing for frequent words, "whether", "while", "with", "over", or closed-class words. Compress them to small IDs.
Anyway the point is that you could have an effective id space of much greater than 32 bits and yet use much less than 32-bits per identifier on average. Every search engine must have dispensed with the 32-bit barrier by their beta phase, unless they're run by idiots. Maybe that's Microsoft's problem.
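Here is a minimal sketch of the scheme described above: a 2-bit size code in the top of the first byte, followed by the ID itself. The size table (1, 2, 4 or 8 bytes) and the names `encode_id` and `decode_id` are assumptions made for illustration; nothing here is known to match how any real search engine stores its IDs.

```python
# Total encoded size in bytes for each 2-bit size code (an assumed layout).
SIZES = (1, 2, 4, 8)

def encode_id(n):
    """Encode a non-negative ID in the smallest size whose payload fits it."""
    if n < 0:
        raise ValueError("id must be non-negative")
    for code, size in enumerate(SIZES):
        payload_bits = size * 8 - 2          # 2 bits are spent on the size code
        if n < (1 << payload_bits):
            return ((code << payload_bits) | n).to_bytes(size, "big")
    raise ValueError("id does not fit in 62 bits")

def decode_id(buf, offset=0):
    """Decode one ID starting at offset; return (id, offset of the next ID)."""
    size = SIZES[buf[offset] >> 6]           # top 2 bits of the first byte
    raw = int.from_bytes(buf[offset:offset + size], "big")
    mask = (1 << (size * 8 - 2)) - 1
    return raw & mask, offset + size

# Small (popular) IDs cost 1 or 2 bytes; IDs past 2**32 still fit in 8.
stream = b"".join(encode_id(n) for n in (5, 1000, 2**32 + 7))
print(len(stream))  # 1 + 2 + 8 = 11 bytes for three IDs
```

With this layout the 2-byte form carries exactly the 14-bit ID mentioned above, and the ID space reaches 62 bits while frequently used IDs stay well under 32 bits each.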