Google's Technology Explored
RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
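To make the quoted routing description concrete, here is a toy sketch of it. This is only one reading of the article, not Google's actual code: `ReplicaSet`, `route`, and the `busy` flag are hypothetical names, and each index server is modelled as a plain dict mapping a word to a set of document ids.

```python
class ReplicaSet:
    """One complete copy of the index, split across several index servers."""

    def __init__(self, name, shards):
        self.name = name
        self.shards = shards   # list of dicts: word -> set of doc ids (assumed layout)
        self.busy = False      # hypothetical load flag

    def search(self, word):
        # Answering a query needs every server in the set,
        # because each one holds only a slice of the index.
        hits = set()
        for shard in self.shards:
            hits |= shard.get(word, set())
        return hits


def route(replica_sets, word):
    """Send the query to the first replica set that is not busy."""
    for replica in replica_sets:
        if not replica.busy:
            return replica.name, replica.search(word)
    raise RuntimeError("all replica sets are busy")


# Two identical sets: the replication is both the fail-safe and the extra throughput.
shards = [{"linux": {1, 2}}, {"linux": {7}, "google": {8}}]
set_a = ReplicaSet("A", shards)
set_b = ReplicaSet("B", shards)

set_a.busy = True                       # set A is tied up with another query
print(route([set_a, set_b], "linux"))   # the query goes to set B instead
```

A real system would presumably query the servers in parallel and balance on measured load rather than a boolean, but the shape is the same: any one complete set can answer any query.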
I've been putting movie reviews on my web page for a while now, and I've increasingly noticed that Google will point people at them even when they search for terms that aren't on the page. For example, I've had a number of hits from people searching for 'AvP review' (or suchlike), even though I never use the phrase 'AvP' in my review of Aliens vs Predator.
I was mightily impressed, and not just because it means more people read my stuff. Or at least surf to it.
They should make a googleCluster Live CD, a la clusterKnoppix, or perhaps use more of clusterKnoppix's features or openMosix to share CPU and memory.
SourceForge is begging for something like this.
Their engineers' desktops have special Google builds of Linux which help them compile things insanely fast with g4, i.e. a hacked p4 (Perforce).
They also have one of the best intranet sites I've seen. Lots of info and services the employees can use, apart from email.
The internal blogs really help with keeping track of projects you're not working on, and what others are doing. Their mailing lists are often useful too; for example, there are lost and found, for sale, and biking partners lists. All kinds of useful little stuff, taking care of the people with nice little things. Lots of reading too.
-- Robi
they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page.
I think what they mean is that they are working on search algorithms that will implement this, not that they have already made it publicly available. They want it to work first, and be released second. The behaviour you're seeing most likely comes from pages that put info in their metadata, so the search terms don't show up in the visible page itself.
It says they're using clustering, so it might help eliminate pages that contain the words you're looking for but aren't relevant to your current query, in addition to including pages that are relevant but don't contain the words. For example, the word "tree" may either refer to a data structure (binary, B-, red-black, etc.) or to the stuff forests are made of. If my query is "search tree", the words search and tree may show up on a page about people searching for some kind of a tree and on pages about search trees. Assuming they're both popular classes of pages, you're going to end up with some mishmash of results from both classes.
Instead, the clustering algorithm might notice (based on other words that appear on the pages, for example) that pages with 'search' and 'tree' in them fall into two classes. That doesn't help if "search tree" is all it has to go by. But now if I add the words "data structure" to the query, it knows which class of pages I'm interested in, because many pages about binary trees contain the words "data structure" whereas almost none about the quest for trees do. Now it can return pages from the right cluster that it knows are relevant, even if they don't contain the words "data structure".
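A toy version of that idea fits in a few lines. To be clear, this is a guess at the general approach and not whatever Google runs: the greedy Jaccard clustering, the 0.25 threshold, the sample pages, and the names `cluster_pages` and `search` are all made up for illustration.

```python
def words(text):
    return set(text.lower().split())


def jaccard(a, b):
    # Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical).
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.25):
    """Greedy clustering: join the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"names": [...], "vocab": set of words}
    for name, text in pages.items():
        w = words(text)
        best = max(clusters, key=lambda c: jaccard(w, c["vocab"]), default=None)
        if best is not None and jaccard(w, best["vocab"]) >= threshold:
            best["names"].append(name)
            best["vocab"] |= w
        else:
            clusters.append({"names": [name], "vocab": set(w)})
    return clusters


def search(clusters, query):
    """Return every page in the cluster(s) whose vocabulary best matches the query."""
    q = words(query)
    scores = [len(q & c["vocab"]) for c in clusters]
    top = max(scores)
    return [n for c, s in zip(clusters, scores) if s == top for n in c["names"]]


# Hypothetical sample pages.
PAGES = {
    "bst_intro": "binary search tree data structure node insert",
    "rb_notes": "red black search tree node insert rotate balance",
    "xmas": "search for the perfect christmas tree farm pine",
    "forest": "search the forest for an old oak tree pine hike",
}

clusters = cluster_pages(PAGES)
print(search(clusters, "search tree"))                 # ambiguous: both classes tie
print(search(clusters, "search tree data structure"))  # only the data-structure class
```

With "search tree" alone both clusters score the same, so you get the mishmash. Adding "data structure" tips it to the first cluster, and `rb_notes` comes back even though neither "data" nor "structure" appears on that page.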
They're not obligated to share unless they are planning on redistributing the software. They are perfectly free to patch their own software and use the patched versions for their servers without sharing those modifications.
The GPL does not force them to do anything unless they wish to redistribute the software.