Google's Technology Explored
RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
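To make the quoted routing description concrete, here is a rough sketch, not Google's actual code. It assumes each "set" is a full copy of the index split across shard servers, and that a query goes to the first set that isn't busy. The names `IndexReplica` and `route_query`, and the `busy` flag, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class IndexReplica:
    """One complete copy of the index, split across several shard servers."""
    name: str
    shards: list          # each shard is a dict: term -> set of document ids
    busy: bool = False    # hypothetical load flag, stands in for real load balancing

    def search(self, term):
        # Answering a query needs every shard in the set, as the article says.
        hits = set()
        for shard in self.shards:
            hits |= shard.get(term, set())
        return hits


def route_query(term, replicas):
    """Send the query to the first replica set that is not busy."""
    for replica in replicas:
        if not replica.busy:
            return replica.name, replica.search(term)
    raise RuntimeError("all replica sets are busy")


shards = [{"linux": {1, 2}}, {"linux": {7}, "cluster": {8}}]
replicas = [
    IndexReplica("set-a", shards, busy=True),
    IndexReplica("set-b", shards),
]
served_by, doc_ids = route_query("linux", replicas)
```

Because every set holds the whole index, the same replication that provides failover also lets a second query proceed while the first set is occupied.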
This article explained to me why they would pick up a Microsoft guy who worked on NT. Yes, I'm sure Google's OS and NT have nothing in common, but all the same, this guy seems motivated and smart. And if they have their own custom OS, I'm sure they're not going to make their own distribution, they just need to work in house.
http://news.yahoo.com/news?tmpl=story&u=/zd/20050303/tc_zd/146950
blog:
http://mark-lucovsky.blogspot.com/2005/02/shipping-software.html
I've been putting movie reviews on my web page for a while now, and I've increasingly noticed that Google will point people at them even when they search for stuff that isn't on the page. For example, I've had a number of hits where people searched for 'AvP review' (or suchlike) even though I never use the phrase 'AvP' in my review of Aliens vs Predator.
I was mightily impressed, and not just because it means more people read my stuff. Or at least surf to it.
Another interesting read on search engine technology.
http://www.sigsemis.org/columns/swsearch/SSE1104/document_view/
http://www.cs.wisc.edu/~dusseau/Classes/CS739/Papers/dean.pdf
--
Break the rules. Keep the faith. Fight for love.
They should make a googleCluster Live CD, ala clusterKnoppix, or perhaps use more of clusterKnoppix's features or openMosix to share CPU and memory.
SourceForge is begging for something like this.
Their engineers' desktops have special Google builds of Linux which help them compile things insanely fast with g4, i.e. a hacked p4 (Perforce).
They also have one of the best intranet sites I've seen. Lots of info and services the employees can use, apart from email.
The internal blogs really help with keeping track of projects you're not working on, and what others are doing. Their mailing lists are often useful too; for example, there are lost and found, for sale, and biking partners lists. All kinds of useful little stuff, taking care of the people with nice little things. Lots of reading too.
-- Robi
they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page.
I think what they mean is that they are working on search algorithms that will implement this, not that they have already made it publicly available. They want it to work first, and be released second. The problem you have cropping up most likely occurs with pages that put info in the metadata, so it doesn't show up in the page itself.
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
It says they're using clustering, so it might help eliminate pages that contain the words you're looking for but aren't relevant to your current query, in addition to including pages that are relevant but don't contain the words. For example, the word "tree" may either refer to a data structure (binary, B-, red-black etc.) or to the stuff forests are made of. If my query is "search tree", the words search and tree may show up on a page about people searching for some kind of a tree and on pages about search trees. Assuming they're both popular classes of pages, you're going to end up with some mishmash of results from both classes.
Instead, the clustering algorithm might notice (based on other words that appear on the pages, for example) that pages with 'search' and 'tree' in them fall into two classes. That doesn't help if "search tree" is all it has to go by. But now if I add the words "data structure" to the query, it knows which class of pages I'm interested in, because many pages about binary trees contain the words "data structure" whereas almost none about the quest for trees do. Now it can return pages from the right cluster that it knows are relevant, even if they don't contain the words "data structure".
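Here's a toy version of that idea, purely as a sketch. The pages, their word sets, and the two hand-assigned clusters are all invented for illustration; a real system would learn the clusters from the pages rather than have them typed in. The query is matched against each cluster's combined vocabulary, and every page in the winning cluster is returned.

```python
# Hypothetical pages, each reduced to the set of words it contains.
PAGES = {
    "bst-tutorial": {"binary", "search", "tree", "node", "data", "structure"},
    "red-black-notes": {"red", "black", "tree", "node", "rotation", "search"},
    "christmas-tree-hunt": {"search", "tree", "forest", "pine", "farm"},
    "oak-spotting": {"search", "tree", "forest", "oak", "leaves"},
}

# Assumed cluster assignment; in practice this would come from a clustering step.
CLUSTERS = {
    "computing": ["bst-tutorial", "red-black-notes"],
    "forestry": ["christmas-tree-hunt", "oak-spotting"],
}


def cluster_vocabulary(cluster):
    """Union of all words used by the pages in one cluster."""
    vocab = set()
    for page in CLUSTERS[cluster]:
        vocab |= PAGES[page]
    return vocab


def search(query):
    """Pick the cluster whose vocabulary covers the most query terms."""
    terms = set(query.lower().split())
    scores = {c: len(terms & cluster_vocabulary(c)) for c in CLUSTERS}
    best = max(scores, key=scores.get)
    return best, CLUSTERS[best]


best, pages = search("search tree data structure")
```

With "data structure" added, the computing cluster wins, and "red-black-notes" comes back even though that page never mentions "data" or "structure". With just "search tree" the two clusters tie, which is the mishmash case described above.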
A lot of this stuff is an application of SAN/RAID/failover technology, which is cool (and we've never seen it so pervasively implemented), but not horribly revolutionary. I think the slickest thing they've developed, though it might not get the most attention, is their MapReduce framework. The abstract from their paper:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a _map_ function that processes a key/value pair to generate a set of intermediate key/value pairs, and a _reduce_ function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
It seems that the hard part of building massively parallel applications is efficiently separating the parallel aspects of a problem from the necessarily serial aspects. If you start with a programming framework plus runtime that handles this automatically, it could be a major boon to people running massively parallel applications. Could anyone who does this sort of thing often post their opinion on this?
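For anyone who hasn't read the paper, the model in the abstract can be sketched in a few lines with word counting as the example. This is a single-process toy, not Google's implementation: `run_mapreduce` is a made-up driver name, and it does none of the distribution, scheduling, or fault tolerance that the real runtime handles.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: turn one (key, value) input into intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy sequential driver: map everything, group by key, then reduce."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return {ikey: reduce_fn(ikey, values) for ikey, values in intermediate.items()}


docs = [("a", "the quick fox"), ("b", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user only writes `map_fn` and `reduce_fn`. Since each map call and each per-key reduce call is independent of the others, a framework is free to spread them across many machines, which is the separation of parallel from serial work being asked about.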
All Google has to do now is figure out a way to charge for it.
Don't blame me, I voted for Baltar.
They're not obligated to share unless they are planning on redistributing the software. They are perfectly free to patch their own software and use the patched versions for their servers without sharing those modifications.
The GPL does not force them to do anything unless they wish to redistribute the software.
this is a sig.
That was Terry Gross on NPR's Fresh Air.
That was one of my favorite interviews ever, I think.
Terry is one of the least technical people, probably ever. Yet her interview was still interesting thanks to little tidbits like that!
The undisclosed location was Santa Clara. I won't get more specific than that, sorry. They had a room jam packed with gear that was improperly cabled and spaced, and they didn't want to pay for redundant cooling. Then again, it wasn't a production site. Someone was almost overcome by the heat just walking between rows of cabinets.
The FSF specifically addresses this question in their GPL FAQ, and notes that internal distribution does not require releasing source.
Do their appliances qualify as redistribution?
Technically, they're leasing a black box to you, so they still own the appliance. We have one here at the office, and we're not allowed to open up those pizza boxes. If there's a problem, they ship us another one or send a tech over.
An "I'm Feeling Lucky" search means less time searching for web pages and more time looking at them.
from the "I'm Feeling Lucky™" button. Guess they changed it.
peterrenshaw ~ Another Scrappy Startup
A "grep -R google *" in my 2.6.5 kernel tree returns:
drivers/net/arcfour.c: * by Frank Cusack
drivers/net/ppp_mppe_compress.c: * By Frank Cusack
As established in the links, he works in the Network Working Group at Google.