Google's Technology Explored
RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
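For anyone trying to picture the query flow the article describes, here is a minimal sketch, assuming a toy in-memory index. The class names (`IndexShard`, `ReplicaSet`), the `route` function and the "least busy copy" rule are all made up for illustration; the article does not say how Google actually picks a set of servers.

```python
from dataclasses import dataclass


@dataclass
class IndexShard:
    """One index server: holds the postings for a slice of the documents."""
    postings: dict[str, set[int]]

    def search(self, terms: list[str]) -> set[int]:
        # Documents in this shard that contain every query term.
        sets = [self.postings.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()


@dataclass
class ReplicaSet:
    """One complete copy of the index, spread over several shards."""
    shards: list[IndexShard]
    in_flight: int = 0  # queries currently being served by this copy

    def search(self, terms: list[str]) -> set[int]:
        self.in_flight += 1
        try:
            hits: set[int] = set()
            for shard in self.shards:  # a real system would fan out in parallel
                hits |= shard.search(terms)
            return hits
        finally:
            self.in_flight -= 1


def route(replicas: list[ReplicaSet], query: str) -> set[int]:
    # Hypothetical routing rule: send the query to the least busy full copy.
    target = min(replicas, key=lambda r: r.in_flight)
    return target.search(query.lower().split())


def build_replica() -> ReplicaSet:
    # Two shards, each indexing a different slice of the document IDs.
    return ReplicaSet(shards=[
        IndexShard({"google": {1, 2}, "linux": {2}}),
        IndexShard({"google": {7}, "linux": {7, 8}}),
    ])


replicas = [build_replica(), build_replica()]
results = route(replicas, "Google Linux")
```

Answering one query touches every shard of one copy, and adding a second copy both survives a failure and doubles the number of queries that can be in progress at once, which is the throughput point made in the quote.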
That's not how Google does it! This is their REAL secret:
http://www.google.com/technology/pigeonrank.html
If we could /. google, that would impress me
It really is amazing to think of the amount of information and data that we can access so quickly these days. When I stop and think about what my little search query goes through to bring me an almost instant response, it almost seems impossible. Of course the search engine side of this is only one example, but it's a nifty insight into how powerful our infrastructure is these days. Bravo, mankind.
and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page
So that's why I can search on the result page for my original query and find nothing. And all this time I was blaming Internet Explorer!
Coder's Stone: The programming language quick ref for iPad
The technology that is truly astounding is Google's ability to cache itself. Yeah, think about THAT one for a while.
This article explained to me why they would pick up a Microsoft guy who worked on NT. Yes, I'm sure Google's OS and NT have nothing in common, but all the same, this guy seems motivated and smart. And if they have their own custom OS, I'm sure they're not going to make their own distribution, they just need to work in house.
http://news.yahoo.com/news?tmpl=story&u=/zd/20050303/tc_zd/146950
blog:
http://mark-lucovsky.blogspot.com/2005/02/shipping-software.html
Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.
/.ing could do this. On the other hand, they have a level of redundancy and up time many businesses would kill for.
Gee.. I wish our
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
It's also amazing how much of the general knowledge of the world we *can't* access, because it's unconnected or unpublished.
Just think about how vast and extensive Google's search is, and then think about how little of the World's knowledge and creative achievement it actually can access.
The quantity and breadth of human knowledge is breathtaking, no?
so that pages can match even if none of the words in your query actually appear on the page.
Even pages that come up in my search results now that contain my query don't even have anything to do with what I am looking for. Isn't this just adding to the problem?
How about a "Did you mean?" option that compares against related topics instead of spelling?
From the summary:
they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page.
From the help guide:
By default, Google only returns pages that include all of your search terms.
Which of these is correct? If it's the summary, is there any way to turn this behaviour off? I find it immensely annoying.
Guy asked me for a quarter for a cup of coffee. So I bit him.
Theoretically, he said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cooking" is a good match even though it contains none of the query words.
One word: cooking.
I'm sure the principle is sound. I just think the example is a leetle bit flawed.
What I say does not represent the views of my employers, my friends, my cats, or myself.
I hate that. Don't you hate that? When you type in a search keyword, isn't it because you want that keyword to appear in the documents you find?
This "find tangentially related documents" feature will be fine so long as they make it optional and set it to be off by default. Otherwise, I don't want their idea of what pages I should be looking at polluting my results list.
I call "innovation for the sake of innovation".
--
What short sigs we have -
One hundred and twenty chars!
Too short for haiku.
I've been putting movie reviews on my web page for a while now, and I've increasingly noticed that Google will point people at them even though they search for stuff that isn't on the page. For example, I've had a number of hits where people search for 'AvP review' (or suchlike), even though I never include the phrase 'AvP' in my review of Aliens vs Predator.
I was mightily impressed, and not just because it means more people read my stuff. Or at least surf to it.
Here it is, from one of the Google guys:
Google: A Behind-the-Scenes Look.
Simpy
http://www.google.com/jobs/lunar_job.html
a snippet:
Hivemind harvest in progress..
http://www.cs.wisc.edu/~dusseau/Classes/CS739/Papers/dean.pdf
--
Break the rules. Keep the faith. Fight for love.
Do they share these patches with everyone else?
I always thought that they used NT + Access Database.
They should make a googleCluster Live CD, a la clusterKnoppix, or perhaps use more of clusterKnoppix's features or openMosix to share CPU and memory.
SourceForge is begging for something like this.
Their engineer desktops have special Google builds of Linux which help them compile things insanely fast with g4, i.e. a hacked p4 (Perforce).
They also have one of the best intranet sites I've seen. Lots of info and services the employees can use, apart from email.
The internal blogs really help with keeping track of projects you're not working on, and what others are doing. Their mailing lists are often useful too; for example, there's a lost and found, a for sale, and a biking partners list. All kinds of useful little stuff, taking care of the people with little nice things. Lots of reading too.
-- Robi
and the obvious question:
where are the patches?
Anybody know? This is not a GPL question, just an ethical one.
" pages can match even if none of the words in your query actually appear on the page"
The main flaw I've found in Google's results has been when it returns pages without one of my query words, which doesn't respond to the sense of my query. Sometimes it's changed page content at the same URL, so I go back and get the "cached" page, if it exists. The cached pages reveal in their headings whether the page matched only because the query word was found only in another page linking to the returned page. I'd like their immediate results to show that distinction, and to have links in the results to click around those pages related by my complete query. The current click/back/"cache" combinations are frustratingly disconnected, conflicting with Google's otherwise smooth immediacy.
--
make install -not war
Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.
"You don't have just one data center," he said, "you have multiples."
The real idea behind Google Maps is that as the server catches fire, it uses its last cycles to send an email to the nearest fire chief and include a map. I think it would also throw in a GMail invite as an incentive.
.\.\att Clare
that the virus which used Google could not do it with tens of thousands of computers, it is not likely that /. can do it.
I prefer the "u" in honour as it seems to be missing these days.
Interesting addendum to that question - Is Google infringing upon copyrighted information by caching EVERY page they run across? That seems like pulling massive amounts of copyrighted Java code or design code or images or etc. into their server for 'personal' use...? Does this break any laws?
My little site.
The word "cheap" is used four times in the C|Net article that describes Google's "secret of success": "buying relatively cheap machines", "cheap commodity PCs", "(Power) becomes a factor in running cheaper operations", "not just buying cheaper components".
They say being frugal is a virtue, which Google has, evidently. What is the lesson here? Holding down the cost and being innovative never fail. I guess.
Sun and Fun
A lot of this stuff is application of SAN/RAID/Failover technology, which is cool (and we've never seen it so pervasively implemented), but not horribly revolutionary. I think the slickest thing they've developed, though it might not get the most attention, is their MapReduce framework. The abstract from their paper:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a _map_ function that processes a key/value pair to generate a set of intermediate key/value pairs, and a _reduce_ function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
It seems that the hard part of building massively parallel applications is efficiently separating the parallel aspects of a problem from the necessarily serial aspects. If you start with a programming framework+runtime that handles this automatically, this could be a major boon to people running massively parallel applications. Could anyone who does this sort of thing often post their opinion on this?
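To make the abstract concrete, here is a toy single-process sketch of the map/reduce programming model, using word counting as the example. The names `map_reduce`, `word_count_map` and `word_count_reduce` are invented for illustration, and nothing here reflects how Google's runtime distributes the work across machines.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn over (key, value) inputs, group by intermediate key, reduce."""
    intermediate = defaultdict(list)
    # Map phase: each input pair yields zero or more intermediate pairs.
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: merge all values that share an intermediate key.
    return {ikey: reduce_fn(ikey, ivalues)
            for ikey, ivalues in intermediate.items()}


def word_count_map(doc_name, text):
    # Emit (word, 1) for every word in the document.
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    # Sum the partial counts for one word.
    return sum(counts)


docs = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog and the fox"),
]
counts = map_reduce(docs, word_count_map, word_count_reduce)
```

The user only writes the two small functions; because each map call and each reduce call is independent of the others, a runtime is free to spread them over many machines, which is the separation of parallel from serial work mentioned above.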
All Google has to do now is figure out a way to charge for it.
Don't blame me, I voted for Baltar.
Heh, well they could NEVER do that :)
Here's another great idea you inspired that they could also never do (being a commercial company themselves and all).
When I am searching I virtually always want to do one of two distinct things:
1) Search only commercial sites for a product to purchase.
2) Search everything but commercial sites for information.
There really should be a "$" flag that you could add (or at least a "!$" flag) to control whether you see commercial or non-commercial sites in the results list.
Contrary to popular belief, coding is not all free blow-jobs and beer. Those things cost MONEY!
I think the only reason other companies don't do as well as Google is due to either laziness or ignorance of some basic things and some advanced things. An index is not the most groundbreaking thing in the world. Job delegation and breaking up work is not that groundbreaking either. Clustering has been around in concept since forever. Now I ask you, the public, not just you iibbmm, how many applications have you done that use these concepts? Most biz concepts are very simple. They don't try to implement vertex cover or try to solve 3-SAT and other NP-complete problems.
Not to downplay Google. Google did a great job of implementing a lot of these things: indexing, job delegation and maybe a good bureaucracy. Larger companies either are lazy, ignorant or simply don't have to. I've worked for a few companies that "don't have to", but lord, if those places weren't so ignorant or lazy, they could be powerhouses just by what they could do...
-
ping -f 255.255.255.255 # if only
My wife is studying Library Information Science. In one class, she studied information retrieval. Here's what's interesting: It appears that although Google has much success with determining relevance by using PageRank, it's still very literal about the words you pick. Although it appears to do stemming (e.g. 'runner' matches 'running'), it doesn't do anything about synonyms. Now, here, I'll point out that the textbook for my wife's class was written in like 1995. In the SECOND CHAPTER, they talk about basic query techniques that make use of patterns in documents and AUTOMATICALLY derive what words are synonyms or in some way semantically related. These are long-solved problems. Some search engines employ human-generated lists of synonyms, and there are whole databases you can download that contain semantic networks.
So, WHY, I ask, is google only now getting around to using these techniques?
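As a rough illustration of deriving related words from patterns in documents, here is a minimal sketch of one generic approach: represent each term by the documents it occurs in and compare terms by cosine similarity. This is an assumption about the kind of technique meant, not the textbook's method and not anything Google is known to use; the function names and the tiny corpus are made up.

```python
import math
from collections import Counter, defaultdict


def term_vectors(docs):
    """Map each term to a Counter of {document id: occurrences}."""
    vectors = defaultdict(Counter)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            vectors[term][doc_id] += 1
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def related_terms(term, vectors, top_n=3):
    """Rank other terms by how similarly they are spread over the documents."""
    target = vectors.get(term, Counter())
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != term]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Made-up corpus, purely for illustration.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "banana bread recipe",
]
vectors = term_vectors(docs)
best_match = related_terms("engine", vectors, top_n=1)[0][0]
```

On this corpus "engine" ranks "repair" first because the two always appear together, while "banana" scores zero against "car". Plain co-occurrence like this finds related words more readily than true synonyms, which often need context-based comparison instead.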
Anyhow, the article mentioned that in these early datacentres they experienced something like a 25% hardware failure rate, but that it didn't matter because the software worked around it and the hardware was cheap.
Here's a link to the page where I read all this neat stuff. It's probably mostly about the same stuff as the article we've all just slashdotted, but I won't be able to tell for a while....
Never eat more than you can lift -- Miss Piggy
Why not enhance the robots.txt format to include a max crawl rate variable? Let the webmaster specify how often a robot is allowed to crawl a page.
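Something close to this already exists as a non-standard extension: some crawlers honor `Crawl-delay` and `Request-rate` lines in robots.txt, and Python's standard `urllib.robotparser` can read them. A minimal sketch, assuming a made-up robots.txt and a hypothetical crawler name `ExampleBot`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the directives below are extensions
# that only some crawlers respect.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Request-rate: 1/5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds a polite crawler should wait between requests, or None if unset.
delay = parser.crawl_delay("ExampleBot")

# Named tuple (requests, seconds), or None if unset.
rate = parser.request_rate("ExampleBot")

# Ordinary allow/deny check for a specific URL.
allowed = parser.can_fetch("ExampleBot", "http://example.com/private/page.html")
```

Whether a given robot obeys these values is entirely up to the robot, so this is a request from the webmaster rather than an enforced limit.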
http://brandonbloom.name
"they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page."
I have yet to see a "hit" served up by Google that didn't have any of the words I searched for and was still relevant. It's especially annoying when I search for exact phrases (such as an error message) and I get something completely different. It's a waste of time so far.
An "I'm Feeling Lucky" search means less time searching for web pages and more time looking at them.
from the "I'm Feeling Lucky™" button. Guess they changed it.
peterrenshaw ~ Another Scrappy Startup
It's not a great example, but my mind seems to have gone temporarily blank of words that have many synonyms :(
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment