Google's Technology Explored
RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
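For anyone trying to picture the routing the article describes, here is a minimal Python sketch. It assumes each replica holds one complete copy of the index split into shards, and that a new query goes to whichever replica is least busy. The names (`Replica`, `route_query`, the `in_flight` counter) and the tiny in-memory index are made up for illustration; nothing here is Google's actual code.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One complete copy of the index, split into shards (word -> doc ids)."""
    name: str
    shards: list
    in_flight: int = 0  # hypothetical load counter: queries being served

    def search(self, word):
        # Fan the query out to every shard and merge the partial results.
        hits = set()
        for shard in self.shards:
            hits |= shard.get(word, set())
        return sorted(hits)


def route_query(replicas, word):
    """Send the query to the least-busy replica; any replica can answer it."""
    replica = min(replicas, key=lambda r: r.in_flight)
    replica.in_flight += 1
    try:
        return replica.name, replica.search(word)
    finally:
        replica.in_flight -= 1


# Two shards together make up one full index; each replica serves a full copy.
shard_a = {"google": {1, 4}, "linux": {2}}
shard_b = {"google": {7}, "cluster": {3, 7}}
replicas = [
    Replica("set-1", [shard_a, shard_b], in_flight=5),  # busy
    Replica("set-2", [shard_a, shard_b], in_flight=0),  # idle
]

chosen, doc_ids = route_query(replicas, "google")
print(chosen, doc_ids)
```

Because every replica can answer any query alone, adding replicas buys both fail-over and throughput, which is the point the quote makes.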
That's not how Google does it! This is their REAL secret:
http://www.google.com/technology/pigeonrank.html
If we could /. google, that would impress me
It really is amazing to think of the amount of information and data that we can access so quickly these days. When I stop and think about what my little search query goes through to bring me an almost instant response, it almost seems impossible. Of course the search engine side of this is only one example, but it's a nifty insight into how powerful our infrastructure is these days. Bravo, mankind.
and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page
So that's why I can search on the result page for my original query and find nothing. And all this time I was blaming Internet Explorer!
Coder's Stone: The programming language quick ref for iPad
The technology that is truly astounding is Google's ability to cache itself. Yeah, think about THAT one for a while.
This article explained to me why they would pick up a Microsoft guy who worked on NT. Yes, I'm sure Google's OS and NT have nothing in common, but all the same, this guy seems motivated and smart. And if they have their own custom OS, I'm sure they're not going to make their own distribution, they just need to work in house.
http://news.yahoo.com/news?tmpl=story&u=/zd/20050303/tc_zd/146950
blog:
http://mark-lucovsky.blogspot.com/2005/02/shipping-software.html
Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.
/.ing could do this. On the other hand, they have a level of redundancy and up time many businesses would kill for.
Gee.. I wish our
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
It's also amazing how much of the general knowledge of the world we *can't* access, because it's unconnected or unpublished.
Just think about how vast and extensive Google's search is, and then think about how little of the World's knowledge and creative achievement it actually can access.
The quantity and breadth of human knowledge is breathtaking, no?
so that pages can match even if none of the words in your query actually appear on the page.
Even pages that come up in my search results now that contain my query don't even have anything to do with what I am looking for. Isn't this just adding to the problem?
How about a Did you mean? option that doesn't compare against spelling, but related topics instead?
From the summary:
they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page.
From the help guide:
By default, Google only returns pages that include all of your search terms.
Which of these is correct? If it's the summary, is there any way to turn this behaviour off? I find it immensely annoying.
Guy asked me for a quarter for a cup of coffee. So I bit him.
Theoretically, he said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cooking" is a good match even though it contains none of the query words.
One word: cooking.
I'm sure the principle is sound. I just think the example is a leetle bit flawed.
What I say does not represent the views of my employers, my friends, my cats, or myself.
I hate that. Don't you hate that? When you type in a search keyword, isn't it because you want that keyword to appear in the documents you find?
This "find tangentially related documents" feature will be fine so long as they make it optional and set it to be off by default. Otherwise, I don't want their idea of what pages I should be looking at polluting my results list.
I call "innovation for the sake of innovation".
--
What short sigs we have -
One hundred and twenty chars!
Too short for haiku.
I've been putting movie reviews on my web page for a while now, and I've increasingly noticed that Google will point people at them even though they search for stuff that isn't on the page. For example, I've had a number of hits where people search for 'AvP review' (or suchlike) even though I never include the phrase 'AvP' in my review of Aliens vs Predator.
I was mightily impressed, and not just because it means more people read my stuff. Or at least surf to it.
Another interesting read on search engine technology.
http://www.sigsemis.org/columns/swsearch/SSE1104/document_view/
Here it is, from one of the Google guys:
Google: A Behind-the-Scenes Look.
Simpy
http://www.google.com/jobs/lunar_job.html
a snippet:
Hivemind harvest in progress..
Looks like I was too dmub for my own good.
--
What short sigs we have -
One hundred and twenty chars!
Too short for haiku.
http://www.cs.wisc.edu/~dusseau/Classes/CS739/Papers/dean.pdf
--
Break the rules. Keep the faith. Fight for love.
Do they share these patches with everyone else?
I always thought that they used NT + Access Database.
They should make a googleCluster Live CD.. ala clusterKnoppix.. ..or perhaps use more of clusterKnoppix features or openmosix..share cpu/mem..
sourceforge is begging for something like this..
Their engineer desktops have special google builds of linux which help them compile things insanely fast with g4, ie hacked p4 (Perforce).
They also have one of the best intranet sites I've seen. Lots of info and services the employees can use, apart from email.
The internal blogs really help with keeping track of projects you're not working on, and what others are doing. Their mailing lists are often useful too; for example, there's a lost and found, for sale, and biking partners list. All kinds of useful little stuff, taking care of the people with little nice things. Lots of reading too.
-- Robi
and the obvious question:
where are the patches?
Anybody know? This is not a GPL question, just an ethical one.
" pages can match even if none of the words in your query actually appear on the page"
The main flaw I've found in Google's results has been when it returns pages without one of my query words, which doesn't respond to the sense of my query. Sometimes it's changed page content at the same URL, so I go back and get the "cached" page, if it exists. The cached pages reveal in their headings whether the page matched only because the query word was found only in another page linking to the returned page. I'd like their immediate results to show that distinction, and to have links in the results to click around those pages related by my complete query. The current click/back/"cache" combinations are frustratingly disconnected, conflicting with Google's otherwise smooth immediacy.
--
make install -not war
I saw this one hour presentation on one of the 9000 channels offered with Dish TV. I think the show was something like "Computer Engineering Technology". They'll probably run that episode again.
Anyways, I thought it was interesting and if you get that channel (I think it's by Washington U), you can see it too.
You should have used Google ;)
Guy asked me for a quarter for a cup of coffee. So I bit him.
Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.
"You don't have just one data center," he said, "you have multiples."
The real idea behind Google Maps is that as the server catches fire, it uses its last cycles to send an email to the nearest fire chief and include a map. I think it would also throw in a GMail invite for incentive.
.\.\att Clare
No, wait, Gaelic is Ireland. Never mind.
Anyway, you're fine, but the alternative definition for celver is "take one's own sister carnally" so maybe you were being dmub.
Look, I dunno what I'm talking about. What are you reading this for anyway? Get back to work!
Question -- and this may be a dumb one, but I'm going to ask it anyway:
How much of what Google is doing -- the clustering, the redundancy, the sub-categorization -- how much of this (if any) could be described -- could fit under the mantle of "Peer-to-Peer"? Is anything that Google is doing here remotely considered P2P? (Even if the P2P is what's going on on their own, in-house servers?)
Obviously, I ask this because of the upcoming Supreme Court case. And I ask because it struck me as I read the article that what Google is doing *seems* to be breaking down complex tasks and simplifying them so that they work across the network -- their network, your network -- and I wonder if this is (in theory?) what Peer-to-Peer is doing?
(I'm thinking, too, of the Google concept of "shards" and how their data is distributed.)
that the virus which used Google could not do it with tens of thousands of computers, it is not likely that /. can do it.
I prefer the "u" in honour as it seems to be missing these days.
I think CmdrTaco uses Safari to administer the site.*
When you look at Safari what do you see? Apple + Google. One is always shaped by one's environment.
--------------------
*I saw it when I was trying to like the new Screen Savers. I've since returned my digital cable box.
.\.\att Clare
I'd love to see the original article's server log file to see how many hits come from the MSN Search dev team.
"Me claiming Satan exist is just as valid as you claiming an atom exists" - 1inChrist
Let me guess... the pages that match just happen to point to advertisers?
Am I part of the core demographic for Swedish Fish?
Sometimes this IS what I want. For instance, maybe I don't know what I'm looking for, thus finding similar concepts can be very handy.
Perhaps Google Search Exact and Google Search General buttons in addition to the Do you feel lucky, Punk? button?
Stupidity... has a habit of getting its way.
The word "cheap" is used 4 times in the C|Net article that describes Google's "secret of success" -- "buying relatively cheap machines", "cheap commodity PCs", "(Power) becomes a factor in running cheaper operations", "not just buying cheaper components".
They say being frugal is a virtue, which Google has, evidently. What is the lesson here? Holding down the cost and being innovative never fail. I guess.
Sun and Fun
"The downside to cheap machines is, you have to make them work together reliably," Hoelzle said. "These things are cheap and easy to put together. The problem is, these things break.
;when he refers to "cheap machines," is he speaking of software or hardware (or both)? the reason i find this interesting is that linux has a reputation for being very stable as a server operating system and i'm wondering what exactly is "failing" on these so-called "cheap machines"--the operating system or the hardware....
;treehead
"If any part Linux was stolen, then Windows was the biggest heist in history."
What I find to be truly amazing is that there are people who don't believe in 'black magic' like this. ... functionality, because a couple old goats didn't understand it. Seriously.
Where I work, we needed a revision in how data was stored for our applications. What we came up with was rather similar to what Google does, though on a little smaller scale.
What happened to the project? It was torpedoed, sabotaged, generally screwed in the
Stupidity... has a habit of getting its way.
I've pretty much given up hope. All search engines do these days is spit out ad-populated and commercial websites trying to sell something. I'm not trying to single out Google here.. but their search results are not much different from any other query engine. Try any search today, any topic, and the first 20 results will be for pages trying to sell something, useless portals filled with links or *slightly* relevant pages absolutely crammed full of ads on the left and right, top and bottom.. google, yahoo, a9, whatever... they're all pretty much the same.
I guess the good old days are long gone... it's too bad.
Google might do itself and its users a favour.... Rank down any page with a '$' in it... and any page with more than 15-20 links.... Just dump those into the abyss.
r.a.s.1974
It's obviously not nearly as important as Google's search engine, so that's ok.
http://www-db.stanford.edu/~backrub/google.html
Why is there so much "google" on slashdot? I don't get it. Are they these days all the industry has to offer?
Google == great, but not everything.
A lot of this stuff is application of SAN/RAID/Failover technology, which is cool (and we've never seen it so pervasively implemented), but not horribly revolutionary. I think the slickest thing they've developed, though it might not get the most attention, is their MapReduce framework. The abstract from their paper:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a _map_ function that processes a key/value pair to generate a set of intermediate key/value pairs, and a _reduce_ function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
It seems that the hard part of building massively parallel applications is efficiently separating the parallel aspects of a problem from the necessarily serial aspects. If you start with a programming framework+runtime that handles this automatically, this could be a major boon to people running massively parallel applications. Could anyone who does this sort of thing often post their opinion on this?
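To make the abstract a bit more concrete, here is a toy word count in Python written in the map/reduce shape it describes. This is only a single-process sketch: `run_mapreduce` is a made-up stand-in for the runtime, and the real framework's job is to spread the map and reduce calls across many machines and cope with failures, none of which is shown here.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


docs = [("d1", "cheap machines break"), ("d2", "cheap machines are cheap")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

The user only writes `map_fn` and `reduce_fn`; everything in `run_mapreduce` is the part a framework could parallelize, since map calls are independent of each other and so are reduce calls for different keys.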
All Google has to do now is figure out a way to charge for it.
Don't blame me, I voted for Baltar.
You mean DotDot.org
"lacks the backing of a serious, committed enterprise"
yes, because Google is not a serious, committed enterprise, right?
I would have modded parent as funny.
How do I find the book.
Google really slaps together a pile of junk.
Parts fail left and right, and nobody bothers
to fix them. The software hides all this from
the users.
Google even checksums the data, on the assumption
that it is frequently getting corrupted by all the
junk hardware they buy.
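A rough Python sketch of what "checksum the data and let the software hide the failure" can look like. It assumes a CRC32 is stored next to each copy of a chunk and that a reader falls back to another copy when one fails verification; the `store` and `read_verified` helpers are invented for this example, and the checksum Google actually uses is not stated in the thread.

```python
import zlib


def store(chunk: bytes):
    """Keep a CRC32 checksum next to the data when it is written."""
    return {"data": chunk, "crc": zlib.crc32(chunk)}


def read_verified(copies):
    """Return the first copy whose data still matches its stored checksum."""
    for copy in copies:
        if zlib.crc32(copy["data"]) == copy["crc"]:
            return copy["data"]
    raise IOError("all copies are corrupt")


original = b"index block 42"
copy_a = store(original)
copy_b = store(original)

# Simulate a flaky disk flipping a byte in the first copy.
copy_a["data"] = b"index blocK 42"

recovered = read_verified([copy_a, copy_b])
print(recovered)
```

The caller never sees the bad copy, which is the sense in which the software hides the junk hardware from the users.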
His name is Hercule. :P
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
I think the only reason other companies don't do as well as google is due to either laziness or ignorance to some basic things and some advanced things. An index is not the most ground breaking thing in the world. Job delegation and breaking up work is not that ground breaking either. Clustering has been around in concept since forever. Now I ask you, the public, not just you iibbmm, how many applications have you done that use these concepts? Most biz concepts are very simple. They don't try to implement vertex cover or try and do the 3CSAT NP-Complete problems.
Not to downplay Google. Google did a great job of implementing a lot of these things: indexing, job delegation and maybe a good bureaucracy. Larger companies either are lazy, ignorant or simply don't have to. I've worked for a few companies that "don't have to", but lord, if those places weren't so ignorant or lazy, they could be powerhouses just by what they could do...
-
ping -f 255.255.255.255 # if only
I believe history will show that Google is man's first successful attempt at a system that has some amount of AI.
so that pages can match even if none of the words in your query actually appear on the page
Look, I put the phrase in quotes because *that's the phrase I'm looking for*. Lately, I've been getting results (even the cached version!) from searches which don't have the quoted strings I'm looking for. Grr.
My wife is studying Library Information Science. In one class, she studied information retrieval. Here's what's interesting: It appears that although Google has much success with determining relevance by using PageRank, it's still very literal about the words you pick. Although it appears to do stemming (i.e. 'runner' matches 'running'), it doesn't do anything about synonyms. Now, here, I'll point out that the textbook for my wife's class was written in like 1995. In the SECOND CHAPTER, they talk about basic query techniques that make use of patterns in documents and AUTOMATICALLY derive what words are synonyms or in some way semantically related. These are long-solved problems. Some search engines employ human-generated lists of synonyms, and there are whole databases you can download that contain semantic networks.
So, WHY, I ask, is google only now getting around to using these techniques?
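As a hedged illustration of the "patterns in documents" idea, here is a small Python sketch that scores how related two words are by how much the sets of documents they appear in overlap (Jaccard similarity). That measure is my own pick for the example, not necessarily what the textbook or Google uses, and the four sample documents are invented.

```python
from collections import defaultdict


def build_postings(docs):
    """Map each word to the set of document ids it appears in."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            postings[word].add(doc_id)
    return postings


def related_terms(postings, term, min_score=0.5):
    """Rank other words by Jaccard overlap of the documents they occur in."""
    base = postings.get(term, set())
    scored = []
    for other, doc_ids in postings.items():
        if other == term or not base:
            continue
        score = len(base & doc_ids) / len(base | doc_ids)
        if score >= min_score:
            scored.append((other, round(score, 2)))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


docs = {
    1: "berkeley vegetarian cooking courses",
    2: "berkeley cooking class schedule",
    3: "oakland cooking class",
    4: "linux cluster howto",
}
postings = build_postings(docs)
print(related_terms(postings, "class"))
```

With a corpus this small the scores mean little; the point is only that relatedness falls out of co-occurrence counts with no hand-built synonym list. Doing it reliably over billions of pages is a different problem.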
Why not enhance the robots.txt format to include a max crawl rate variable? Let the webmaster specify how often a robot is allowed to crawl a page.
http://brandonbloom.name
well, we could introduce a setting into robots.txt where we can tell google how often they can spider your site...
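Something close to this already exists as a non-standard `Crawl-delay` line that some crawlers honor; whether any particular robot respects it is up to that robot. A quick sketch of reading it with Python's standard `urllib.robotparser` (the `crawl_delay` method needs Python 3.6 or later). The robots.txt content, the `ExampleBot` name and the example.com URL are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; Crawl-delay is in seconds.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

delay = parser.crawl_delay("ExampleBot")  # "ExampleBot" is a made-up name
allowed = parser.can_fetch("ExampleBot", "http://example.com/private/page")
print(delay, allowed)
```

A polite crawler would sleep for `delay` seconds between requests to the same host and skip any URL for which `can_fetch` returns False.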
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
How are Mac users ever going to overcome their reputation as blithering idiots if we keep letting them talk?
When I tried to log into my Gmail account at the beginning of the beta program, I got a Debian welcome screen.
Posted in my blog.
DNA in your Linux: DNALinux
Bloody Imperial, not your wimpy pints.
Sometimes seventeen/Syllables aren't enough to/Express a complete
Nope, that's the Indian version of Slashdot, you Insensitive CLOD!
http://www.robotwisdom.com/
Corporate desktops often have > 80gb of space per system
Much of that space is going unused (if an average of 40gb/system is unused, even 100 desktops present us with 4tb of unused space!)
With tech similar to Google's index/sharding/chunking concepts we could easily put that extra space to good use as a backup repository, and have adequate redundancy
We'd need to add some good encryption, though
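A back-of-the-envelope version of the above in Python, plus one possible way to decide which desktops hold which chunk. The replication factor of 3 and the rendezvous-style hashing (rank every host by a hash of chunk id plus host name, keep the first few) are my assumptions, not something from the post or from Google, and the host and chunk names are invented. Encryption is left out entirely.

```python
import hashlib


def place_chunk(chunk_id, hosts, copies=3):
    """Pick `copies` distinct hosts for a chunk, ranked by a per-host hash."""
    ranked = sorted(
        hosts,
        key=lambda host: hashlib.sha256(f"{chunk_id}:{host}".encode()).hexdigest(),
    )
    return ranked[:copies]


desktops = 100
spare_gb_each = 40
copies = 3  # assumed replication factor, not from the post

raw_tb = desktops * spare_gb_each / 1000   # 100 x 40 GB = 4.0 TB of raw space
usable_tb = raw_tb / copies                # what is left after replication

hosts = [f"desktop-{n:03d}" for n in range(desktops)]
placement = place_chunk("payroll.db#0", hosts, copies)
print(raw_tb, round(usable_tb, 2), placement)
```

The placement is deterministic, so any machine can work out where a chunk lives without a central lookup table, and losing one desktop only affects the chunks that ranked it in their top three.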
Reminds me of the old Steven Wright routine, where he says, "I have a full size map of the world at home...the scale says, "One mile equals one mile"..."
Trying to find non-commercial sites with information about a product you wish to purchase. It can be virtually impossible sometimes.
With the additional keywords data and structure the search will automatically result in pages about computer science search trees only because the two words data and structure most likely will not appear on pages about forests. I don't see why any clustering technology is necessary. Maybe there is a better example? I have an intuitive feeling that making the computer "understand" groups of related topics can be of importance, but I don't quite see how to integrate that feature into a search engine (once you've solved the classification problem).
"they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page."
I have yet to see a "hit" served up by google where it didn't have any words I searched for and it still be relevant. It's especially annoying when I search for exact phrases (such as an error message) and I get something completely different. It's a waste of time so far.
Those of us with sites that can handle it want Google to index us! Bring it on, Google! Make my server your little Google-bitch!
d x (wget google.com) x
it is the double "o" . what is up with that? yah oo, g oo gle , micr o s o ft, n o rt o n (symmantec) etc. what is in c o mm o n?
Google recently went public. If you've drunk sufficient kool-aid, this makes them the last best hope of the tech industry. Imagine the '90s Internet bubble all focused on one company.
Let's just hope Google doesn't decide to exploit this laser-like market focus by spinning off a host of baby Googles: then you'll be able to buy a GoogleBox PC running a browser-based GoogleOS and make phone calls over your Googlenet connection, using the Googlephone service, and you'll look up phone numbers using Whoogle (for any Google IP lawyers reading this, call me for a license on that last one).
I, for one, welcome our new over-Googlords.
Alternatively, you could learn how to search properly.
I wonder where they got these ideas?
The magic that makes Google tick http://www.zdnet.com.au/insight/software/0,39023769,39168647,00.htm
so if i post my url here, http://www.dapoker.com , i will get better search results? Hey guys, don't go to my website, am just wondering if google will find this url and link this with my website.
An "I'm Feeling Lucky" search means less time searching for web pages and more time looking at them.
from the "I'm Feeling LuckyTM" button. Guess they changed it.
peterrenshaw ~ Another Scrappy Startup
Yeah, that's right. I've forgotten how to search. Nothing to do with PageRank being useless.
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
Many things are much easier said than done. The techniques have existed forever, but dependable and accurate implementation on affordable hardware that can handle the high traffic of searches on large datasets with high reliability is much harder than just pointing to a few pages in a textbook that gives out theory.
A NYC lawyer blogs. http://www.chuangblog.com/
It's not a great example, but my mind seems to have gone temporarily blank of words that have many synonyms :(
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
That's right, blame the tools. It couldn't possibly be you. Tell me what you want to find and I'll give you a short tutorial on how to find it. Google's not an AI, you know.
Now that you've told everyone how it works, everyone will build one.
Perhaps they have to switch to cooler processors or make modifications to CPU heatsinks.
looks like someone's jealous of his maxed out karma, haha
I want authority. I think perhaps you've missed the point of my original post. There is so much uninformed yakking about things, mainly in blogs, that Google is no longer of any use to someone looking for even slightly non-trivial information. As an example of what I mean, the Titanic did not break apart on the surface. The idea that it did was about before the film was made but it rested on the testimony of a woman who was four at the time; no one else claimed to have seen the stern actually fall into the sea, with the large wave that would have produced.
Now, look for information on the web about the sinking. How long do you have to look before you find out that information? It is there, but it's buried deep and I doubt that you would find it if you didn't know it was there. And a search engine that only finds things you were expecting is not much use really, is it?
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
Well, your original post was rather concise, which made it tough to divine what was behind it. When it comes to authority, I think you've misunderstood what Google is. It's an index of the WWW, not an encyclopedia. Besides, encyclopedias, newspapers, TV news & documentaries etc. get things wrong too, sometimes spectacularly, and more often than you might think. To take the Titanic as an example, problems with the history of its sinking apparently go back to the original press coverage at the time, such as the way in which Hearst's media spun the story. Given that, it's hard to see how the problem you're describing has anything to do with Google, or even with the modern-day presence of blogs.
If you want ultimate authority, there are a number of gods I can introduce you to, although you have to agree to believe unquestioningly (have "faith") in the authority of whichever one you pick. I'll note that if you picked Google as your god in this sense, you wouldn't go as far wrong as with some of the other choices out there. Short of that, google for epistemology, and you'll find that the problem you're concerned about isn't going to be solved by any algorithm, ever.
I forgot about the tutorial I promised. It doesn't take checking more than a few sources to find significant differences in the account of the Titanic's sinking. That's not uncommon, and means you have to do some research.
It's quite possible, on something like a historical point of detail, that Googling casually won't find you any answers. However, it will find you many sources, which have references, which you can track down.
In this particular case, the information you're looking for is actually more like informed speculation & analysis, since the available eyewitness reports are unreliable and inconsistent. So, applying epistemological principles, you might say to yourself "how can I verify whether the few eyewitness reports about the breakup of the ship make sense?" Some answers to that are likely to be found in a technical analysis of the sinking, so you look for those.
You can also ask yourself questions like "why didn't the stern make a huge wave which would have swamped the lifeboats?" Inconsistencies in your working hypothesis are useful for drilling down towards a more accurate model of the information you're looking for. You don't need to know in advance what you're looking for, but you need to know how to recognize when you don't know something.
You test the information you find against your current model(s), and there can be give and take on both sides, i.e. you might use a model to provisionally reject certain information.
There's no predetermined way of knowing when you've reached a final conclusion. Ultimately, it comes down to how much work you want to put into it, how much information is available, etc.
When you arrive at a final conclusion, you might even find that it doesn't match any single information source out there. Who's right, you or they? It's difficult to say. If you were intent on answering that question, you'd need to examine the processes used to arrive at other accounts, if possible. A less rigorous process is likely to arrive at a less reliable answer.
As an example, when Brian Williams reported on NBC news the other night that the lead judge in Saddam Hussein's tribunal had been killed, they had apparently received the information from multiple US officials, who in turn had received the information from government sources in Baghdad. Turns out they were wrong, it wasn't the lead judge. But NBC news reported the information as "confirmed", apparently based on having spoken to multiple US government sources. Obviously, their grasp of these issues is rather limited -- even a superficial analysis would indicate that multiple sources in the US government might match merely because they all got their information from the same place. They made a mistake in reporting the identity of the victim as "confirmed", which they could have avoided if they had applied proper procedures, including asking how reliable their sources are, and whether their sources might be contaminated by common factors. If you don't apply that level of diligence to gathering information, you have to accept that the quality of your information will be lower.
The point I'm trying to make is that before Cameron's film version there was no issue with the story about the stern: it was simply not accepted by anyone who had studied the event, witness statements were not conflicting on this - only one four year old said it happened, everyone else didn't. Now, searching on Google does not tell you anything about that, it returns the modern day controversy which is wholly unfounded. The point being that all Google's much-vaunted powers, which is what the story is about, are of no help if it can not tell the difference between wittering idiots and authoritative studies when it ranks the pages. This makes the effort they're putting into searching seem rather misguided. In many cases returning the results ordered by date of indexing would be vastly more useful than whatever algorithm they really use, so why bother with it?
The big problem is that many people seem to think Google is a research tool and it just isn't. It's really good at confirming preconceived notions which are inherent in one's search terms but it is worse than useless for telling you what the important ideas or opinions are in many fields. Fields that have been touched by Hollywood are particularly badly mangled by the PageRank system, which ranks a page about a historical event like the Titanic sinking lower than a page about the film simply because people link to the latter.
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
Has anyone noticed a lot more forum and blog threads having higher rankings on Google since about early February? Perhaps it's just me, but the Index seems cluttered with these threads of late. Brad,