Google's Technology Explored
RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
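To make the quoted description concrete, here is a minimal sketch of routing a query across replicated index sets. It assumes each set holds one full copy of the index split over several shard servers, and that a query goes to the first set that is not busy. The names `IndexSet`, `route`, and the sample data are hypothetical and only illustrate the idea; this is not Google's actual code.

```python
from dataclasses import dataclass


@dataclass
class IndexSet:
    """One complete copy of the index, spread over several shard servers (hypothetical)."""
    name: str
    shards: list          # each shard is a dict mapping term -> set of document ids
    busy: bool = False

    def search(self, term):
        # Answering a query needs every shard in the set, since each holds part of the index.
        hits = set()
        for shard in self.shards:
            hits |= shard.get(term, set())
        return hits


def route(term, replicas):
    """Send the query to the first replica set that is not busy."""
    for replica in replicas:
        if not replica.busy:
            return replica.name, replica.search(term)
    raise RuntimeError("all index sets are busy")


# Two replicas of the same two-shard index; the first one is busy.
shards = [{"linux": {1, 2}}, {"linux": {7}, "google": {3}}]
replicas = [IndexSet("set-a", shards, busy=True), IndexSet("set-b", shards)]

name, hits = route("linux", replicas)
print(name, sorted(hits))
```

The same replication that provides the fail-safe also provides the extra throughput: adding another `IndexSet` to `replicas` gives `route` one more place to send a query when the others are busy.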
so that pages can match even if none of the words in your query actually appear on the page.
Even the pages in my search results that do contain my query terms often have nothing to do with what I am looking for. Isn't this just adding to the problem?
How about a "Did you mean?" option that suggests related topics instead of spelling corrections?
That's why projects like Wikipedia are so important, and so impressive.
Only a few years ago it could take forever to find any kind of decent information on some topics online or even in libraries. Today, I go to Wikipedia and I'm almost assured of a FAIRLY reliable source of information, as it's cross-checked by peers who have some kind of personal interest in the subject.
However, there's a downside.
Back when I was in school, researching a subject typically meant going through encyclopedia after encyclopedia, which wasn't a bad thing. I learned quite a bit by being FORCED to over-research topics. Today, I can generally straight-shoot to whatever I need to find, giving my brain a good set of blinders to everything else along the way.
And the obvious question:
Where are the patches?
Anybody know? This is not a GPL question, just an ethical one.
The word "cheap" is used four times in the C|Net article that describes Google's "secret of success" -- "buying relatively cheap machines", "cheap commodity PCs", "(Power) becomes a factor in running cheaper operations", "not just buying cheaper components".
They say being frugal is a virtue, and Google evidently has it. What is the lesson here? Holding down costs while being innovative never fails, I guess.
Sun and Fun
I don't know if this is what TFA was getting at, but on a Google cache page you may from time to time see the phrase "These terms only appear in links pointing to this page: ...".
For example, try searching for "miserable failure" on Google. The first result is George Bush's biography on www.whitehouse.gov.
However, the term "miserable failure" doesn't actually show up (yet) in the biography. But, pages that POINT to the biography do include those terms.
As a result, pages can match your search query even if none of the words in your query actually appear on the page.
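A rough sketch of how that can happen, assuming the indexer credits the words in a link's text to the page being linked to. The function names, URLs, and page texts below are made up for illustration, and this is not a claim about how Google actually builds its index.

```python
from collections import defaultdict


def build_index(pages, links):
    """pages: url -> page text; links: list of (source_url, anchor_text, target_url)."""
    index = defaultdict(set)
    # Index each page under the words that appear in its own text.
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    # Also credit the link text to the page being pointed at, not the linking page.
    for _source, anchor, target in links:
        for word in anchor.lower().split():
            index[word].add(target)
    return index


def search(index, query):
    """Return the urls that match every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical data: the biography never uses the phrase, but a link to it does.
pages = {
    "bio.example": "biography of the president",
    "blog.example": "my weblog about politics",
}
links = [("blog.example", "miserable failure", "bio.example")]
index = build_index(pages, links)

print(search(index, "miserable failure"))
```

With enough pages linking to the same target using the same words, the target can rank for a phrase it never contains.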
None of these computer science concepts are new; what is groundbreaking is that Google touches all aspects of computer science to solve one problem. Distributed databases, replicated filesystems, clustering, learning algorithms, job scheduling, map/reduce languages, etc. are not new. But they applied each of these sub-domains to 'searching' and 'lots of data'. Using old ideas in _new_ ways is groundbreaking. That's what everyone does (like Carmack and DOOM 3).
In my experience, you can add "don't want to pay for". Some of the places I have worked for aren't lazy or ignorant of the possibilities; they have made a deliberate decision to work cheap. They will accept the downtime from a quick and dirty design rather than pay for a better one. It's all in the numbers: how much will we lose if we are down?
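A back-of-the-envelope version of that calculation, as a sketch only. Every figure below is invented for illustration, and the model (yearly cost plus expected hours down times loss per hour) is an assumption, not something taken from the articles.

```python
def expected_downtime_loss(hours_down_per_year, loss_per_hour):
    """Money lost per year to outages."""
    return hours_down_per_year * loss_per_hour


def cheaper_option(quick, robust, loss_per_hour):
    """Each option is (name, yearly_cost, expected_hours_down_per_year)."""
    def total(option):
        _name, yearly_cost, hours_down = option
        return yearly_cost + expected_downtime_loss(hours_down, loss_per_hour)
    # Pick whichever design costs less once downtime is priced in.
    return min((quick, robust), key=total)[0]


# Hypothetical numbers: the cheap design is down ten times as often.
quick = ("quick and dirty", 10_000, 40)
robust = ("better design", 60_000, 4)

print(cheaper_option(quick, robust, loss_per_hour=500))
print(cheaper_option(quick, robust, loss_per_hour=5_000))
```

With these made-up figures the quick and dirty design wins when an hour of downtime costs 500, and the better design wins when it costs 5,000, which is the whole point: the decision flips on the cost of being down.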