Google's Technology Explored
RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
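To make the quoted description a bit more concrete, here is a rough Python sketch of the idea: each "set" holds one full copy of the index split across several shards, a query has to touch every shard in one set, and new queries go to whichever set is least busy. Every name here (`ReplicaSet`, `build_shards`, `pick_set`, `route`) is made up for illustration; this is a toy under those assumptions, not Google's actual design.

```python
from collections import defaultdict


class ReplicaSet:
    """One complete copy of the index, split across several shards (hypothetical)."""

    def __init__(self, shards):
        self.shards = shards   # list of {term: set of doc ids}
        self.in_flight = 0     # queries this set is currently serving

    def search(self, term):
        # Answering a query needs every shard in the set, since each
        # shard only holds part of the index.
        hits = set()
        for shard in self.shards:
            hits |= shard.get(term, set())
        return hits


def build_shards(docs, n_shards):
    """Partition {doc_id: text} across n_shards small inverted indexes."""
    shards = [defaultdict(set) for _ in range(n_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % n_shards]
        for word in text.lower().split():
            shard[word].add(doc_id)
    return shards


def pick_set(replica_sets):
    """Choose the least-busy complete set for the next query."""
    return min(replica_sets, key=lambda rs: rs.in_flight)


def route(replica_sets, term):
    """Route one query to a single replica set and return its hits."""
    target = pick_set(replica_sets)
    target.in_flight += 1
    try:
        return target.search(term.lower())
    finally:
        target.in_flight -= 1


docs = {0: "red hat linux", 1: "linux cluster", 2: "search index"}
replica_sets = [ReplicaSet(build_shards(docs, 2)) for _ in range(2)]
print(route(replica_sets, "linux"))
```

The replication does double duty in this sketch: losing one set still leaves a full index available, and a busy set is simply skipped by `pick_set`.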
so that pages can match even if none of the words in your query actually appear on the page.
Even the pages in my search results that do contain my query often have nothing to do with what I am looking for. Isn't this just adding to the problem?
How about a "Did you mean?" option that suggests related topics instead of spelling corrections?
I hate that. Don't you hate that? When you type in a search keyword, isn't it because you want that keyword to appear in the documents you find?
This "find tangentially related documents" feature will be fine so long as they make it optional and set it to be off by default. Otherwise, I don't want their idea of what pages I should be looking at polluting my results list.
I call "innovation for the sake of innovation".
--
What short sigs we have -
One hundred and twenty chars!
Too short for haiku.
That's why projects like wikipedia are so important, and so impressive.
Only a few years ago it could take forever to find any kind of decent information on some topics online or even in libraries. Today, I go to wiki and I'm almost assured to have a FAIRLY reliable source for information, as it's cross checked by peers who have some kind of a personal interest in the subject.
However, there's a downside.
Back when I was in school, researching a subject typically meant going through encyclopedia after encyclopedia, which wasn't a bad thing. I learned quite a bit by being FORCED to over-research topics. Today, I can generally straight-shoot to whatever I need to find, giving my brain a good set of blinders to everything else along the way.
and the obvious question:
where are the patches?
Anybody know? This is not a GPL question, just an ethical one.
Perl is a great language, and I love it, but that does not mean that you have to use it for everything.
while true; do wget www.google.com; done
seems better to me.
If the virus which used Google could not do it with tens of thousands of computers, it is not likely that /. can do it.
I prefer the "u" in honour as it seems to be missing these days.
;i was wondering the same thing. do modifications of this sort fall under the GPL? if so, isn't google required to share them with the public, or are "patches" not considered "modifications" to the software?
;treehead
"If any part Linux was stolen, then Windows was the biggest heist in history."
Interesting addendum to that question - Is Google infringing upon copyrighted information by caching EVERY page they run across? That seems like pulling massive amounts of copyrighted Java code or design code or images or etc. into their server for 'personal' use...? Does this break any laws?
My little site.
The word "cheap" is used 4 times in the C|Net article that describes Google's "secret of success": "buying relatively cheap machines", "cheap commodity PCs", "(Power) becomes a factor in running cheaper operations", "not just buying cheaper components".
They say being frugal is a virtue, which Google has, evidently. What is the lesson here? Holding down the cost and being innovative never fail. I guess.
Sun and Fun
Any company could have that kind of uptime - with the right amount of money....
Well, I think you haven't studied enough if you think this. When you start to realize we actually know very little, then you're getting somewhere.
Why is there so much "google" on slashdot? I don't get it. Are they all the industry has to offer these days?
Google == great, but not everything.
I don't know if this is what TFA was getting at, but in a google cache page you may from time to time see the phrase "These terms only appear in links pointing to this page: ...".
For example, try searching for "miserable failure" on Google. The first result is George Bush's biography on www.whitehouse.gov.
However, the term "miserable failure" doesn't actually show up (yet) in the biography. But, pages that POINT to the biography do include those terms.
As a result, pages can match your search query even if none of the words in your query actually appear on the page.
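Here's a tiny Python sketch of how that can work: the indexer credits the text of each link to the page the link points at, so the target page becomes findable under words it never contains. The URLs, the `build_index` and `search` names, and the sample data are all invented for illustration; I'm not claiming this is how Google actually stores it.

```python
from collections import defaultdict


def build_index(pages, links):
    """Build a word -> set-of-URLs index.

    pages: {url: body text}
    links: list of (source_url, target_url, anchor_text) tuples
    """
    index = defaultdict(set)
    # Index each page under the words in its own body.
    for url, body in pages.items():
        for word in body.lower().split():
            index[word].add(url)
    # Also index each link's anchor text under the page it points TO.
    for _source, target, anchor in links:
        for word in anchor.lower().split():
            index[word].add(target)
    return index


def search(index, query):
    """Return the URLs indexed under every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical data: the biography never says "miserable" or "failure".
pages = {
    "bio.example": "biography of the president",
    "blog.example": "my political blog",
}
links = [("blog.example", "bio.example", "miserable failure")]

index = build_index(pages, links)
print(search(index, "miserable failure"))  # {'bio.example'}
```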
I think the only reason other companies don't do as well as Google is either laziness or ignorance of some basic things and some advanced things. An index is not the most groundbreaking thing in the world. Job delegation and breaking up work are not that groundbreaking either. Clustering has been around in concept since forever. Now I ask you, the public, not just you iibbmm, how many applications have you written that use these concepts? Most biz concepts are very simple. They don't try to implement vertex cover or solve NP-complete problems like 3-CNF-SAT.
Not to downplay Google. Google did a great job of implementing a lot of these things: indexing, job delegation and maybe a good bureaucracy. Larger companies are either lazy, ignorant or simply don't have to. I've worked for a few companies that "don't have to", but lord, if those places weren't so ignorant or lazy, they could be powerhouses just by what they could do...
-
ping -f 255.255.255.255 # if only
My wife is studying Library Information Science. In one class, she studied information retrieval. Here's what's interesting: It appears that although Google has much success with determining relevance by using PageRank, it's still very literal about the words you pick. Although it appears to do stemming (i.e. 'runner' matches 'running'), it doesn't do anything about synonyms. Now, here, I'll point out that the textbook for my wife's class was written in like 1995. In the SECOND CHAPTER, they talk about basic query techniques that make use of patterns in documents and AUTOMATICALLY derive what words are synonyms or in some way semantically related. These are long-solved problems. Some search engines employ human-generated lists of synonyms, and there are whole databases you can download that contain semantic networks.
So, WHY, I ask, is google only now getting around to using these techniques?
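For the curious, here is roughly what "automatically derive related words from patterns in documents" can look like, as a small Python sketch. It gives each word a vector of the other words it shares documents with, then compares those vectors with cosine similarity, so two words that never appear together can still score as related. This is one simple co-occurrence approach I picked for illustration, not necessarily what the textbook describes or what any search engine runs; the function names and sample documents are made up.

```python
import math
from collections import Counter, defaultdict


def context_vectors(docs):
    """For each word, count the other words that share a document with it."""
    vectors = defaultdict(Counter)
    for text in docs:
        words = text.lower().split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def related(word, vectors, top_n=3):
    """Return the top_n words whose contexts look most like word's."""
    target = vectors.get(word, Counter())
    scores = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(reverse=True)
    return [other for _score, other in scores[:top_n]]


# "car" and "automobile" never share a document, but share their contexts.
docs = [
    "car engine wheels",
    "automobile engine wheels",
    "banana fruit yellow",
]
vectors = context_vectors(docs)
print(related("car", vectors, top_n=1))  # ['automobile']
```

A real system would need far more text and some weighting of common words, but the point stands that no human-written synonym list is involved.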
> FAIRLY reliable source for information
That's the problem. It isn't reliable. For example, one local journalist got burned badly by using that piece of crap to do research during the election.
Correction: It's "often" reliable.
You want a better source?
Sorry, you won't find one. Not a single one at least.
What you're speaking of is not a problem with Wikipedia; it's a problem with a journalist who doesn't know how to properly research a subject. If a journalist relies on any single source to be perfectly correct, well, what can I say... We've been over this exact thing multiple times before on Slashdot, and the most recent article posted here that touched the subject was about a 12 year old finding actual undeniable flaws in Encyclopedia Britannica. The only difference is that, unlike on Wikipedia, those flaws can survive on a damn bookshelf for decades. Or at a minimum a year or so. You take risks in both cases; with Wikipedia it's due to the fluctuating medium, in other cases it may instead be outdated information. If there's anything a researcher has had hammered into his head during education, it's that theories and knowledge are rarely "final" or "ultimate". And here lies the disadvantage, which is generally greater in sources other than Wikipedia than in Wikipedia itself, due to how they're revised.
Beware: In C++, your friends can see your privates!
It's not a great example, but my mind seems to have gone temporarily blank of words that have many synonyms :(
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment