Slashdot Mirror


Google's Technology Explored

RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
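
The routing described in the quote can be sketched in a few lines of Python. This is only an illustration of the idea, not Google's implementation: the names `Shard`, `ReplicaSet`, `route` and `build_replica_set` are hypothetical, and so is the choice to partition documents by ID. It shows one full index spread over several index servers, the whole set replicated, and a query going to the first set that is not busy.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One index server: maps a word to the IDs of documents containing it."""
    postings: dict[str, set[int]] = field(default_factory=dict)

    def lookup(self, word: str) -> set[int]:
        return self.postings.get(word, set())


@dataclass
class ReplicaSet:
    """One complete copy of the index, spread across several shards."""
    shards: list[Shard]
    busy: bool = False

    def search(self, word: str) -> set[int]:
        # A query has to touch every shard in the set to see the full index.
        hits: set[int] = set()
        for shard in self.shards:
            hits |= shard.lookup(word)
        return hits


def route(replica_sets: list[ReplicaSet], word: str) -> set[int]:
    """Send the query to the first replica set that is not busy."""
    for replica_set in replica_sets:
        if not replica_set.busy:
            return replica_set.search(word)
    # Every set is busy: fall back to the first (a real system would queue).
    return replica_sets[0].search(word)


def build_replica_set(docs: dict[int, str], num_shards: int) -> ReplicaSet:
    """Partition documents across shards by document ID (an assumed scheme)."""
    shards = [Shard() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]
        for word in text.lower().split():
            shard.postings.setdefault(word, set()).add(doc_id)
    return ReplicaSet(shards)


docs = {1: "cheap commodity pcs", 2: "redundant linux cluster", 3: "cheap linux boxes"}
sets = [build_replica_set(docs, 2), build_replica_set(docs, 2)]
sets[0].busy = True  # pretend the first copy of the index is saturated
result = route(sets, "linux")  # answered by the second copy
```

Because each replica set holds the complete index, any one of them can answer any query, which is why replication buys throughput as well as fail-safety.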

25 of 294 comments

  1. More useless search results? by SerialEx13 · · Score: 4, Insightful

    so that pages can match even if none of the words in your query actually appear on the page.

    Even pages that come up in my search results now that contain my query don't even have anything to do with what I am looking for. Isn't this just adding to the problem?

    How about a "Did you mean?" option that compares against related topics instead of spelling?

  2. Too celver for their own good? by Mirk · · Score: 2, Insightful
    From the article summary:
    They're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page.

    I hate that. Don't you hate that? When you type in a search keyword, isn't it because you want that keyword to appear in the documents you find?

    This "find tangentially related documents" feature will be fine so long as they make it optional and set it to be off by default. Otherwise, I don't want their idea of what pages I should be looking at polluting my results list.

    I call "innovation for the sake of innovation".

    --

    --
    What short sigs we have -
    One hundred and twenty chars!
    Too short for haiku.
    1. Re:Too celver for their own good? by mythosaz · · Score: 2, Insightful
      The entire point of a search engine like Google is that they do give you their idea of what pages your query should return.

      That's how it works...

  3. Re:Also Amazing: How much we miss by iibbmm · · Score: 5, Insightful

    That's why projects like wikipedia are so important, and so impressive.

    Only a few years ago it could take forever to find any kind of decent information on some topics online or even in libraries. Today, I go to wiki and I'm almost assured to have a FAIRLY reliable source for information, as it's cross checked by peers who have some kind of a personal interest in the subject.

    However, there's a downside.

    Back when I was in school, researching a subject typically meant going through encyclopedia after encyclopedia, which wasn't a bad thing. I learned quite a bit by being FORCED to over-research topics. Today, I can generally straight-shoot to whatever I need to find, giving my brain a good set of blinders to everything else along the way.

  4. kernel patches? by alphan · · Score: 4, Insightful
    Moreover, Google has created its own patches for things that haven't been fixed in the original kernel.

    and the obvious question:

    where are the patches?

    Anybody know? This is not a GPL question, just an ethical one.

    1. Re:kernel patches? by Anonymous Coward · · Score: 1, Insightful

      It _might_ be a GPL question. It depends on whether Google is distributing their patches in their corporate intranet search appliances or applications.

    2. Re:kernel patches? by DeKO · · Score: 2, Insightful

      If you consider the "freedom" involved in Free Software, you'll notice that they use their modified software for their own purposes. They are free to use the software in any way, they are free to modify it. And they aren't distributing it, so they aren't distributing the source code of their changes. I don't see any problem with it.

    3. Re:kernel patches? by The Bungi · · Score: 3, Insightful
      where are the patches?

      They'll tell you as soon as you point out where or how they are distributing them (yes, that's why it wasn't a GPL question).

      Why should Google be "ethical"? Likely these modifications are part of their IP trove, which keeps them ahead of the (already heated up) competition.

      If you don't like the way someone uses the software you're giving away then perhaps you shouldn't give it away, or maybe it's just that the license is flawed. It's dumb to expect people who run billion-dollar publicly traded corporations to be "ethical". Mom and pop shops are "ethical".

      The whole concept of "free software" as encoded by the GPL is increasingly being outmoded by things like server-bound distributed applications (see that clumsy Affero GPL) and companies like Google which have strategic interests in the stuff. It's called progress.

    4. Re:kernel patches? by AsimovBesterClarke · · Score: 2, Insightful

      > and the obvious question:
      >
      > where are the patches?

      No. The obvious question is "WHAT are those patches?" Followed by "where are the patches?"

      --
      Ads are broken.
    5. Re:kernel patches? by digidave · · Score: 2, Insightful

      Stop trying to make it a semantic argument. Distributing according to the GPL is not the same as patching your own systems and I'm sure you know that.

      The only question is whether or not Google is selling these patches as part of their appliances.

      --
      The global economy is a great thing until you feel it locally.
  5. Re:/. effect by Anonymous Coward · · Score: 2, Insightful

    Perl is a great language, and I love it, but that does not mean that you have to use it for everything.

    while true; do wget www.google.com; done

    seems better to me.

  6. considering.... by WindBourne · · Score: 2, Insightful

    that the virus which used Google could not do it with tens of thousands of computers, it is not likely that /. can do it.

    --
    I prefer the "u" in honour as it seems to be missing these days.
  7. Re:Question... by TreeHead · · Score: 2, Insightful

    ;i was wondering the same thing. do modifications of this sort fall under the GPL? if so, isn't google required to share them with the public, or are "patches" not considered "modifications" to the software?

    ;treehead

    --

    "If any part of Linux was stolen, then Windows was the biggest heist in history."

  8. Re:Question -- Is any of this considered P2P? by MrAnnoyanceToYou · · Score: 2, Insightful

    Interesting addendum to that question - Is Google infringing upon copyrighted information by caching EVERY page they run across? That seems like pulling massive amounts of copyrighted Java code or design code or images or etc. into their server for 'personal' use...? Does this break any laws?

  9. Frugal Google by Sundroid · · Score: 3, Insightful

    The word "cheap" is used four times in the C|Net article that describes Google's "secret of success": "buying relatively cheap machines", "cheap commodity PCs", "(Power) becomes a factor in running cheaper operations", "not just buying cheaper components".

    They say being frugal is a virtue, which Google has, evidently. What is the lesson here? Holding down the cost and being innovative never fail. I guess.

  10. Re:Meltdown? by Anonymous Coward · · Score: 1, Insightful

    Any company could have that kind of uptime - with the right amount of money....

  11. Re:Also Amazing: How much we miss by natedubbya · · Score: 2, Insightful
    The quantity and breadth of human knowledge is breathtaking, no?

    Well, I think you haven't studied enough if you think this. When you start to realize we actually know very little, then you're getting somewhere.

  12. Nothing else innovating but google? by Anonymous Coward · · Score: 1, Insightful

    Why is there so much "google" on slashdot? I don't get it. Are they all the industry has to offer these days?

    Google == great, but not everything.

  13. Re:no AND needed by M00TP01NT · · Score: 3, Insightful

    I don't know if this is what TFA was getting at, but in a google cache page you may from time to time see the phrase "These terms only appear in links pointing to this page: ...".

    For example, try searching for "miserable failure" on Google. The first result is George Bush's biography on www.whitehouse.gov.

    However, the term "miserable failure" doesn't actually show up (yet) in the biography. But, pages that POINT to the biography do include those terms.

    As a result, pages can match your search query even if none of the words in your query actually appear on the page.
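
A minimal sketch of how that can happen, assuming a toy inverted index that credits the words in a link's anchor text to the page the link points at. The function names and the `example.org` URLs are made up for illustration; this is not a description of how Google actually builds its index.

```python
from collections import defaultdict


def build_index(pages, links):
    """pages: url -> body text. links: (source_url, anchor_text, target_url) tuples."""
    index = defaultdict(set)  # word -> set of URLs that match it
    for url, body in pages.items():
        for word in body.lower().split():
            index[word].add(url)
    # Credit each link's anchor text to the page being linked TO.
    for _source, anchor, target in links:
        for word in anchor.lower().split():
            index[word].add(target)
    return index


def search(index, query):
    """Return the URLs that match every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


pages = {
    "example.org/bio": "official biography of the president",
    "example.org/blog": "a post about politics",
}
links = [("example.org/blog", "miserable failure", "example.org/bio")]
index = build_index(pages, links)
hits = search(index, "miserable failure")  # the bio page, via anchor text only
```

The biography matches even though neither query word appears in its own text, because the only occurrence of those words is in a link pointing to it.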

  14. Laziness, ignorance or by sporty · · Score: 2, Insightful

    I think the only reason other companies don't do as well as google is either laziness or ignorance of some basic things and some advanced things. An index is not the most ground breaking thing in the world. Job delegation and breaking up work is not that ground breaking either. Clustering has been around in concept since forever. Now I ask you, the public, not just you iibbmm, how many applications have you written that use these concepts? Most biz concepts are very simple. They don't try to solve vertex cover or NP-complete problems like 3-SAT.

    Not to downplay google. Google did a great job of implementing a lot of these things: indexing, job delegation and maybe a good bureaucracy. Larger companies either are lazy, ignorant or simply don't have to. I've worked for a few companies that "don't have to", but lord, if those places weren't so ignorant or lazy, they could be powerhouses just by what they could do...

    --

    -
    ping -f 255.255.255.255 # if only

    1. Re:Laziness, ignorance or by Kashif Shaikh · · Score: 4, Insightful

      None of the concepts of computer science are new, but what is ground breaking is Google touching all aspects of computer science to solve a problem. Distributed Databases, Replicated Filesystems, Clustering, Learning algorithms, job scheduling, map/reduce languages, etc. are not new. But they applied each of these sub-domains to 'searching' and 'lots of data'. Using old ideas in _new_ ways is ground breaking. That's what everyone does (like Carmack and DOOM3).

    2. Re:Laziness, ignorance or by akirchhoff · · Score: 3, Insightful

      In my experience, you can add "don't want to pay for". Some of the places I have worked for aren't lazy or ignorant of the possibilities; they have made a deliberate decision to work cheap. They will accept the downtime from a quick and dirty design, rather than pay for better design. It's all in the numbers: how much will we lose if we are down?

  15. Google and its 1980's search literal-mindedness by Theovon · · Score: 2, Insightful

    My wife is studying Library Information Science. In one class, she studied information retrieval. Here's what's interesting: It appears that although Google has much success with determining relevance by using PageRank, it's still very literal about the words you pick. Although it appears to do stemming (i.e. 'runner' matches 'running'), it doesn't do anything about synonyms. Now, here, I'll point out that the textbook for my wife's class was written in like 1995. In the SECOND CHAPTER, they talk about basic query techniques that make use of patterns in documents and AUTOMATICALLY derive what words are synonyms or in some way semantically related. These are long-solved problems. Some search engines employ human-generated lists of synonyms, and there are whole databases you can download that contain semantic networks.

    So, WHY, I ask, is google only now getting around to using these techniques?
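
For a rough idea of what "automatically derive related words from patterns in documents" can look like, here is a deliberately crude sketch. It is not the textbook's method or Google's. It assumes a simple stand-in technique: two words count as related when the sets of documents they appear in overlap heavily (Jaccard similarity). The function names, the 0.5 threshold and the four toy documents are all invented for the example.

```python
from collections import defaultdict


def doc_sets(docs):
    """Map each word to the set of document indices it appears in."""
    where = defaultdict(set)
    for i, text in enumerate(docs):
        for word in set(text.lower().split()):
            where[word].add(i)
    return where


def related(where, word, threshold=0.5):
    """Words whose document sets overlap with `word`'s (Jaccard >= threshold)."""
    base = where.get(word, set())
    if not base:
        return {}
    scores = {}
    for other, other_docs in where.items():
        if other == word:
            continue
        score = len(base & other_docs) / len(base | other_docs)
        if score >= threshold:
            scores[other] = score
    return scores


def expand(where, query, threshold=0.5):
    """Append related words to the query so non-literal matches become possible."""
    words = query.lower().split()
    extra = []
    for word in words:
        for candidate in sorted(related(where, word, threshold)):
            if candidate not in words and candidate not in extra:
                extra.append(candidate)
    return words + extra


docs = [
    "car automobile sales",
    "car automobile repair",
    "automobile car insurance",
    "pasta recipe",
]
where = doc_sets(docs)
expanded = expand(where, "car repair")
```

On these toy documents "car" and "automobile" always appear together, so a query for "car repair" is widened to include "automobile". Real systems use far better statistics, but the principle of mining the corpus itself for related terms is the same.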

  16. Re:Also Amazing: How much we miss by Jugalator · · Score: 2, Insightful

    > FAIRLY reliable source for information

    That's the problem. It isn't reliable. For example, one local journalist got burned badly by using that piece of crap to do research during the election.


    Correction: It's "often" reliable.

    You want a better source?

    Sorry, you won't find one. Not a single one at least.

    What you're speaking of is not a problem with Wikipedia, that's a problem with a journalist who doesn't know how to properly research a subject. If a journalist relies on any single source to be perfectly correct, well what can I say... We've been over this exact thing multiple times before on Slashdot, and the most recent article posted here that touched the subject was about a 12 year old finding actual undeniable flaws in Encyclopedia Britannica. The only difference here is that, as opposed to Wikipedia, those flaws can survive on a damn bookshelf for decades. Or at a minimum a year or so. You take risks in both cases; with Wikipedia it's due to the fluctuating medium, in other cases it may instead be outdated information. If there's anything a researcher has had hammered into his head during education, it's that theories and knowledge are rarely "final" or "ultimate". And here lies the disadvantage, which is generally greater in sources other than Wikipedia than in Wikipedia itself, due to how they're revised.

    --
    Beware: In C++, your friends can see your privates!
  17. Re:Google and its 1980's search literal-mindednes by shish · · Score: 2, Insightful
    *cough*

    It's not a great example, but my mind seems to have gone temporarily blank of words that have many synonyms :(

    --
    I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment