Slashdot Mirror


Google's Technology Explored

RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
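
The routing described in the quote can be sketched roughly as follows. This is only a toy model of the idea (each replica set holds one full index split into shards, and a query goes to the first set that is not busy), not Google's actual code; the names `ReplicaSet`, `route`, and the sample shard data are assumptions made up for illustration.

```python
class ReplicaSet:
    """One complete copy of the index, split into shards (hypothetical model)."""

    def __init__(self, name, shards):
        self.name = name
        self.shards = shards  # list of dicts mapping term -> set of doc ids
        self.busy = False

    def search(self, term):
        # Scatter the query to every shard, then gather and merge the hits.
        hits = set()
        for shard in self.shards:
            hits |= shard.get(term, set())
        return sorted(hits)


def route(replica_sets, term):
    """Send the query to the first replica set that is not busy."""
    for replica_set in replica_sets:
        if not replica_set.busy:
            return replica_set.name, replica_set.search(term)
    raise RuntimeError("all replica sets are busy")


# Two identical replica sets; the first one is marked busy.
shards = [{"linux": {1, 4}}, {"linux": {7}, "redhat": {8}}]
set_a = ReplicaSet("set-a", shards)
set_b = ReplicaSet("set-b", shards)
set_a.busy = True

result = route([set_a, set_b], "linux")
```

Because every set can answer any query on its own, the same replication that covers a failed set also spreads load across sets.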

27 of 294 comments

  1. Truly Amazing. by iibbmm · · Score: 5, Interesting

    It really is amazing to think of the amount of information and data that we can access so quickly these days. When I stop and think about what my little search query goes through to bring me an almost instant response, it almost seems impossible. Of course the search engine side of this is only one example, but it's a nifty insight into how powerful our infrastructure is these days. Bravo, mankind.

  2. Meltdown? by Ironsides · · Score: 3, Interesting

    Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.

    Gee.. I wish our /.ing could do this. On the other hand, they have a level of redundancy and up time many businesses would kill for.

    --
    Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
  3. Also Amazing: How much we miss by Ieshan · · Score: 5, Interesting

    It's also amazing how much of the general knowledge of the world we *can't* access, because it's unconnected or unpublished.

    Just think about how vast and extensive Google's search is, and then think about how little of the World's knowledge and creative achievement it actually can access.

    The quantity and breadth of human knowledge is breathtaking, no?

    1. Re:Also Amazing: How much we miss by Skim123 · · Score: 2, Interesting
      Also with computers there's the whole cut and paste thing... at least with a printed encyclopedia you had to read the content when writing your report.

      Technology has the ability to improve everyone's collective IQ, but also has the ability to dumb down the populace. Kind of like TV. I remember tutoring an elementary student when I was a high school student back in '95 or so, and he couldn't do simple math (addition, subtraction, etc.) without his calculator. Sad...

      --

      I could not justify my existence if I were a turkey farmer. Would I terminate myself? Undoubtably, yes.

    2. Re:Also Amazing: How much we miss by jon787 · · Score: 2, Interesting

      Not only that, but all the information we index and then can't retrieve!

      "We have an embarrassment of riches in that we're able to store more than we can access. Capacities continue to double each year, while access times are improving at 10 percent per year. So, we have a vastly larger storage pool, with a relatively narrow pipeline into it." -- Jim Gray, Microsoft Research.

      --
      X(7): A program for managing terminal windows. See also screen(1).
    3. Re:Also Amazing: How much we miss by Kazoo+the+Clown · · Score: 2, Interesting

      I think it might be pretty amazing to find out what we can't easily access, even that which is published on the net. A simple example: you can't differentiate "net" from ".net" on Google, and net is an extremely common word, so it is next to useless as a qualifier if you're searching for info on the ".net" equivalent to anything common. Or try searching for the smiley face: ":-)". While those may be trivial and uninteresting specific examples, they illustrate at least one area where "you can't find it through Google". There are entire categories of things you can't find on Google, sometimes not because they're not indexed at all, but because you find too much and the needed qualifier isn't alphabetic.

      Some areas have gotten better, a search for "furniture polish" does return different results than "polish furniture" (even when both are unquoted in the search), and I seem to remember having gotten stuck on one like that before. Quotes don't always do the trick because sometimes you don't expect the words to be near each other on the desired pages.

      Certainly we've come a long way, but it still can, and should, get even better.

  4. Re:/. effect by SmokeHalo · · Score: 5, Interesting

    It's been tried. From TFA:

    One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.

    --
    I'm not good in groups. It's difficult to work in a group when you're omnipotent. - Q
  5. no AND needed by tehshen · · Score: 4, Interesting

    From the summary:

    they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page.

    From the help guide:

    By default, Google only returns pages that include all of your search terms.

    Which of these is correct? If it's the summary, is there any way to turn this behaviour off? I find it immensely annoying.

    --
    Guy asked me for a quarter for a cup of coffee. So I bit him.
  6. Video about some of the backend stuff by otisg · · Score: 5, Interesting

    Here it is, from one of the Google guys:
    Google: A Behind-the-Scenes Look.

    --
    Simpy
  7. Question... by kryogen1x · · Score: 4, Interesting
    Moreover, Google has created its own patches for things that haven't been fixed in the original kernel.

    Do they share these patches with everyone else?

    1. Re:Question... by lgw · · Score: 2, Interesting

      Sure, but what are the bounds of "internal distribution" when a maze of subcontractors and wholly-owned subsidiaries is involved?

      --
      Socialism: a lie told by totalitarians and believed by fools.
  8. "The text you entered was not found." by Doc+Ruby · · Score: 4, Interesting

    " pages can match even if none of the words in your query actually appear on the page"

    The main flaw I've found in Google's results has been when it returns pages without one of my query words, which doesn't respond to the sense of my query. Sometimes it's changed page content at the same URL, so I go back and get the "cached" page, if it exists. The cached pages reveal in their headings whether the page matched only because the query word was found only in another page linking to the returned page. I'd like their immediate results to show that distinction, and to have links in the results to click around those pages related by my complete query. The current click/back/"cache" combinations are frustratingly disconnected, conflicting with Google's otherwise smooth immediacy.

    --
    make install -not war

  9. Question -- Is any of this considered P2P? by Didion+Sprague · · Score: 1, Interesting

    Question -- and this may be a dumb one, but I'm going to ask it anyway:

    How much of what Google is doing -- the clustering, the redundancy, the sub-categorization -- how much of this (if any) could be described -- could fit under the mantle of "Peer-to-Peer"? Is anything that Google is doing here remotely considered P2P? (Even if the P2P is what's going on on their own, in-house servers?)

    Obviously, I ask this because of the upcoming Supreme Court case. And I ask because it struck me as I read the article that what Google is doing *seems* to be breaking down complex tasks and simplifying them so that they work across the network -- their network, your network -- and I wonder if this is (in theory?) what Peer-to-Peer is doing?

    (I'm thinking, too, of the Google concept of "shards" and how their data is distributed.)

  10. Re:Whats really impressive by Anonymous Coward · · Score: 1, Interesting

    Uh? Google cache is run by bbernal.com, not Google. This is a little better: http://64.233.161.104/search?q=cache:64.233.161.104 but still not surprising if you think about it for a while.

  11. Obligatory link to Google research paper by Anonymous Coward · · Score: 1, Interesting
  12. Re:Yeah, I noticed that by generic-man · · Score: 2, Interesting

    Try doing a search for a Macintosh software product. Even though "Mac OS X" was not one of your search terms, Google boldfaces it as though it were!

    I can't reproduce this with another term. I wonder whether this was a manual fix by Google programmers.

    --
    For more information, click here.
  13. Re:Impressive technology but the algorithms aren't by TheAwfulTruth · · Score: 2, Interesting

    Heh, well they could NEVER do that :)

    Here's another great idea you inspired that they could also never do (being a commercial company themselves and all).

    When I am searching I virtually always want to do one of two distinct things:

    1) Search only commercial sites for a product to purchase.

    2) Search everything but commercial sites for information.

    There really should be a "$" flag that you could add (or at least a "!$" flag) to control whether you see commercial or non-commercial sites in the results list.

    --
    Contrary to popular belief, coding is not all free blow-jobs and beer. Those things cost MONEY!
  14. Re:interesting by InfiniteWisdom · · Score: 2, Interesting

    What's interesting is that the notice "Google is not affiliated with the authors of this page nor responsible for its content." goes away when you look at the cache of Google.com! That's a change from the last time I looked at Google's cache of Google a couple of years or so ago.

  15. hardware by r00t · · Score: 1, Interesting

    Google really slaps together a pile of junk.
    Parts fail left and right, and nobody bothers
    to fix them. The software hides all this from
    the users.

    Google even checksums the data, on the assumption
    that it is frequently getting corrupted by all the
    junk hardware they buy.
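
    The checksum idea can be illustrated with a small sketch. This is not Google's scheme; it only assumes a CRC32 stored next to each block (via the standard `zlib.crc32`) and a list of replicas to fall back on. The helper names `store`, `verify`, and `read_from_replicas` are made up for the example.

```python
import zlib


def store(payload):
    """Return the payload together with its CRC32 checksum."""
    return payload, zlib.crc32(payload)


def verify(payload, checksum):
    """True if the payload still matches the checksum recorded at write time."""
    return zlib.crc32(payload) == checksum


def read_from_replicas(replicas):
    """Return the first copy whose checksum verifies, skipping corrupted ones."""
    for payload, checksum in replicas:
        if verify(payload, checksum):
            return payload
    raise IOError("every replica failed its checksum")


good = store(b"index block")
# Simulate a copy that was corrupted on flaky hardware after it was written.
corrupted = (b"index blocc", good[1])

recovered = read_from_replicas([corrupted, good])
```

    A bad copy is simply skipped in favour of the next replica, which is how software can hide unreliable hardware from users.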

  16. Re:kernel patches? by rk · · Score: 2, Interesting

    On their own servers, then they're obeying the rules.

    The question is: Do they use these patches on the search appliances they sell, and does that count as "distribution"? I honestly don't know the answer to that question, and I'd like to think Google has sharp legal advisors to go with their sharp technical people.

  17. Re:Oops by ahem · · Score: 3, Interesting
    The actual quote from the article that I saw was:

    The company also is applying machine learning to its system to give better results. Theoretically, he said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cuisine" is a good match even though it contains none of the query words.

    FYI.

    --
    Not A Sig
  18. Re:define: cheap machines by canadiangoose · · Score: 5, Interesting
    I read somewhere that early Google datacentres were built by filling their racks with plywood shelves, then filling each shelf with one power supply running four motherboards each with one HDD. They didn't even use cases. This allowed them to build massively dense datacentres very cheaply. At one point they decided it wasn't worth it to replace dead hardware, so they started placing the racks too close together to be accessible. Why dig through and replace things when you can just keep adding more?

    Anyhow, the article mentioned that in these early datacentres they experienced something like a 25% hardware failure rate, but that it didn't matter because the software worked around it and the hardware was cheap.

    Here's a link to the page where I read all this neat stuff. It's probably mostly about the same stuff as the article we've all just slashdotted, but I won't be able to tell for a while....

    --
    Never eat more than you can lift -- Miss Piggy
  19. "we can't crawl as fast as we would like" by SnprBoB86 · · Score: 3, Interesting

    Why not enhance the robots.txt format to include a max crawl rate variable? Let the webmaster specify how often a robot is allowed to crawl a page.

    --
    http://brandonbloom.name
    1. Re:"we can't crawl as fast as we would like" by nokilli · · Score: 2, Interesting
      That appears to have been done.

      Take a look at Slashdot's robots.txt. First I've seen of the crawl-delay instruction.

      (and isn't it interesting how Google, MSN, and Yahoo have access to content on /. that all the other search engines are prohibited from crawling?)
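
      For anyone curious how a crawler might read that directive, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are invented for illustration (this is not Slashdot's actual file), and Crawl-delay is a non-standard extension that not every crawler honours.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all robots to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds to wait between requests, or None if no delay is specified.
delay = parser.crawl_delay("ExampleBot")

# Whether the robot may fetch a given URL at all.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/x")
```

      A polite crawler would sleep for `delay` seconds between fetches to the same host and skip any URL for which `can_fetch` returns False.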

  20. Re:Oops by Daedala · · Score: 2, Interesting

    Hmm. It must have been corrected; I did a direct copy/paste for my quote.

    --
    What I say does not represent the views of my employers, my friends, my cats, or myself.
  21. Re:Whats really impressive by lgw · · Score: 3, Interesting

    I've done more studying in that area than most. There has been a lot of over-reacting to paradoxes such as this. Godel's Incompleteness theorem is only narrowly interesting: as soon as you start talking about physical things, these paradoxes are much less important.

    A set which contains all sets which do not contain themselves may be a conundrum, but a catalog that lists all catalogs that do not list themselves is merely impossible (trivially impossible, in fact). There are plenty of things that can be described in English that aren't possible things, and most of them aren't very interesting.

    The important consequence of Godel's Theorem to physical things was that mathematics is not a completely accurate model of physical objects. One physical object plus one physical object equals two physical objects, but not every equation describes the physically possible (OK, it was already known that this was the case, but Godel showed it was the case more often than expected).

    --
    Socialism: a lie told by totalitarians and believed by fools.
  22. Non-matching search results... by statemachine · · Score: 2, Interesting

    "they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page."

    I have yet to see a "hit" served up by Google that didn't contain any of the words I searched for and was still relevant. It's especially annoying when I search for exact phrases (such as an error message) and I get something completely different. It's a waste of time so far.