Slashdot Mirror


The Man Behind Google's Ranking Algorithm

nbauman writes "New York Times interview with Amit Singhal, who is in charge of Google's ranking algorithm. They use 200 "signals" and "classifiers," of which PageRank is only one. "Freshness" defines how many recently changed pages appear in a result. They assumed old pages were better, but when they first introduced Google Finance, the algorithm couldn't find it because it was too new. Some topics are "hot". "When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds," said Singhal. Classifiers infer information about the type of search, whether it is a product to buy, a place, company or person. One classifier identifies people who aren't famous. Another identifies brand names. A final check encourages "diversity" in the results, for example, a manufacturer's page, a blog review, and a comparison shopping site."

39 of 115 comments

  1. Hrm, and all this time I thought it was... by Anonymous Coward · · Score: 4, Funny

    Pigeon Rank?

    1. Re:Hrm, and all this time I thought it was... by UltraAyla · · Score: 4, Informative
  2. Amit Singhal ... by WrongSizeGlass · · Score: 5, Informative

    ... is not to be confused with Amit Singh, who also works at Google and has authored an excellent book on Mac OS X, Mac OS X Internals.

    1. Re:Amit Singhal ... by Aeamarth · · Score: 3, Funny

      Isn't he the Indian Google guy?

  3. Re:apple vs Apple by niheuvel · · Score: 5, Informative

    No, but I DO see the difference between 'appleS' and 'apple', just as the text you're quoting mentions.

  4. ...only one? by dwater · · Score: 4, Funny

    > They use 200 "signals" and "classifiers," of which PageRank is only one.

    How many did they expect PageRank to be? In the words of someone immortal, "There can be only one.".

    --
    Max.
    1. Re:...only one? by rtb61 · · Score: 2, Interesting

      From the results I've been getting lately, they seem to be dropping page rank in preference to how many times the words 'google adwords' appear on the page, or more precisely the code for generating them. Totally worthless pages, but obviously not worthless for google's bottom line. This story obviously reflects one thing and one thing only: the growing perception in the public's eye of the deteriorating quality of google's results. Hence yet another marketing fluff piece to try to convince them it just ain't so.

      --
      Chaos - everything, everywhere, everywhen
  5. Re:apple vs Apple by The+New+Andy · · Score: 2, Funny

    Oh yeah. Woops. That isn't as interesting :-)

  6. Re:Google... by Anonymous Coward · · Score: 3, Insightful

    In Soviet Russia, they shoot idiots who don't realize this joke is dead.

  7. Feature Request by rueger · · Score: 4, Insightful

    My ongoing gripe with Google is the number of times when the first page is filled with shopping sites, "review" pages, and click-through pages that exist only to grab you on the way to where you really want to go.

    I would love a switch, or even a subscription, that would allow me to filter these usually useless types of pages and instead show me pages with real content.

    1. Re:Feature Request by Fred_A · · Score: 2, Funny

      Haven't had much trouble with the click-through sites, but when looking for some information on anything that can potentially be sold (or even, as I recently experienced, has been sold in the not too distant past but hasn't been in the last five years), the shopping sites are a real problem.

      This item you're searching for hasn't been in inventory for 6 years since nobody makes it anymore, would you like to read a review ? : be the first to write one !

      Yay.

      --

      May contain traces of nut.
      Made from the freshest electrons.
    2. Re:Feature Request by SilentStrike · · Score: 4, Informative

      This probably does what you want.

      http://www.givemebackmygoogle.com/

      It just negates a whole lot of affiliate sites.

      This is part of the query it feeds to Google.

      -inurl:(kelkoo|bizrate|pixmania|dealtime|pricerunner|dooyoo|pricegrabber|pricewatch|resellerratings|ebay|shopbot|comparestoreprices|ciao|unbeatable|shopping|epinions|nextag|buy|bestwebbuys)
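
      A minimal sketch, in Python, of how an exclusion clause like that could be assembled. The function names and the shortened site list are made up for illustration; this is not the site's actual code.

```python
def exclusion_clause(sites):
    """Build a -inurl:(a|b|c) clause that excludes URLs containing any of the names."""
    return "-inurl:(" + "|".join(sites) + ")"


def filtered_query(terms, sites):
    """Append the exclusion clause to the user's own search terms."""
    return f"{terms} {exclusion_clause(sites)}"


# Shortened, illustrative subset of the list quoted above.
SHOPPING_SITES = ["kelkoo", "bizrate", "pricegrabber", "nextag", "epinions"]

print(filtered_query("teak patio furniture", SHOPPING_SITES))
```

      The resulting string can be pasted into the search box as-is.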

    3. Re:Feature Request by quiddity · · Score: 4, Informative

      Firefox extension: http://www.customizegoogle.com/ lets you filter out URLs from the results (plus dozens of other useful things).

      You can filter out Wikipedia mirrors (using that extension) with the list here: http://meta.wikimedia.org/wiki/Mirror_filter

      --
      .
      . hmmm
  8. Many other things are goo(gle)d by Xoq+jay · · Score: 3, Interesting

    Pagerank is the source of all wisdom in google... but there is so much more... Like string searching & matching algos, file searching.. you name it.. Just the other day I was searching for books about Google's algorithms... I found zero interesting stuff.. They keep their algorithms secret and out of the public domain... (like they should..). we praise Pagerank, but if we knew what other stuff is there, we would all be members of Church of Google (http://www.thechurchofgoogle.org/) :P

    --
    God had a 7 day deadline... So he made the world in LISP
    1. Re:Many other things are goo(gle)d by mattpointblank · · Score: 3, Insightful

      Could it not simply be that they're not keeping it under wraps to avoid sneaky webmasters manipulating their sites, but to prevent competitors gaining an edge?

    2. Re:Many other things are goo(gle)d by Glacial+Wanderer · · Score: 2, Interesting

      I would agree that's likely the reason that Google won't release their algorithm, but my question was why many people outside of Google insist that Google should keep their algorithm secret. If Google in a moment of financial insanity released their search algorithms to their competition it wouldn't decrease the quality of my search results, actually that might improve my results if someone takes Google's algorithm and improves on it.

    3. Re:Many other things are goo(gle)d by chainLynx · · Score: 2, Informative
  9. Now I understand by Timesprout · · Score: 5, Funny

    > "Search over the last few years has moved from 'Give me what I typed' to 'Give me what I want,'" says Mr. Singhal
    So this is why all my results are links to lesbian porn regardless of what I search for.
    --
    Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
    What truth?
    There is no dupe
  10. Re:Google sucks. by WrongSizeGlass · · Score: 4, Funny

    > Google Search is a primitive tool used by fanboys "Googling" for pictures of Natalie Portman.

    Ha! Shows what you know. The only pics I search for are of a tall drink of Texas water named Patricia Vonne and of Cowboy Neal in his homemade Hulk costume. Who knew the Hulk wore a tri-corner hat & rainbow wrestling boots?
  11. Googling Uncommon Characters and Exact Phrases by Anonymous Coward · · Score: 3, Interesting

    One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages. A google search for "ruby <<" returns no results related to the ruby append operator. A simple search for "<<" by itself returns ZERO results.

    1. Re:Googling Uncommon Characters and Exact Phrases by Dun+Malg · · Score: 3, Informative

      > One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages. A google search for "ruby <<" returns no results related to the ruby append operator. A simple search for "<<" by itself returns ZERO results.

      Yes, well you see that's a problem common to most search systems. Non-alphanumeric characters tend to be reserved for search logic. It would indeed be nice if there was a way to force literals into the search terms, but for now we just have to make do the way we always have: search for ruby append instead, or (if you don't know what it's called) search for ruby string operators and find out.
      --
      If a job's not worth doing, it's not worth doing right.
    2. Re:Googling Uncommon Characters and Exact Phrases by Animats · · Score: 3, Informative

      Yes. Try to find information on the web about the language "C+@". It's real, and it was developed at Bell Labs some years ago back in the Plan 9 era, but it's unsearchable.

    3. Re:Googling Uncommon Characters and Exact Phrases by Blikkie · · Score: 2, Insightful

      One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages.

      You should try google code search.

    4. Re:Googling Uncommon Characters and Exact Phrases by drix · · Score: 2, Insightful

      I have the same problem. But if you're searching for actual code, you're better off using a code search engine. Or as others have pointed out, search "ruby append operator" if you're interested in the concept.

      --

      I think there is a world market for maybe five personal web logs.
    5. Re:Googling Uncommon Characters and Exact Phrases by Spy+Hunter · · Score: 2, Interesting

      This is an interesting question that I've often wondered about. It's possible that Google programmers simply went in and special-cased C++ and C#, but I personally think that Google has an automated process which notices that "C++" and "C#" are commonly occurring both in web pages and queries, and then automatically adds them to the list of "strange" tokens to index.

      --
      main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}
  12. One search feature by Z00L00K · · Score: 5, Interesting
    that has been lost is the "NEAR" keyword that AltaVista used to offer. I found it rather useful.

    This could allow for a better search result when using for example "APPLE NEAR MACINTOSH" or "APPLE NEAR BEATLES"

    Ho hum... Times change and not always for the better...

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
  13. Re:North America Centric by datapharmer · · Score: 2, Informative

    Actually, using -site:.co.uk would yield much better results, since he will then get everything except .co.uk instead of just .com

    --
    Get a web developer
  14. Toilet seat by rbarreira · · Score: 3, Funny

    Does the algorithm account for the toilet seat's position?

    --

    The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
  15. Google is human too by polarbeer · · Score: 5, Insightful

    One interesting thing about the article was the down-to-earth lack of abstraction in the problems described, such as the teak patio palo alto problem. Other search engines brag about their web-filtered-by-humans approach, as opposed to the "cold" algorithmic approach of Google. But it turns out Google is pretty human too, only with higher ambitions of creating generalizations from the human observations.

  16. A way to get that by i+kan+reed · · Score: 2, Informative

    Wildcards in strings: "apple * macintosh" will return pages with the word macintosh shortly following apple. Not reversible, but still quite useful for that kind of search.

  17. Re:The most annoying thing about Google's results. by Shohat · · Score: 2, Insightful

    Slashdot is as much of a blog as I am an Egyptian gerbil. Slashdot links to stories that generate discussions. Slashdot is NOT about the people that create the posts, but about the people that comment here.

  18. Re:North America Centric by aldheorte · · Score: 2, Funny

    If the UK sites in particular are the ones you want out of your search results, compare these searches on Google:

    digestives london

    digestives london -inurl:.uk

  19. "Millions Of Black Boxes"? by aldheorte · · Score: 3, Interesting

    Not sure about this:

    "Google rarely allows outsiders to visit the unit, and it has been cautious about allowing Mr. Singhal to speak with the news media about the magical, mathematical brew inside the millions of black boxes that power its search engine."

    I could see tens of thousands, maybe hundreds of thousands, but millions?

    1. Re:"Millions Of Black Boxes"? by asninn · · Score: 2, Informative

      This is from a year ago (July 2006):

      Google runs on hundreds of thousands of servers--by one estimate, in excess of 450,000--racked up in thousands of clusters in dozens of data centers around the world.

      If this figure is accurate, a million boxen nowadays doesn't seem out of reach.

      --
      butter the donkey
  20. How does it work by Anonymous Coward · · Score: 5, Informative

    It is rather simple (I am an insider).

    Google breaks pages into words. Then, for every word, it keeps a set which contains all the pages (by hash ID) that contain that word. A set is a data structure with O(1) lookup.

    When you search for "linux+kernel", google just does the set intersection operation on the two sets, since a result has to contain both words.

    Now a "word" is not just a word. If google sees that many people use the combination linux+kernel, a new word is created, the linux+kernel word, and it has a set of all the pages that contain it. So when you search for linux+kernel+ppp we find the intersection of the linux+kernel set and the "ppp" set.

    So every time you search, you make it better for google to create new words. And this is part of the power of this search engine. A new search engine will need some time to gather that empirical data.

    Of course, there are ranks of sets. For example, for the word "ppp" there are, say, two sets. The pages of high rank that contain the word ppp, and the pages of low rank. When you search for ppp+chap, first you get the set intersection of the high rank sets of the two words, etc.
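
    A rough sketch of the word-to-pages index described here, in Python. It assumes a multi-word query should return only pages that contain every word, which makes the combining step a set intersection. The sample pages and function names are made up for illustration, and nothing here is claimed to be Google's actual code.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the IDs of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # .get() avoids creating empty entries for unknown words
    posting_sets = [index.get(word, set()) for word in words]
    # start from the smallest set so the intersection stays cheap
    posting_sets.sort(key=len)
    result = set(posting_sets[0])
    for pages_with_word in posting_sets[1:]:
        result &= pages_with_word
    return result


# Hypothetical three-page corpus.
PAGES = {
    1: "linux kernel source",
    2: "linux desktop themes",
    3: "kernel ppp driver for linux",
}
INDEX = build_index(PAGES)
print(search(INDEX, "linux kernel"))  # pages 1 and 3
```

    The "new word" trick would amount to adding an extra key such as "linux kernel" whose set is precomputed, and the rank tiers to keeping one such index per tier and querying the high-rank one first.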

    Now page rank has several criteria. Here are some:
    well ranked site/domain, linked by well ranked page, document contains relevant words, search term is in the title or url, page rank not lowered by google employee (level 1), page rank increased, etc.

    It is not very difficult actually.

    (posting AC for a reason).

  21. Re:Algorithm? by mestar · · Score: 2, Insightful

    So what do you call the "thing" that you use to implement a heuristic?

  22. Re:Page rank is only a part of the story by martin-boundary · · Score: 3, Informative
    Read the article, it gives a pretty clear picture of what's going on if you're a little familiar with classification ideas, eg bagging, boosting etc. Don't read further if you're familiar with those terms.

    A classifier is a black box which takes some data as input, and computes one or more scores. The simplest example is a binary classifier, say for spam. You feed it some data (eg an email) and you get a score back. If it's a big score, say, then the classifier thinks it's spam, and if it's a small score it's not spam. More generally, a classifier could give three scores to represent spam, work, home, and you could pick the best score to get the best choice.

    So you should really think of a classifier as a little program that does one thing really well, and only one thing. For example, you can build a small classifier that checks whether the input text is English or Russian. That's all it does.

    Now imagine you have 100 engineers, and each engineer has a specialty, and each builds a really small classifier to do one thing well. The logic of each classifier is black boxed, so from the outside it's just a component, kind of like a lego brick. What happens when you feed the output of one lego brick to the input of another lego brick?

    Say you have three classifiers: English spam recognizer, Russian spam recognizer, English/Russian identifier. You build a harness which uses the English/Russian identifier first, and then depending on the output your program connects the English spam recognizer or the Russian spam recognizer.
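
    A toy version of that three-classifier harness, in Python. The keyword lists, the 0.3 threshold and all function names are invented for illustration; real classifiers would be trained rather than hand-written like this.

```python
def identify_language(text):
    """Toy identifier: 'ru' if any Cyrillic letter appears, otherwise 'en'."""
    return "ru" if any("\u0400" <= ch <= "\u04ff" for ch in text) else "en"


def keyword_score(text, spam_words):
    """Fraction of the words in text that appear in spam_words (0.0 to 1.0)."""
    words = text.lower().split()
    return sum(word in spam_words for word in words) / max(len(words), 1)


def english_spam_score(text):
    return keyword_score(text, {"free", "winner", "viagra"})


def russian_spam_score(text):
    return keyword_score(text, {"бесплатно", "выигрыш"})


# The harness: the identifier's output decides which recognizer gets connected.
SPAM_RECOGNIZERS = {"en": english_spam_score, "ru": russian_spam_score}


def is_spam(text, threshold=0.3):
    """Run the language identifier first, then the matching spam recognizer."""
    language = identify_language(text)
    return SPAM_RECOGNIZERS[language](text) >= threshold


print(is_spam("free winner today"))         # True
print(is_spam("meeting at noon tomorrow"))  # False
```

    Each function is a black box to the others; only the scores and labels travel between them.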

    Now imagine a huge network with some classifiers in parallel and some classifiers in series. At the top there's the query words, and they travel through the network. One of the classifiers might trigger word completion (ie bio -> biography as in the article), another might toggle the "fresh" flag, or the "wikipedia" flag etc. In the end, your output is a complicated query string which goes looking for the web pages.

    The key idea now is to tweak the choice thresholds. To do that, there's no theory. You have to have a set of standard queries with a list of the outputs the algorithm must show. Let's say you have 10,000 of these queries. You run each query through the machine, and you get a yes/no answer for each one, and you try to modify the weights so that you get a good number of correct queries.
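
    As a sketch of that tuning loop, here is a brute-force version in Python for a single threshold. The scores, expected answers and candidate thresholds are invented numbers standing in for the standard queries, and trying every candidate is just the simplest way to do it, not what the article says Google does.

```python
def count_correct(threshold, labelled_scores):
    """Count test cases where 'score >= threshold' matches the expected yes/no answer."""
    return sum((score >= threshold) == expected for score, expected in labelled_scores)


def best_threshold(labelled_scores, candidates):
    """Try every candidate threshold and keep the one with the most correct answers."""
    return max(candidates, key=lambda t: count_correct(t, labelled_scores))


# (classifier score, expected answer) pairs, one per standard query.
TEST_SET = [(0.9, True), (0.7, True), (0.6, False), (0.4, False), (0.2, False)]
CANDIDATES = [0.1, 0.3, 0.5, 0.65, 0.8]

chosen = best_threshold(TEST_SET, CANDIDATES)
print(chosen, count_correct(chosen, TEST_SET))  # 0.65 5
```

    With many thresholds and weights at once the search space explodes, so trying every combination like this stops being practical.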

    Of course you want to speed things up as much as possible. You can use mathematical tricks to find the best weights, and you don't need to go get the actual pages if your output is a query string: you just compare the query string with the expected query string. But all of that would depend on your classifiers, the scheme used to evaluate the test results, and how good your engineers are.

    The point is that there's no magic ingredient, it's all ad-hoc. Edison tried hundreds of different materials for the filament in his lightbulb. Google is doing the same thing according to the article. What matters for this kind of approach is a huge dataset (ie bigger than any competitors') and a large number of engineers (not just to build enough components, but to deprive its competitors of manpower). The exact details of the classifier components aren't too important if you have a comprehensive way of combining them.

  23. I'm familiar with all this stuff by melted · · Score: 2, Interesting

    And the thing that I want to know is how they evaluate the results. I actually do research in this space right now, and by far the most painful thing is evaluation of results. We have a system that automates most of the work, but there's still a lot of human involvement, and this limits the input dataset size and speed with which we can iterate the improvements.

    1. Re:I'm familiar with all this stuff by martin-boundary · · Score: 3, Interesting
      Good question. I agree with you that the article doesn't say anything valuable in this respect :(

      When you say that your system is limited by human involvement, I presume you mean that implementing new features can have serious impact on the overall design (and therefore on testing procedures)? Feel free to not answer if you can't.

      One thing I found interesting in the article is that Google's system sounds like it scales well. It reminded me of antispam architectures like Brightmail's (if memory serves), which have large numbers of simple heuristics which are chosen by an evolutionary algorithm. The point is that new heuristics can be added trivially without changing the architecture. I think their system used 10,000 when they described it a few years ago at an MIT spam conference. Adjustments were done nightly by monitoring spam honeypots.

      I'd love to see better competition in the search engine space. I hope you succeed at improving your tech.