Slashdot Mirror


The Man Behind Google's Ranking Algorithm

nbauman writes "New York Times interview with Amit Singhal, who is in charge of Google's ranking algorithm. They use 200 "signals" and "classifiers," of which PageRank is only one. "Freshness" defines how many recently changed pages appear in a result. They assumed old pages were better, but when they first introduced Google Finance, the algorithm couldn't find it because it was too new. Some topics are "hot". "When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds," said Singhal. Classifiers infer information about the type of search, whether it is a product to buy, a place, company or person. One classifier identifies people who aren't famous. Another identifies brand names. A final check encourages "diversity" in the results, for example, a manufacturer's page, a blog review, and a comparison shopping site."

14 of 115 comments (clear)

  1. apple vs Apple by The+New+Andy · · Score: 1, Informative
    The formulas can tell that people who type "apples" are likely to be thinking about fruit, while those who type "Apple" are mulling computers or iPods.

    Well the results for both "apple" and "Apple" are identical for me (apple computer dominated), with the exception of the text in the ads on the right hand side (which are both for apple computers). Maybe they are doing other stuff (Linux users prefer computers over fruit?).

    Does anyone see anything different when they search for "apple" versus "Apple"?

    1. Re:apple vs Apple by niheuvel · · Score: 5, Informative

      No, but I DO see the difference between 'appleS' and 'apple', just as the text you're quoting mentions.

  2. Amit Singhal ... by WrongSizeGlass · · Score: 5, Informative

    ... is not to be confused with Amit Singh, who also works at Google and has authored an excellent book on Mac OS X Mac OS X Internals.

  3. Re:North America Centric by datapharmer · · Score: 2, Informative

    Actually, using -site:.co.uk would yield much better results. Since he will then get everything except .co.uk instead of just .com

    --
    Get a web developer
  4. Re:Googling Uncommon Characters and Exact Phrases by Dun+Malg · · Score: 3, Informative

    One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages. A google search for "ruby <<" returns no results related to the ruby append operator. A Simple search for "<<", by itself returns ZERO results. Yes, well you see that's a problem common to most search systems. Non-alphanumeric characters tend to be reserved for search logic. It would indeed be nice if there was a way to force literals into the search terms, but for now we just have to make do the way we always have: search for ruby append instead, or (if you don't know what it's called) search for ruby string operators and find out.
    --
    If a job's not worth doing, it's not worth doing right.
  5. Re:Hrm, and all this time I though it was... by UltraAyla · · Score: 4, Informative
  6. Re:Googling Uncommon Characters and Exact Phrases by Animats · · Score: 3, Informative

    Yes. Try to find information on the web about the language "C+@". It's real, and it was developed at Bell Labs some years ago back in the Plan 9 era, but it's unsearchable.

  7. A way to get that by i+kan+reed · · Score: 2, Informative

    Wildcards in strings "apple * macintosh" will return pages with the word macintosh shortly following apple. Not reversable, but still quite useful for that kind of search.

  8. Re:Many other things are goo(gle)d by chainLynx · · Score: 2, Informative
  9. Re:Feature Request by SilentStrike · · Score: 4, Informative

    This probably does what you want.

    http://www.givemebackmygoogle.com/

    It just negates a whole lot of affliate sites.

    This is part of the query it feeds to Google.

    -inurl:(kelkoo|bizrate|pixmania|dealtime|pricerunn er|dooyoo|pricegrabber|pricewatch|resellerratings| ebay|shopbot|comparestoreprices|ciao|unbeatable|sh opping|epinions|nextag|buy|bestwebbuys)

  10. Re:Feature Request by quiddity · · Score: 4, Informative

    Firefox extension: http://www.customizegoogle.com/ lets you filter out URLs from the results (plus dozens of other useful things).

    You can filter out Wikipedia mirrors (using that extension) with the list here: http://meta.wikimedia.org/wiki/Mirror_filter

    --
    .
    . hmmm
  11. How does it work by Anonymous Coward · · Score: 5, Informative

    It is rather simple (I am an insider).

    Google breaks pages in words. Then, for evey word it keeps a set which contains all the pages (by hash ID) that contain that word. A set is a data structure with O(1) lookup.

    When you search for "linux+kernel" google just does the set union operation on the two sets.

    Now a "word" is not just a word. In google sees that many people use the combination linux+kernel, a new word is created, the linux+kernel word and it has a set of all the pages that contain it. So when you search for linux+kernel+ppp we find the union of the linux+kernel set and the "ppp" set.

    So every time you search, you make it better for google to create new words. And this is part of the power of this search engine. A new search engine will need some time to gather that empirical data.

    Of course, there are ranks of sets. For example, for the word "ppp" there are, say, two sets. The pages of high rank that contain the word ppp, and the pages of low rank. When you search for ppp+chap, first you get the set union of the high rank sets of the two words, etc.

    Now page rank has several criteria. Here are some:
    well ranked site/domain, linked by well ranked page, document contains relevant words, search term is in the title or url, page rank not lowered by google emploee (level 1), page rank increased, etc.

    It is not very difficult actually.

    (posting AC for a reason).

  12. Re:Page rank is only a part of the story by martin-boundary · · Score: 3, Informative
    Read the article, it gives a pretty clear picture of what's going on if you're a little familiar with classification ideas, eg bagging, boosting etc. Don't read further if you're familiar with those terms.

    A classifier is a black box which takes some data as input, and computes one or more scores. The simplest example is a binary classifier, say for spam. You feed some data (eg an email) and you get a score back. If it's a big score say, then the classifier thinks it's spam, and if it's a small score it's not spam. More generally, a classifier could give three scores to represent spam, work, home, and you could pick the best score to get the best choice.

    So you should really think of a classifier as a little program that does one thing really well, and only one thing. For example, you can build a small classifier that looks if the input text is english or russian. That's all it does.

    Now imagine you have 100 engineers, and each engineer has a specialty, and each builds a really small classifier to do one thing well. The logic of each classifier is black boxed, so from the outside it's just a component, kind of like a lego brick. What happens when you feed the output of one lego brick to the input of another lego brick?

    Say you have three classifiers: english spam recognizer, russian spam recognizer, english/russian identifier. You build a harness which uses the english/russian identifier first, and then depending on the output your program connects the english spam recognizer or the russian spam recognizer.

    Now imagine a huge network with some classifiers in parallel and some classifiers in series. At the top there's the query words, and they travel through the network. One of the classifiers might trigger word completion (ie bio -> biography as in the article), another might toggle the "fresh" flag, or the "wikipedia" flag etc. In the end, your output is a complicated query string which goes looking for the web pages.

    The key idea now is to tweak the choice thresholds. To do that, there's no theory. You have to have a set of standard queries with a list of the outputs the algorithm must show. Let's say you have 10,000 of these queries. You run each query through the machine, and you get a yes/no answer for each one, and you try to modify the weights so that you get a good number of correct queries.

    Of course you want to speed things up as much as possible, you can use mathematical tricks to find the best weights, you don't need to go get the actual pages if your output is a query string you just compare the query string with the expected query string etc, but that would be depend on your classifiers, the scheme used to evaluate the test results, and how good your engineers are.

    The point is that there's no magic ingredient, it's all ad-hoc. Edison tried a hundreds of different materials for the filament in his lightbulb. Google is doing the same thing according to the article. What matters for this kind of approach is a huge dataset (ie bigger than any competitors') and a large number of engineers (not just to build enough components, but to deprive its competitors of manpower). The exact details of the classifier components aren't too important if you have a comprehensive way of combining them.

  13. Re:"Millions Of Black Boxes"? by asninn · · Score: 2, Informative

    This is from a year ago (July 2006):

    Google runs on hundreds of thousands of servers--by one estimate, in excess of 450,000--racked up in thousands of clusters in dozens of data centers around the world.

    If this figure is accurate, a million boxen nowadays doesn't seem out of reach.

    --
    butter the donkey