The Man Behind Google's Ranking Algorithm
nbauman writes "New York Times interview with Amit Singhal, who is in charge of Google's ranking algorithm. They use 200 "signals" and "classifiers," of which PageRank is only one. "Freshness" defines how many recently changed pages appear in a result. They assumed old pages were better, but when they first introduced Google Finance, the algorithm couldn't find it because it was too new. Some topics are "hot". "When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds," said Singhal. Classifiers infer information about the type of search, whether it is a product to buy, a place, company or person. One classifier identifies people who aren't famous. Another identifies brand names. A final check encourages "diversity" in the results, for example, a manufacturer's page, a blog review, and a comparison shopping site."
Pigeon Rank?
Well the results for both "apple" and "Apple" are identical for me (apple computer dominated), with the exception of the text in the ads on the right hand side (which are both for apple computers). Maybe they are doing other stuff (Linux users prefer computers over fruit?).
Does anyone see anything different when they search for "apple" versus "Apple"?
... is not to be confused with Amit Singh, who also works at Google and has authored an excellent book on Mac OS X, Mac OS X Internals.
> They use 200 "signals" and "classifiers," of which PageRank is only one.
How many did they expect PageRank to be? In the words of someone immortal, "There can be only one."
Max.
I wish I could give google.ca a signal to return pages from North America.
I'll search for a product and the first page of results will all be *.co.uk results.
Not much use to that. Makes me think about how to rephrase the search, which is good.
In Soviet Russia, they shoot idiots who don't realize this joke is dead.
My ongoing gripe with Google is the number of times when the first page is filled with shopping sites, "review" pages, and click-through pages that exist only to grab you on the way to where you really want to go.
I would love a switch, or even a subscription, that would allow me to filter these usually useless types of pages and instead show me pages with real content.
Three Squirrels
Pagerank is the source of all wisdom in google... but there is so much more... Like string searching & matching algos, file searching.. you name it.. Just the other day I was searching for books about Google's algorithms... I found zero interesting stuff.. They keep their algorithms secret and out of the public domain... (like they should..). we praise Pagerank, but if we knew what other stuff is there, we would all be members of Church of Google (http://www.thechurchofgoogle.org/) :P
God had a 7 day deadline... So he made the world in LISP
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
What truth?
There is no dupe
One of the most annoying things about Google for me is how it interprets queries with strange characters common to almost all programming languages. A Google search for "ruby <<" returns no results related to the Ruby append operator. A simple search for "<<" by itself returns ZERO results.
This could allow for a better search result when using for example "APPLE NEAR MACINTOSH" or "APPLE NEAR BEATLES"
Ho hum... Times change, and not always for the better...
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
All that education, and to be in charge of an algorithm! Damn life is cruel.
Does the algorithm account for the toilet seat's position?
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
One interesting thing about the article was the down-to-earth lack of abstraction in the problems described, such as the teak patio palo alto problem. Other search engines brag about their web-filtered-by-humans approach, as opposed to the "cold" algorithmic approach of Google. But it turns out Google is pretty human too, only with higher ambitions of creating generalizations from the human observations.
If only they could solve googlebombing on news.google.com by bloggers with right wing agendas. The left wing agendas seem to be gone already, for some reason.
``Tension, apprehension & dissension have begun!'' - Duffy Wyg&, in Alfred Bester's _The Demolished Man_
I think "NEAR" is implied with Google. That is to say, if you search for "apple macintosh", pages with those two terms in close proximity will rank higher than pages which simply contain the terms. Since Google's exact algorithms are proprietary, I cannot swear to this, but that seems to be the way it behaves in my own use.
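The proximity behaviour guessed at above can be sketched in a few lines. This is only an illustration of the idea that closer terms score better, not Google's method, which is proprietary; the function names `min_term_distance` and `rank_by_proximity` and the whitespace tokenization are assumptions made for the example.

```python
from itertools import product


def min_term_distance(text, term_a, term_b):
    """Smallest gap in words between the two terms, or None if either is missing."""
    words = text.lower().split()  # assumption: naive whitespace tokenization
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a, b in product(pos_a, pos_b))


def rank_by_proximity(docs, term_a, term_b):
    """Keep only documents containing both terms, closest pair first."""
    scored = []
    for doc in docs:
        distance = min_term_distance(doc, term_a, term_b)
        if distance is not None:
            scored.append((distance, doc))
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0])]
```

A real engine would combine a distance score like this with many other signals rather than sort on it alone.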
What I miss from Alta Vista is the ability to do grouping to set precedence, i.e., parentheses. I don't have to do this very often, but when I do, I really miss it. The need generally comes about when a given thing has a lot of different names or ways to describe it, and I want to say "this OR that OR (foo AND (bar OR baz))".
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
but would they blend a whole Beowulf cluster of them, ????? and then profit?
rewriting history since 2109
Wildcards in strings: "apple * macintosh" will return pages with the word macintosh shortly following apple. Not reversible, but still quite useful for that kind of search.
I find it extremely annoying that Google indexes blogs.
Blogs are read only by bloggers and the press, and present absolutely no interest to normal people (including me). Currently, because of Google's idiotic blog fetish, I have to eliminate 50% of the results just based on URLs, hoping that I won't stumble upon someone's personal ramblings. Blogs became popular only due to Google's absolutely inexplicable love of blog content, and its sticking it into perfectly normal search results; it's like searching in a world-wide Myspace now.
The most amazing thing is when Google puts blog search results above the source of the story, to which the blogs are linking in the first place. I'm just waiting for this fad to die out like podcasting did. Unfortunately, Google controls the popularity of blogging, so it won't die out naturally; Google at least has to stop indexing them... or put in a "show/hide blog results" checkbox...
My Starcraft 2 Blog
> Blogs are read only by bloggers and the press, and present absolutely no interest to normal people (including me).
Considering that you're reading a blog, I think it's pretty fair to say that you're only counting web pages that you think suck as blogs... so of course you don't like the results. Amazingly, no one is willing to tag their blog as "shohat will think this sucks, so please don't search me."
So close and yet so far from the world's perfect ID number
I'd like to know how they transform their queries before running them against the index. I.e. how they decide whether they should throw out the "stop" words (most prepositions, some verbs, some nouns) or keep them, whether they should throw in an alternative spelling or synonym, whether they should throw in a semantically related word or two to increase recall (this is evident when you search for something and get related words highlighted in the results), when to stem and when not to stem.
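To make the question concrete, here is a minimal sketch of such a query-rewriting step: drop stop words, stem, and expand with synonyms. Everything in it is an assumption for illustration: the tiny `STOP_WORDS` and `SYNONYMS` tables, the crude suffix-stripping in `naive_stem`, and the `keep_stop_words` switch. How Google actually makes these decisions is not public.

```python
# Assumed toy tables; a real system would learn or curate far larger ones.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "for"}
SYNONYMS = {"photo": ["picture"]}


def naive_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform_query(query, keep_stop_words=False):
    """Return the list of terms that would be run against the index."""
    terms = []
    for word in query.lower().split():
        if not keep_stop_words and word in STOP_WORDS:
            continue  # drop low-information words unless told to keep them
        stem = naive_stem(word)
        terms.append(stem)
        terms.extend(SYNONYMS.get(stem, []))  # widen recall with synonyms
    return terms
```

The `keep_stop_words` flag stands in for the hard part of the question: a query such as "the who" loses its meaning if the stop words are thrown away.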
Those are the things that keep them ahead. Page rank is pretty much solved by now, which is why this dude is allowed to talk about it even at this level of detail.
WRT page rank it'd be interesting to know how they train the classifiers and individual classifier weights. The problem is that human experiments are extremely expensive for this stuff.
Slashdot is as much of a blog as I am an Egyptian gerbil. Slashdot links to stories that generate discussions. Slashdot is NOT about the people that create the posts, but about the people that comment here.
From TFA:
>>A search-engine tweak gave more weight to pages with phrases like "French Revolution" rather than pages that simply had both words.
So, now search engines are giving more importance to connected words rather than scattered words. How refreshing!
Come now, everyone knows there's no man behind Google's page rank. It's handled entirely by an army of birds.
http://www.google.com/technology/pigeonrank.html
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Not sure about this:
"Google rarely allows outsiders to visit the unit, and it has been cautious about allowing Mr. Singhal to speak with the news media about the magical, mathematical brew inside the millions of black boxes that power its search engine."
I could see tens of thousands, maybe hundreds of thousands, but millions?
Slashdot is very very much a blog, which is a chronologically arranged web page. You're really bitching about personal home pages, which used to exist as regular ole' web pages, but now are blogs because they're easier to setup (no HTML required) and because the "chronological" nature of blogs works very well for journals.
If blogs didn't exist we'd just have more geocities pages getting lots of links.
In the time it took you to post that comment, David Banh finished medical school.
Please, for the good of Humanity, vote Obama.
It is rather simple (I am an insider).
Google breaks pages into words. Then, for every word it keeps a set which contains all the pages (by hash ID) that contain that word. A set is a data structure with O(1) lookup.
When you search for "linux+kernel" Google just does the set intersection operation on the two sets, since only pages that appear in both sets contain both words.
Now a "word" is not just a word. If Google sees that many people use the combination linux+kernel, a new word is created, the linux+kernel word, and it has a set of all the pages that contain it. So when you search for linux+kernel+ppp we find the intersection of the linux+kernel set and the "ppp" set.
So every time you search, you make it better for google to create new words. And this is part of the power of this search engine. A new search engine will need some time to gather that empirical data.
Of course, there are ranks of sets. For example, for the word "ppp" there are, say, two sets. The pages of high rank that contain the word ppp, and the pages of low rank. When you search for ppp+chap, first you get the set intersection of the high rank sets of the two words, etc.
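The word-to-pages structure described here is usually called an inverted index. A minimal sketch follows, assuming a multi-word query should return only pages that contain every term, so the per-word sets are combined by intersection. The names `build_index` and `search` and the sample page IDs are made up for the example, and nothing here is confirmed as Google's implementation.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page IDs whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():  # assumption: whitespace tokenization
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the IDs of pages that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))  # copy so the index is not mutated
    for term in terms[1:]:
        result &= index.get(term, set())  # keep only pages that also have this term
    return result


# Hypothetical sample data
pages = {1: "linux kernel ppp", 2: "linux desktop", 3: "kernel ppp chap"}
index = build_index(pages)
print(search(index, "linux kernel"))
```

The "new word" trick for popular combinations would amount to storing the result of a frequent intersection under its own key, so later queries start from a smaller set.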
Now page rank has several criteria. Here are some:
well ranked site/domain, linked by well ranked page, document contains relevant words, search term is in the title or URL, page rank not lowered by Google employee (level 1), page rank increased, etc.
It is not very difficult actually.
(posting AC for a reason).
This does not seem like an algorithm anymore, it is more of a heuristic. An algorithm can be proved to be correct and its running time can be analyzed. An algorithm is provably correct whereas a heuristic just works for practical purposes.
"But last year, Mr. Singhal started to worry that Google's balance was off. When the company introduced its new stock quotation service, a search for "Google Finance" couldn't find it. After monitoring similar problems, he assembled a team of three engineers to figure out what to do about them."
But then they changed the algorithm and now the Google Finance site is at the top.
> "NEAR" keyword
Isn't that what the single quote (') construct is for: 'widget offbeat'
> I could see tens of thousands, maybe hundreds of thousands, but millions?
It's in Google's interest to have competitors think of it as bigger than it is.
So, if they count each IC on a mobo or drive controller, they probably do have millions of black boxes at Google, literally.
Alternately, they could be talking about algorithms, instances thereof, etc., though I like the black IC's better.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
And the thing that I want to know is how they evaluate the results. I actually do research in this space right now, and by far the most painful thing is evaluation of results. We have a system that automates most of the work, but there's still a lot of human involvement, and this limits the input dataset size and speed with which we can iterate the improvements.
Dude, most of the things he talked about are taught in any decent Web Search or Machine Learning course. He is not disclosing any secrets, and Page Rank is actually a 5 day homework assignment, not a life's work. Google has gone far beyond Page Rank, and Page Rank is just the dummy Google likes to wave about so that people are busy trying to beat Page Rank and not their real classifiers. And classifiers are a dime a dozen. Tying them up with efficient network and database resources is Google's key contribution. Rest assured the reason Google is doing well is kept pretty secret. (Hint: It's the database and network algos which allow the maintenance of huge databases and indexes in a distributed manner). BTW, if you don't believe Page Rank is a HW assignment, look at Dr. Mooney's course page here http://www.cs.utexas.edu/~mooney/ir-course/
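For a sense of the homework-sized version, here is a sketch of the textbook power-iteration formulation of PageRank. It is the classroom algorithm, not whatever Google runs in production; the damping factor of 0.85 is the commonly cited default, and the function name, parameters, and handling of pages without outgoing links are choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # every page gets the "random jump" share first
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # page with no outgoing links: spread its rank over all pages
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph page "c" is linked from both "a" and "b", so it ends up ranked above "b", which is only linked from "a".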
**Life is too short to be serious**
One of the New Yorkers munched on cake.
I find it frustrating when I am searching for free market data, often available in the form of press releases or summaries of whitepapers. Things such as the size of a particular software or appliance market.
When I search, Google usually gives me information from 2001, 2002, 2003, and it is hard to tell it I want only data from 2006/2007. The problem is that the sites that end up in the search constantly refresh the ads and links around their old stories, which makes Google think the page is fresh.
This was not a big problem when most of the internet content was no more than a year old. This problem will get worse unless Google is smarter about recognising that the core content of a page (magazine story, whitepaper etc) was written in 2002, is now out of date, and should be further down the list than something written in 2007.
Try allintitle: worked for me! It was on the first page. (Well, one link away; however the text "C+@" _was_ in the description text)
Also try calico. (aka)
Crap. What did the new CSS do with the "Post anonymously" option??
There is only one algorithm that really matters:
For Each page In results
    If page.HasAdwords = True And Not page.content = junk Then
        page.MoneyRank = page.clickthroughrate * page.AdwordsValue
        results.add page
    Else
        Ignore
    End If
Next page
results Order By MoneyRank DESC
In ancient Egypt, they used dream interpretation to study intelligence and fate. Maybe they should change the algorithm to incorporate http://www.dreamcrowd.com/dream_interpretation