The Man Behind Google's Ranking Algorithm

Hrm, and all this time I thought it was... by Anonymous Coward · 2007-06-03 02:53 · Score: 4, Funny

Pigeon Rank?

Re:Hrm, and all this time I thought it was... by UltraAyla · 2007-06-03 04:06 · Score: 4, Informative

parent is not offtopic - http://www.google.com/technology/pigeonrank.html

apple vs Apple by The+New+Andy · 2007-06-03 02:54 · Score: 1, Informative

The formulas can tell that people who type "apples" are likely to be thinking about fruit, while those who type "Apple" are mulling computers or iPods.

Well the results for both "apple" and "Apple" are identical for me (apple computer dominated), with the exception of the text in the ads on the right hand side (which are both for apple computers). Maybe they are doing other stuff (Linux users prefer computers over fruit?).

Does anyone see anything different when they search for "apple" versus "Apple"?

Re:apple vs Apple by niheuvel · 2007-06-03 02:57 · Score: 5, Informative

No, but I DO see the difference between 'appleS' and 'apple', just as the text you're quoting mentions.
Re:apple vs Apple by foniksonik · 2007-06-03 03:00 · Score: 1

You have to search for 'apples' plural....

--
A fool throws a stone into a well and a thousand sages can not remove it.
Re:apple vs Apple by The+New+Andy · 2007-06-03 03:10 · Score: 2, Funny

Oh yeah. Whoops. That isn't as interesting :-)
Re:apple vs Apple by Aliriza · 2007-06-03 04:10 · Score: 1

The more complex the algorithms get, the more errors we'll get, and small leaks in the algorithm will bring more search engine spam.
Re:apple vs Apple by Anonymous Coward · 2007-06-03 08:17 · Score: 1

I'll buy you some tampons and show you how to use them. Meet me after school. I'll be in the parking lot with a grey van.
^_^
Re:apple vs Apple by Doctor-Optimal · 2007-06-04 09:21 · Score: 1

Holy shit, 4Chan is leaking again...

--
New punctuation update "~" (no quotes) at the end of a line to indicate sarcasm. ~

Amit Singhal ... by WrongSizeGlass · 2007-06-03 02:55 · Score: 5, Informative

... is not to be confused with Amit Singh, who also works at Google and has authored an excellent book on Mac OS X, Mac OS X Internals.

Re:Amit Singhal ... by Aeamarth · 2007-06-03 03:19 · Score: 3, Funny

Isn't he the indian google guy?
Re:Amit Singhal ... by Anonymous Coward · 2007-06-03 06:36 · Score: 0

I highly doubt it, since all Indian engineers are supposed to be worthless cheap labour with degrees from degree mills... (at least that's what Slashdot tells me)
Re:Amit Singhal ... by pickyouupatnine · 2007-06-03 07:52 · Score: 0, Troll

Could just be the brown American Google guy..

--
_Vishal www.squad9.com
Re:Amit Singhal ... by kunalthakar · 2007-06-03 15:50 · Score: 1

The two most important people behind Google search are actually an Indian (Singhal) and an Israeli (Manber). And in spite of this, the xenophobic people in the USA want to restrict H1Bs.

...only one? by dwater · 2007-06-03 02:58 · Score: 4, Funny

> They use 200 "signals" and "classifiers," of which PageRank is only one.

How many did they expect PageRank to be? In the words of someone immortal, "There can be only one."

--
Max.

Re:...only one? by Anonymous Coward · 2007-06-03 05:52 · Score: 0

"In the words of someone immortal", Kudos to Amit Singhal.
Re:...only one? by Prof.Phreak · 2007-06-03 06:32 · Score: 1

Now, what if they cut out pagerank completely, would their search results still be just as good?

--
"If anything can go wrong, it will." - Murphy
Re:...only one? by Anonymous Coward · 2007-06-03 12:50 · Score: 0

Or as bad? Sigh, people don't remember the old days around 2000, when google used to actually work really well. You could press the I'm feeling lucky button and expect a good response. Now? You have to wade through three pages of crap to find one gem, and it's straight to wikipedia for facts.
Re:...only one? by rtb61 · 2007-06-03 13:44 · Score: 2, Interesting

From the results I've been getting lately, they seem to be dropping page rank in preference to how many times the words 'google adwords' appear on the page, or more precisely the code for generating them. Totally worthless pages, but obviously not worthless for Google's bottom line. This story obviously reflects one thing and one thing only: the growing perception in the public's eye of the deteriorating quality of Google's results. Hence yet another marketing fluff piece, to try to convince them it just ain't so.

--
Chaos - everything, everywhere, everywhen

North America Centric by BACPro · 2007-06-03 03:11 · Score: 1

I wish I could give google.ca a signal to return pages from North America.

I'll search for a product and the first page of results will all be *.co.uk results.

Not much use to that. Makes me think on how to rephrase the search, which is good.

Re:North America Centric by TodMinuit · 2007-06-03 03:22 · Score: 1

You could add "site:.com" to the query. That might help.

--
I wonder if I use bold in my signature, people will notice my posts.
Re:North America Centric by Anonymous Coward · 2007-06-03 03:35 · Score: 0

Try adding this:
site:com

or perhaps in your case try:
site:ca
Re:North America Centric by datapharmer · 2007-06-03 03:37 · Score: 2, Informative

Actually, using -site:.co.uk would yield much better results, since he will then get everything except .co.uk instead of just .com

--
Get a web developer
Re:North America Centric by thePsychologist · 2007-06-03 03:53 · Score: 1

What's wrong with specifying "pages from Canada" or typing "stuff to search site:com site:ca" in the search bar? Not a perfect solution, but it takes away all the co.uk stuff. Or use -site:co.uk if those are the only ones bothering you.

--
"What lies behind us, and what lies before us are tiny matters compared to what lies within us." Ralph Waldo Emerson
Re:North America Centric by billcopc · 2007-06-03 04:00 · Score: 0, Offtopic

Are you suggesting that .co.uk sites are of lesser quality than Canadian and American content ?

Next on Slashdot: Scientific study links dental health to website quality.

--
-Billco, Fnarg.com
Re:North America Centric by Anonymous Coward · 2007-06-03 04:32 · Score: 0

Since the GP specifically mentioned products, I'm pretty sure it has more to do with shipping costs from the UK more than the quality of the sites.
Re:North America Centric by aldheorte · 2007-06-03 07:29 · Score: 2, Funny

If the UK sites in particular are the ones you want out of your search results, compare these searches on Google:

digestives london

digestives london -inurl:.uk

Re:Google... by Anonymous Coward · 2007-06-03 03:12 · Score: 3, Insightful

In Soviet Russia, they shoot idiots who don't realize this joke is dead.

Feature Request by rueger · 2007-06-03 03:13 · Score: 4, Insightful

My ongoing gripe with Google is the number of times when the first page is filled with shopping sites, "review" pages, and click-through pages that exist only to grab you on the way to where you really want to go.

I would love a switch, or even a subscription, that would allow me to filter these usually useless types of pages and instead show me pages with real content.

--
Three Squirrels

Re:Feature Request by Fred_A · 2007-06-03 03:46 · Score: 2, Funny

Haven't had much trouble with the click-through sites, but when looking for information on anything that can potentially be sold (or even, as I recently experienced, has been sold in the not too distant past but hasn't been in the last five years), the shopping sites are a real problem.

This item you're searching for hasn't been in inventory for 6 years since nobody makes it anymore, would you like to read a review ? : be the first to write one !

Yay.

--

May contain traces of nut.
Made from the freshest electrons.
Re:Feature Request by kestasjk · 2007-06-03 05:52 · Score: 1

Try a more specific query, or try a query that excludes "review", "sale", "price", or whatever you like.

I find that most queries give me what I want right away (eg paris hilton), and those that don't (eg lindsay lohan) do give me what I want after narrowing down the sites returned (eg lindsay lohan drunk car -herbie -vomit -intitle:"fan site").

--
// MD_Update(&m,buf,j);
Re:Feature Request by 2short · 2007-06-03 06:22 · Score: 1

I'm fairly confident that the feature you want is one Google is trying very hard to provide. I doubt adding a switch somewhere is the problem.
Re:Feature Request by SilentStrike · 2007-06-03 06:42 · Score: 4, Informative

This probably does what you want.

http://www.givemebackmygoogle.com/

It just negates a whole lot of affiliate sites.

This is part of the query it feeds to Google.

-inurl:(kelkoo|bizrate|pixmania|dealtime|pricerunner|dooyoo|pricegrabber|pricewatch|resellerratings|ebay|shopbot|comparestoreprices|ciao|unbeatable|shopping|epinions|nextag|buy|bestwebbuys)
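
For anyone who wants to roll their own version of that exclusion list, here is a minimal sketch in Python. It only assembles the query string in the same -inurl:(a|b|c) shape quoted above. The function name and the shortened word list are made up for illustration, and whether Google honours this syntax is not something the code can check.

```python
def build_query(terms, excluded_url_words):
    """Join the search terms and append one -inurl:(a|b|c) exclusion clause."""
    query = " ".join(terms)
    if excluded_url_words:
        # Same shape as the clause quoted in the comment above.
        query += " -inurl:(" + "|".join(excluded_url_words) + ")"
    return query


# Hypothetical, shortened exclusion list taken from the names in the comment.
SHOPPING_WORDS = ["kelkoo", "bizrate", "pricegrabber", "ebay", "nextag"]

if __name__ == "__main__":
    print(build_query(["teak", "patio", "chairs"], SHOPPING_WORDS))
```

Keeping the list in one place makes it easy to add or drop a site without retyping the whole clause.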
Re:Feature Request by quiddity · 2007-06-03 06:51 · Score: 4, Informative

Firefox extension: http://www.customizegoogle.com/ lets you filter out URLs from the results (plus dozens of other useful things).

You can filter out Wikipedia mirrors (using that extension) with the list here: http://meta.wikimedia.org/wiki/Mirror_filter

--
.
. hmmm
Re:Feature Request by Reaperducer · 2007-06-03 08:38 · Score: 1

I would love a switch, or even a subscription, that would allow me to filter these usually useless types of pages and instead show me pages with real content.
Ditto for Google News. I'd love to click something and have all the worthless blogs trying to pass for journalism disappear from the results.

Even worse is that Google News gives high rankings to some "news" web sites that merely steal the content of other sites and then re-publish it as their own. I'm not talking about link aggregators like Fark or Digg, but web sites that steal other people's content, then present it as their own work(*cough*eCanadaNow.com*cough*).

--
-- I'm old enough to have lived through six different meanings of the word "hacker."
Re:Feature Request by iangoldby · 2007-06-03 08:51 · Score: 1

The article summary even implies that the reasons for not providing a filter to remove shopping sites are not technical:
A final check encourages "diversity" in the results, for example, a manufacturer's page, a blog review, and a comparison shopping site.
So if they have an algorithm to ensure that the results contain a good mix including comparison shopping sites, doesn't that imply that they could technically provide exactly the kind of switch that the parent poster asked for - i.e. to exclude those comparison sites?

Or does the fact that so many searches return a first page entirely made up of comparison shopping sites indicate that the "diversity check" simply doesn't work properly yet?
Re:Feature Request by Tarqwak · 2007-06-03 10:29 · Score: 1

My ongoing gripe with Google is the number of times when the first page is filled with shopping sites, "review" pages, and click through pages

Then create your own Google Custom Search Engine, or use an existing one such as Google Search Excluding Shops, which excludes 700+ hand-picked shopping and spam sites and gives a ranking boost to 160+ websites of IT and other electronics companies.
Re:Feature Request by slacka · 2007-06-03 19:41 · Score: 1

agreed. I hate how these useless sites come up when I'm looking for computer hardware reviews
Re:Feature Request by monk.e.boy · 2007-06-04 00:35 · Score: 1

This (http://www.myserp.com/) probably does it better.

It does this:

In November 2004 we built the first version of MySERP our aim was to help us find more interesting things via search by simply taking out the websites we already knew about or weren't interested in. Instead of opting in to a set of websites or web pages our theory is it will be easier to just opt out of them by default - no more shopping comparison sites, no more affiliate link sites, no more Pay Per Click ad sites.

Pretty cool huh?

monk.e.boy

--
Open source, flash charts
Re:Feature Request by jamiethehutt · 2007-06-04 01:05 · Score: 1

Get google search history. It remembers your searches and what you've clicked on and will try to tailor your results to you. Now when I search for anything Java I get Sun's stuff coming up first, when Wikipedia has an article on anything I search for it's in the first 5 results, and if I search for a piece of hardware I'll get pages on linux support of said hardware first. It's not perfect: if you search for something you don't usually search for, you're back with all the junk. But it works quite well.

The tinfoil hat people will hate it though.
Re:Feature Request by tirnacopu · 2007-06-04 09:30 · Score: 1

You can also edit this into C:\Program Files\Mozilla Firefox\searchplugins\google.xml (or wherever you have this on your system).

Many other things are goo(gle)d by Xoq+jay · 2007-06-03 03:14 · Score: 3, Interesting

Pagerank is the source of all wisdom in google... but there is so much more... Like string searching & matching algos, file searching.. you name it.. Just the other day I was searching for books about Google's algorithms... I found zero interesting stuff.. They keep their algorithms secret and out of the public domain... (like they should..). we praise Pagerank, but if we knew what other stuff is there, we would all be members of Church of Google (http://www.thechurchofgoogle.org/) :P

--
God had a 7 day deadline... So he made the world in LISP

Re:Many other things are goo(gle)d by Glacial+Wanderer · 2007-06-03 03:58 · Score: 1

Why do so many people so strongly believe that Google needs to keep their page ranking algorithm secret? Couldn't the argument be made that keeping their algorithm secret is analogous to encryption by obfuscation? I don't have a strong opinion one way or another, and maybe I'm missing some simple reason that invalidates this comparison. Perhaps people just feel that it's impossible to come up with a ranking algorithm that can't be cheated without using obfuscation?

--
Hobby Robotics
Re:Many other things are goo(gle)d by mattpointblank · 2007-06-03 04:09 · Score: 3, Insightful

Could it not simply be that they're not keeping it under wraps to avoid sneaky webmasters manipulating their sites, but to prevent competitors gaining an edge?
Re:Many other things are goo(gle)d by Glacial+Wanderer · 2007-06-03 04:23 · Score: 2, Interesting

I would agree that's likely the reason that Google won't release their algorithm, but my question was why many people outside of Google insist that Google should keep their algorithm secret. If Google in a moment of financial insanity released their search algorithms to their competition it wouldn't decrease the quality of my search results, actually that might improve my results if someone takes Google's algorithm and improves on it.

--
Hobby Robotics
Re:Many other things are goo(gle)d by Tickletaint · 2007-06-03 04:38 · Score: 1

GPLv4 will address this by forcing Google to reveal the PageRank algorithm. Mark my words.

--
Make Slashdot readable! See journal.
Re:Many other things are goo(gle)d by Anonymous Coward · 2007-06-03 05:15 · Score: 0

It's time to start an OpenPagerank project.
Re:Many other things are goo(gle)d by Anonymous Coward · 2007-06-03 06:04 · Score: 0

Think of chess. If your opponent tells you where they are going to move before the game begins there isn't any competition.

Obfuscation would be telling your opponent the moves you are going to make and playing entirely differently. (Note to Google, I've got a flock of finches at my disposal, I'm coming to get you!)
Re:Many other things are goo(gle)d by chainLynx · 2007-06-03 06:08 · Score: 2, Informative

Uh... http://labs.google.com/papers.html
Re:Many other things are goo(gle)d by Anonymous Coward · 2007-06-03 16:40 · Score: 0

If you weren't ac, you'd be +5 insightful by now.

Now I understand by Timesprout · 2007-06-03 03:15 · Score: 5, Funny

Search over the last few years has moved from Give me what I typed to Give me what I want, says Mr. Singhal

So this is why all my results are links to lesbian porn regardless of what I search for.

--
Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
What truth?
There is no dupe

Re:Google sucks. by WrongSizeGlass · 2007-06-03 03:17 · Score: 4, Funny

Google Search is a primitive tool used by fanboys "Googling" for pictures of Natalie Portman.
Ha! Shows what you know. The only pics I search for are of a tall drink of Texas water named Patricia Vonne and of Cowboy Neal in his homemade Hulk costume. Who knew the Hulk wore a tri-corner hat & rainbow wrestling boots?

Googling Uncommon Characters and Exact Phrases by Anonymous Coward · 2007-06-03 03:24 · Score: 3, Interesting

One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages. A google search for "ruby <<" returns no results related to the ruby append operator. A simple search for "<<" by itself returns ZERO results.

Re:Googling Uncommon Characters and Exact Phrases by Dun+Malg · 2007-06-03 04:05 · Score: 3, Informative

One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages. A google search for "ruby <<" returns no results related to the ruby append operator. A simple search for "<<" by itself returns ZERO results.
Yes, well you see that's a problem common to most search systems. Non-alphanumeric characters tend to be reserved for search logic. It would indeed be nice if there was a way to force literals into the search terms, but for now we just have to make do the way we always have: search for ruby append instead, or (if you don't know what it's called) search for ruby string operators and find out.

--
If a job's not worth doing, it's not worth doing right.
Re:Googling Uncommon Characters and Exact Phrases by Animats · 2007-06-03 04:11 · Score: 3, Informative

Yes. Try to find information on the web about the language "C+@". It's real, and it was developed at Bell Labs some years ago back in the Plan 9 era, but it's unsearchable.
Re:Googling Uncommon Characters and Exact Phrases by Blikkie · 2007-06-03 04:18 · Score: 2, Insightful

One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages.

You should try google code search.
Re:Googling Uncommon Characters and Exact Phrases by Tickletaint · 2007-06-03 04:50 · Score: 1

So how does Google know to tailor its results for C, C++, and C#, which all return results specific to the requested language, but not for C+@?

--
Make Slashdot readable! See journal.
Re:Googling Uncommon Characters and Exact Phrases by drix · 2007-06-03 05:46 · Score: 2, Insightful

I have the same problem. But if you're searching for actual code, you're better off using a code search engine. Or as others have pointed out, search "ruby append operator" if you're interested in the concept.

--

I think there is a world market for maybe five personal web logs.
Re:Googling Uncommon Characters and Exact Phrases by Animats · 2007-06-03 06:12 · Score: 1

So how does Google know to tailor its results for C, C++, and C#, which all return results specific to the requested language, but not for C+@?
Manually implemented special cases, perhaps. Or Google may not consider the possibility that "@" can be part of a word, which is likely.
Re:Googling Uncommon Characters and Exact Phrases by Jugalator · 2007-06-03 06:27 · Score: 1

Non-alphanumeric characters tend to be reserved for search logic.

True, but I'd hope that at least using quotation marks to search for phrases would also include special characters.

I mean, there can't be any search logic inside quotes anyway; then that would be part of the phrase.
Like "Apples or oranges" won't search for either apples or oranges, but the actual phrase.

--
Beware: In C++, your friends can see your privates!
Re:Googling Uncommon Characters and Exact Phrases by Spy+Hunter · 2007-06-03 06:54 · Score: 2, Interesting

This is an interesting question that I've often wondered about. It's possible that Google programmers simply went in and special-cased C++ and C#, but I personally think that Google has an automated process which notices that "C++" and "C#" are commonly occurring both in web pages and queries, and then automatically adds them to the list of "strange" tokens to index.
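
To make that guess concrete, here is a rough Python sketch of what such an automated process could look like. It is pure speculation about the idea in this comment, not Google's method: it counts raw tokens that mix letters with punctuation, promotes the frequent ones to "special" tokens, and keeps those whole when tokenizing. The function names and the min_count cutoff are invented for illustration.

```python
import re
from collections import Counter

EDGE_PUNCTUATION = ".,;:!?()\"'"


def learn_special_tokens(documents, min_count=2):
    """Collect raw tokens that mix alphanumerics with punctuation and occur often."""
    counts = Counter()
    for doc in documents:
        for raw in doc.lower().split():
            raw = raw.strip(EDGE_PUNCTUATION)
            if re.search(r"[a-z0-9]", raw) and re.search(r"[^a-z0-9]", raw):
                counts[raw] += 1
    return {token for token, n in counts.items() if n >= min_count}


def tokenize(text, special_tokens):
    """Keep learned special tokens whole; otherwise drop punctuation."""
    tokens = []
    for raw in text.lower().split():
        raw = raw.strip(EDGE_PUNCTUATION)
        if raw in special_tokens:
            tokens.append(raw)
        else:
            tokens.extend(re.findall(r"[a-z0-9]+", raw))
    return tokens


if __name__ == "__main__":
    docs = ["I like C++ a lot.", "C++ and C# are languages", "C# is fine", "C+@ is obscure"]
    special = learn_special_tokens(docs)
    print(tokenize("C++ vs C+@", special))
```

Under this scheme a rare name like C+@ never reaches the cutoff, so it collapses to plain "c", which would match the unsearchable behaviour described upthread.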

--
main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}
Re:Googling Uncommon Characters and Exact Phrases by larry+bagina · 2007-06-03 09:36 · Score: 1

google code doesn't discriminate against punctuation characters. (You can even do a regex search).

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:Googling Uncommon Characters and Exact Phrases by Anonymous Coward · 2007-06-03 11:34 · Score: 0

Have a look at http://www.google.com/codesearch
You can specify a query like "lang:ruby <<"
Re:Googling Uncommon Characters and Exact Phrases by zobier · 2007-06-03 19:07 · Score: 1

Google code search lets you search using regular expressions -- but only within code not the whole web AFAIK.

--
Me lost me cookie at the disco.

One search feature by Z00L00K · 2007-06-03 03:27 · Score: 5, Interesting

that has been lost was the "NEAR" keyword that AltaVista used earlier. I found it rather useful.

This could allow for a better search result when using for example "APPLE NEAR MACINTOSH" or "APPLE NEAR BEATLES"

Ho hum... Times change and not always for the better...

--
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.

Re:One search feature by AlXtreme · 2007-06-03 04:45 · Score: 1

Clusty does something similar. Searching for "Apple" will show categories for OSX and fruit, for instance.

--
This sig is intentionally left blank

Wow by Anonymous Coward · 2007-06-03 03:28 · Score: 0

All that education, and to be in charge of an algorithm! Damn life is cruel.

Toilet seat by rbarreira · 2007-06-03 03:38 · Score: 3, Funny

Does the algorithm account for the toilet seat's position?

--

The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F

Google is human too by polarbeer · 2007-06-03 04:27 · Score: 5, Insightful

One interesting thing about the article was the down-to-earth lack of abstraction in the problems described, such as the teak patio palo alto problem. Other search engines brag about their web-filtered-by-humans approach, as opposed to the "cold" algorithmic approach of Google. But it turns out Google is pretty human too, only with higher ambitions of creating generalizations from the human observations.

if only... by grikdog · 2007-06-03 04:40 · Score: 1

If only they could solve googlebombing on news.google.com by bloggers with right wing agendas. The left wing agendas seem to be gone already, for some reason.

--
``Tension, apprehension & dissension have begun!'' - Duffy Wyg&, in Alfred Bester's _The Demolished Man_

I think NEAR is implied by DragonHawk · 2007-06-03 05:07 · Score: 1

I think "NEAR" is implied with Google. That is to say, if you search for "apple macintosh", pages with those two terms in close proximity will rank higher than pages which simply contain the terms. Since Google's exact algorithms are proprietary, I cannot swear to this, but that seems to be the way it behaves in my own use.
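
As a rough illustration of what "implied NEAR" could mean, here is a small Python sketch that scores a page higher the closer two terms appear to each other. The scoring formula (one over the smallest gap in words) and the function names are assumptions made up for this example. Google's actual ranking is not public.

```python
def min_gap(tokens, a, b):
    """Smallest distance in tokens between any occurrence of a and any of b, or None."""
    best = None
    last_a = last_b = None
    for i, token in enumerate(tokens):
        if token == a:
            last_a = i
        if token == b:
            last_b = i
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best


def proximity_score(text, a, b):
    """Hypothetical score: 1.0 for adjacent terms, shrinking as they drift apart."""
    gap = min_gap(text.lower().split(), a, b)
    if gap is None:
        return 0.0  # at least one of the terms is missing from the page
    return 1.0 / max(gap, 1)


if __name__ == "__main__":
    print(proximity_score("apple macintosh computers", "apple", "macintosh"))
    print(proximity_score("apple pie recipe typed on my old macintosh", "apple", "macintosh"))
```

A real engine would presumably blend something like this with many other signals rather than rank on it alone.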

What I miss from Alta Vista is the ability to use grouping to set precedence, i.e., parentheses. I don't have to do this very often, but when I do, I really miss it. The need generally comes about when a given thing has a lot of different names or ways to describe it, and I want to say "this OR that OR (foo AND (bar OR baz))".

--

dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.

Re:I think NEAR is implied by porcupine8 · 2007-06-03 07:05 · Score: 1

This is definitely not always the case. I've had this problem a few times recently - the first page or two of results is a mix of a few useful sites and a lot of sites that happen to contain the two words, but on unrelated parts of the page. I have to dig through the results to find what I need. Especially if the unuseful sites are very popular ones and the ones I want are more obscure.

--
Warning: Apple/Nintendo fangirl. Likes her electronics cute & cuddly. May be rabid.

Re:Google... by JustOK · 2007-06-03 05:17 · Score: 1

but would they blend a whole Beowulf cluster of them, ????? and then profit?

--
rewriting history since 2109

A way to get that by i+kan+reed · 2007-06-03 05:52 · Score: 2, Informative

Wildcards in strings: "apple * macintosh" will return pages with the word macintosh shortly following apple. Not reversible, but still quite useful for that kind of search.

The most annoying thing about Google's results... by Shohat · 2007-06-03 06:02 · Score: 1, Insightful

I find it extremely annoying that Google indexes blogs.
Blogs are read only by bloggers and the press, and present absolutely no interest to normal people (including me). Currently, because of Google's idiotic blog fetish, I have to eliminate 50% of the results just based on URLs, hoping that I won't stumble upon someone's personal ramblings. Blogs became popular only due to Google's absolutely inexplicable love of blog content and its habit of sticking it into perfectly normal search results; it's like searching in a world-wide Myspace now.
The most amazing thing is when Google puts blog search results above the source of the story, to which the blogs are linking in the first place. I'm just waiting for this fad to die out like podcasting did. Unfortunately, Google controls the popularity of blogging, so it won't die out naturally; Google at least has to stop indexing them... or put a "show/hide blog results" checkbox...

--
My Starcraft 2 Blog

Re:The most annoying thing about Google's results. by aengblom · 2007-06-03 06:20 · Score: 1

Blogs are read only by bloggers and the press, and present absolutely no interest to normal people (including me).

Considering that you're reading a blog, I think it's pretty fair to say that you're only counting web pages that you think suck as blogs... so of course you don't like the results. Amazingly, no one is willing to tag their blog as "shohat will think this sucks, so please don't search me."

--

So close and yet so far from the world's perfect ID number

Page rank is only a part of the story by melted · 2007-06-03 06:25 · Score: 1

I'd like to know how they transform their queries before running them against the index. I.e. how they decide whether they should throw out the "stop" words (most prepositions, some verbs, some nouns) or keep them, whether they should throw in an alternative spelling or synonym, whether they should throw in a semantically related word or two to increase recall (this is evident when you search for something and get related words highlighted in the results), when to stem and when not to stem.
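
For a feel of the kind of transformation being asked about, here is a toy Python sketch of query rewriting: drop stop words (optionally), add a crude stem, and add synonyms. The stop-word list, the synonym table, and the suffix-stripping "stemmer" are all invented for illustration and say nothing about how Google actually does it.

```python
# Hypothetical tables; a real engine would learn or curate far larger ones.
STOP_WORDS = {"the", "a", "of", "in", "for"}
SYNONYMS = {"bio": ["biography"], "photo": ["picture"]}


def naive_stem(word):
    """Very crude stemmer: strip a trailing 'ing' or 's' if enough of the word remains."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def rewrite_query(query, keep_stop_words=False):
    """Return one list of alternatives per query term that survives stop-word removal."""
    rewritten = []
    for word in query.lower().split():
        if word in STOP_WORDS and not keep_stop_words:
            continue
        alternatives = [word]
        stem = naive_stem(word)
        if stem != word:
            alternatives.append(stem)
        alternatives.extend(SYNONYMS.get(word, []))
        rewritten.append(alternatives)
    return rewritten


if __name__ == "__main__":
    print(rewrite_query("the bio of apples"))
    print(rewrite_query("the who", keep_stop_words=True))
```

The keep_stop_words flag stands in for the hard part of the question: deciding when a "stop" word is actually the point of the query.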

Those are the things that keep them ahead. Page rank is pretty much solved by now, which is why this dude is allowed to talk about it even at this level of detail.

WRT page rank it'd be interesting to know how they train the classifiers and individual classifier weights. The problem is that human experiments are extremely expensive for this stuff.

Re:Page rank is only a part of the story by martin-boundary · 2007-06-03 14:42 · Score: 3, Informative

Read the article, it gives a pretty clear picture of what's going on if you're a little familiar with classification ideas, eg bagging, boosting etc. Don't read further if you're familiar with those terms.
A classifier is a black box which takes some data as input, and computes one or more scores. The simplest example is a binary classifier, say for spam. You feed some data (eg an email) and you get a score back. If it's a big score say, then the classifier thinks it's spam, and if it's a small score it's not spam. More generally, a classifier could give three scores to represent spam, work, home, and you could pick the best score to get the best choice.
So you should really think of a classifier as a little program that does one thing really well, and only one thing. For example, you can build a small classifier that looks if the input text is english or russian. That's all it does.
Now imagine you have 100 engineers, and each engineer has a specialty, and each builds a really small classifier to do one thing well. The logic of each classifier is black boxed, so from the outside it's just a component, kind of like a lego brick. What happens when you feed the output of one lego brick to the input of another lego brick?
Say you have three classifiers: english spam recognizer, russian spam recognizer, english/russian identifier. You build a harness which uses the english/russian identifier first, and then depending on the output your program connects the english spam recognizer or the russian spam recognizer.
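
The harness described here can be sketched in a few lines of Python. The three "classifiers" below are deliberately silly keyword checks with made-up word lists and a made-up threshold; the point is only the wiring, where the output of the language identifier decides which spam recognizer runs next.

```python
def language_identifier(text):
    """Toy classifier: 'russian' if any Cyrillic letter is present, else 'english'."""
    is_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
    return "russian" if is_cyrillic else "english"


def english_spam_score(text):
    """Fraction of words that appear in a tiny, hypothetical English spam list."""
    words = text.lower().split()
    hits = sum(w in {"free", "winner", "viagra"} for w in words)
    return hits / max(len(words), 1)


def russian_spam_score(text):
    """Fraction of words that appear in a tiny, hypothetical Russian spam list."""
    words = text.lower().split()
    hits = sum(w in {"бесплатно", "выигрыш"} for w in words)
    return hits / max(len(words), 1)


# The harness: the first classifier's output selects the second classifier.
SPAM_SCORERS = {"english": english_spam_score, "russian": russian_spam_score}


def is_spam(text, threshold=0.3):
    language = language_identifier(text)
    score = SPAM_SCORERS[language](text)
    return language, score >= threshold


if __name__ == "__main__":
    print(is_spam("free viagra winner"))
    print(is_spam("meeting notes for monday"))
```

Each scorer stays a black box to the harness, which is what lets separate engineers swap in a better brick without touching the rest.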
Now imagine a huge network with some classifiers in parallel and some classifiers in series. At the top there's the query words, and they travel through the network. One of the classifiers might trigger word completion (ie bio -> biography as in the article), another might toggle the "fresh" flag, or the "wikipedia" flag etc. In the end, your output is a complicated query string which goes looking for the web pages.
The key idea now is to tweak the choice thresholds. To do that, there's no theory. You have to have a set of standard queries with a list of the outputs the algorithm must show. Let's say you have 10,000 of these queries. You run each query through the machine, and you get a yes/no answer for each one, and you try to modify the weights so that you get a good number of correct queries.
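
A minimal Python sketch of that tuning loop, assuming the simplest possible setup: each test case is a classifier score plus the expected yes/no answer, and the search just tries a handful of candidate thresholds and keeps the one that gets the most cases right. The data and names are invented for illustration. Real systems would use far more cases and smarter optimisation.

```python
def count_correct(threshold, examples):
    """examples holds (score, expected) pairs; a case passes if (score >= threshold) == expected."""
    return sum((score >= threshold) == expected for score, expected in examples)


def tune_threshold(examples, candidates):
    """Return the candidate threshold that gets the most test cases right."""
    return max(candidates, key=lambda t: count_correct(t, examples))


if __name__ == "__main__":
    # Hypothetical labelled test set: (classifier score, should it fire?)
    test_set = [(0.9, True), (0.7, True), (0.4, False), (0.1, False)]
    print(tune_threshold(test_set, [0.2, 0.5, 0.8]))
```

This is a brute-force search over one knob; with many interacting weights the same idea holds but the search has to be cleverer.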
Of course you want to speed things up as much as possible. You can use mathematical tricks to find the best weights, and you don't need to fetch the actual pages if your output is a query string; you just compare the query string with the expected query string. But that would depend on your classifiers, the scheme used to evaluate the test results, and how good your engineers are.
The point is that there's no magic ingredient, it's all ad-hoc. Edison tried hundreds of different materials for the filament in his lightbulb. Google is doing the same thing according to the article. What matters for this kind of approach is a huge dataset (ie bigger than any competitors') and a large number of engineers (not just to build enough components, but to deprive its competitors of manpower). The exact details of the classifier components aren't too important if you have a comprehensive way of combining them.

Re:The most annoying thing about Google's results. by Shohat · 2007-06-03 06:52 · Score: 2, Insightful

Slashdot is as much of a blog as I am an Egyptian gerbil. Slashdot links to stories that generate discussions. Slashdot is NOT about the people that create the posts, but about the people that comment here.

--
My Starcraft 2 Blog

Break through! by ultimad · 2007-06-03 07:27 · Score: 1

From TFA:
>>A search-engine tweak gave more weight to pages with phrases like "French Revolution" rather than pages that simply had both words.

So, now search engines are giving more importance to connected words rather than scattered words. How refreshing!

Re:Break through! by mestar · 2007-06-03 10:32 · Score: 1

Youmeanlikethis? Or-like-this?

The Man Behind Google's Ranking Algorithm by evilviper · 2007-06-03 07:27 · Score: 1

Come now, everyone knows there's no man behind Google's page rank. It's handled entirely by an army of birds.

http://www.google.com/technology/pigeonrank.html

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

"Millions Of Black Boxes"? by aldheorte · 2007-06-03 07:34 · Score: 3, Interesting

Not sure about this:

"Google rarely allows outsiders to visit the unit, and it has been cautious about allowing Mr. Singhal to speak with the news media about the magical, mathematical brew inside the millions of black boxes that power its search engine."

I could see tens of thousands, maybe hundreds of thousands, but millions?

Re:"Millions Of Black Boxes"? by mestar · 2007-06-03 10:30 · Score: 1

I don't see any problems. Google's computers are powered by millions of tiny black rectangular box-shaped batteries.
Re:"Millions Of Black Boxes"? by asninn · 2007-06-03 22:21 · Score: 2, Informative

This is from a year ago (July 2006):

Google runs on hundreds of thousands of servers--by one estimate, in excess of 450,000--racked up in thousands of clusters in dozens of data centers around the world.

If this figure is accurate, a million boxen nowadays doesn't seem out of reach.

--
butter the donkey

Re:The most annoying thing about Google's results. by aengblom · 2007-06-03 08:08 · Score: 1

Slashdot is very very much a blog, which is a chronologically arranged web page. You're really bitching about personal home pages, which used to exist as regular ol' web pages, but now are blogs because they're easier to set up (no HTML required) and because the "chronological" nature of blogs works very well for journals.

If blogs didn't exist we'd just have more geocities pages getting lots of links.

--

So close and yet so far from the world's perfect ID number

Re:Google... by WilliamSChips · 2007-06-03 08:09 · Score: 1

In the time it took you to post that comment, David Banh finished medical school.

--
Please, for the good of Humanity, vote Obama.

How does it work by Anonymous Coward · 2007-06-03 08:11 · Score: 5, Informative

It is rather simple (I am an insider).

Google breaks pages into words. Then, for every word it keeps a set which contains all the pages (by hash ID) that contain that word. A set is a data structure with O(1) lookup.

When you search for "linux+kernel" google just does the set union operation on the two sets.

Now a "word" is not just a word. If Google sees that many people use the combination linux+kernel, a new word is created, the linux+kernel word, and it has a set of all the pages that contain it. So when you search for linux+kernel+ppp we find the union of the linux+kernel set and the "ppp" set.

So every time you search, you make it better for google to create new words. And this is part of the power of this search engine. A new search engine will need some time to gather that empirical data.

Of course, there are ranks of sets. For example, for the word "ppp" there are, say, two sets. The pages of high rank that contain the word ppp, and the pages of low rank. When you search for ppp+chap, first you get the set union of the high rank sets of the two words, etc.

Now page rank has several criteria. Here are some:
well ranked site/domain, linked by well ranked page, document contains relevant words, search term is in the title or url, page rank not lowered by Google employee (level 1), page rank increased, etc.

It is not very difficult actually.

(posting AC for a reason).
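
A toy version of the scheme described above, for anyone who wants to poke at it. This is a sketch of a textbook inverted index, not a claim about Google's internals; `build_index`, `match_all`, `match_any`, and the sample pages are all hypothetical. One caveat: the comment says "set union", which returns pages containing any of the terms; pages containing every term come from set intersection, so the sketch shows both.

```python
from collections import defaultdict

def build_index(pages, compound_terms=()):
    """Map each word (and each promoted compound 'word' such as
    'linux+kernel') to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        words = text.lower().split()
        for w in words:
            index[w].add(page_id)
        for term in compound_terms:
            parts = term.split("+")
            n = len(parts)
            # A compound matches only if its parts are adjacent and in order.
            if any(words[i:i + n] == parts for i in range(len(words) - n + 1)):
                index[term].add(page_id)
    return index

def match_all(index, terms):
    """Pages containing every term (set intersection)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def match_any(index, terms):
    """Pages containing at least one term (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))

pages = {
    1: "linux kernel ppp driver",
    2: "linux desktop themes",
    3: "ppp over kernel linux",
}
index = build_index(pages, compound_terms=["linux+kernel"])
strict = match_all(index, ["linux+kernel", "ppp"])
loose = match_all(index, ["linux", "kernel", "ppp"])
```

Here `strict` keeps only the page where "linux kernel" appears as a unit, while `loose` also admits the page where the words are merely scattered.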

Re:How does it work by Xoq+jay · 2007-06-03 09:36 · Score: 1

That's just awesome.. I never read that anywhere

that is cleverly simple actually!

well explained

Thank you!

--
God had a 7 day deadline... So he made the world in LISP
Re:How does it work by Anonymous Coward · 2007-06-03 20:47 · Score: 0

Yeah, well it's not rocket science. That's similar to how other search companies do it. One advantage Google (or any other web search company for that matter) has is that almost everything is pre-computed and is just a lookup at query time. Google's not going and computing relevance/rank/etc and those other 200 signals for your query at runtime; it's all indexed with all possible combinations and is just a lookup at query time. You just keep asynchronously updating this gigantic hash table, distributed across a million machines. That being said, it is not an easy task to maintain such a huge distributed system with the large query volume that they get and the uptime requirements they have.
Re:How does it work by Anonymous Coward · 2007-06-04 01:19 · Score: 0

Not a big deal.

There are two instances of the data: the public database which is the searchable database and the internal one. The googlebots update the internal database. Every once in a while we flush differences to the public database.

For example, if the word linux+kernel points to a set of documents that include it, after 24 hours, if that set is modified, we make a quick lock and make it point to the new set. The old set is freed by the reference counter.

Also, note that the set union is a parallelizable operation.
Re:How does it work by Anonymous Coward · 2007-06-04 05:42 · Score: 0

You don't really need a lock to point to the new set, a pointer swap followed by a delayed delete of the old set should do it. But, then again, that's almost zero cost anyway.
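
The swap-instead-of-lock idea looks roughly like this in Python. It is a single-process sketch with hypothetical names (`PublishedIndex`, `snapshot`, `publish`), and it leans on an assumption about CPython specifically: rebinding an attribute is a single reference store, and the old object is reclaimed by reference counting once no reader still holds it.

```python
class PublishedIndex:
    """Holds a reference to the current immutable index snapshot."""

    def __init__(self, initial):
        self._current = initial

    def snapshot(self):
        """Readers grab the reference once and keep using that snapshot."""
        return self._current

    def publish(self, new_index):
        """Writers build a complete new snapshot elsewhere, then swap it in.
        Nothing is mutated in place, so readers never see a half-updated set."""
        self._current = new_index


public = PublishedIndex({"linux+kernel": frozenset({1, 2})})

in_flight = public.snapshot()          # a query that started before the update
public.publish({"linux+kernel": frozenset({1, 2, 3})})
fresh = public.snapshot()              # a query that started after the update
```

The in-flight query finishes against the old set; once it drops its reference, the old set is freed, which is the "delayed delete" described above.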

Algorithm? by Anonymous Coward · 2007-06-03 08:43 · Score: 0

This does not seem like an algorithm anymore, it is more of a heuristic. An algorithm can be proved to be correct and its running time can be analyzed. An algorithm is provably correct whereas a heuristic just works for practical purposes.

Re:Algorithm? by mestar · 2007-06-03 10:20 · Score: 2, Insightful

So what do you call the "thing" that you use to implement a heuristic?

do no evil? by mestar · 2007-06-03 09:28 · Score: 1

"But last year, Mr. Singhal started to worry that Google's balance was off. When the company introduced its new stock quotation service, a search for "Google Finance" couldn't find it. After monitoring similar problems, he assembled a team of three engineers to figure out what to do about them."

But then they changed the algorithm and now Google Finance site is at the top.

Re:do no evil? by adpowers · 2007-06-03 10:00 · Score: 1

Are you saying you don't think the official website of a product should return as the first result in a search for that product?
Re:do no evil? by mestar · 2007-06-03 10:18 · Score: 1

I guess I'm saying that Google should not change its algorithm just to boost their own rankings.
Re:do no evil? by shird · 2007-06-03 11:22 · Score: 1

Exactly... they wouldn't bend over backwards to change their algorithm when someone else's product doesn't rank 1st for a search, especially if it was just new. Only when it's their own do they think 'something needs to change'.

If they just gave it a few months people would link to it, it would get older etc and its ranking would boost over time. That is the stock response they would give to anyone else that complained. I don't know why they think their algorithm has to list their product as first overnight just because it's theirs.

--
I.O.U One Sig.
Re:do no evil? by mwvdlee · 2007-06-03 19:31 · Score: 1

Which is what you'd expect if your search query was "google finance", as the article states.

--
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?

Single Quotes by Mana+Mana · 2007-06-03 10:20 · Score: 1

> "NEAR" keyword

Isn't that what the single quote (') construct is for: 'widget offbeat'

IC's, perhaps by bill_mcgonigle · 2007-06-03 11:41 · Score: 1

I could see tens of thousands, maybe hundreds of thousands, but millions?

It's in Google's interest to have competitors think of it as bigger than it is.

So, if they count each IC on a mobo or drive controller, they probably do have millions of black boxes at Google, literally.

Alternately, they could be talking about algorithms, instances thereof, etc., though I like the black IC's better.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

I'm familiar with all this stuff by melted · 2007-06-03 15:22 · Score: 2, Interesting

And the thing that I want to know is how they evaluate the results. I actually do research in this space right now, and by far the most painful thing is evaluation of results. We have a system that automates most of the work, but there's still a lot of human involvement, and this limits the input dataset size and speed with which we can iterate the improvements.

Re:I'm familiar with all this stuff by martin-boundary · 2007-06-03 16:25 · Score: 3, Interesting

Good question. I agree with you that the article doesn't say anything valuable in this respect :(
When you say that your system is limited by human involvement, I presume you mean that implementing new features can have serious impact on the overall design (and therefore on testing procedures)? Feel free to not answer if you can't.
One thing I found interesting in the article is that Google's system sounds like it scales well. It reminded me of antispam architectures like Brightmail's (if memory serves), which have large numbers of simple heuristics which are chosen by an evolutionary algorithm. The point is that new heuristics can be added trivially without changing the architecture. I think their system used 10,000 when they described it a few years ago at an MIT spam conference. Adjustments were done nightly by monitoring spam honeypots.
I'd love to see better competition in the search engine space. I hope you succeed at improving your tech.

Page Rank is a HW assignment by ghoul · 2007-06-03 16:42 · Score: 1

Dude, most of the things he talked about are taught in any decent Web Search or Machine Learning course. He is not disclosing any secrets, and Page Rank is actually a 5 day homework assignment, not a life's work. Google has gone far beyond Page Rank, and Page Rank is just the dummy Google likes to wave about so that people are busy trying to beat Page Rank and not their real classifiers. And classifiers are a dime a dozen. Tying them up with efficient network and database resources is Google's key contribution. Rest assured the reason Google is doing well is kept pretty secret. (Hint: It's the database and network algos which allow the maintenance of huge databases and indexes in a distributed manner). BTW if you don't believe Page Rank is a HW assignment, look at Dr. Mooney's course page here http://www.cs.utexas.edu/~mooney/ir-course/

--
**Life is too short to be serious**
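
To back up the "homework assignment" point, here is roughly what the textbook version looks like: plain power iteration over a link graph. It is a sketch of the published PageRank idea, not the linked course's assignment and certainly not Google's production code; the damping factor 0.85, the iteration count, and the toy graph are conventional or made-up choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: "c" is linked from a, b and d.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

The hard part, as the comment says, is doing this over billions of pages on distributed hardware, not the iteration itself.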

The most informative line in the article... by rmadhuram · 2007-06-03 17:17 · Score: 1

One of the New Yorkers munched on cake.

Old google data by able1234au · 2007-06-03 21:04 · Score: 1

I find it frustrating when I am searching for free market data, often available in the form of press releases or summaries of whitepapers. Things such as the size of a particular software or appliance market.

When I search, Google usually gives me information from 2001, 2002, 2003, and it is hard to tell it I want only data from 2006/2007. The problem is that the sites that end up in the search constantly refresh the ads and links around their old stories, which makes Google think it's fresh.

This was not a big problem when most of the internet content was no more than a year old. This problem will get worse unless Google is smarter about recognising that the core content of a page (magazine story, whitepaper etc) was written in 2002 and it is now out of date and should be further down the list than something written in 2007.

Found by furbearntrout · 2007-06-03 21:28 · Score: 1

Try allintitle: worked for me! It was on the first page. (Well, one link away; however the text "C+@" _was_ in the description text)
Also try calico. (aka)

--
Crap. What did the new CSS do with the "Post anonymously" option??

MoneyRank by antikronos · 2007-06-03 21:38 · Score: 1

There is only one algorithm that really matters:
For each page in results
    if page.HasAdwords=true and Not page.content=junk
        page.MoneyRank = page.clickthroughrate * page.AdwordsValue
        results.add page
    else
        Ignore
    Endif
next page

results order by MoneyRank DESC

maybe its dream interpretation like ancient egypt by Anonymous Coward · 2007-06-06 09:37 · Score: 0

In ancient Egypt, they used dream interpretation to study intelligence and fate. Maybe they should change the algorithm to incorporate http://www.dreamcrowd.com/dream_interpretation

Slashdot Mirror

115 comments