Slashdot Mirror


The Math Behind PageRank

anaesthetica writes "The American Mathematical Society is featuring an article with an in-depth explanation of the type of mathematical operations that power PageRank. Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods. And because the links constituting the web are constantly changing and updating, the relevance of pages needs to be recalculated on a continuous basis."

26 of 131 comments

  1. 10,000 words by ambivalentduck · · Score: 5, Funny

    But 9,000 of those words are slang for parts of the human anatomy.  Go figure.

  2. PageRank doesn't seem to be based on keywords by dada21 · · Score: 4, Informative

    I have sites with a PR of 6, and I can tell you that they got that way because of inbound links from other sites. In fact, when other sites dropped those links, my PR dropped (to 5, and even to 4). Getting more inbound links brought the PR back.

    Think about those links, too. How often do you use common words in an HREF? I don't think there's a lot of weeding out of common words since the link to a site is usually either its name, or a description containing some important keywords.

    I love seeing these technoscientists think they understand PageRank, but just like TimeCube, they're way, way off.

    1. Re:PageRank doesn't seem to be based on keywords by markov_chain · · Score: 2, Informative

      There has been a PageRank paper out there since 2000 or so, so it's not exactly a secret how it works. Basically an initial set of relevant pages is pulled from the database and ranked by doing some computation on a connectivity matrix. The trick is to come up with a good initial set; and unless they managed to implement an all-knowing oracle they probably do it by doing a keyword search. Here's where the article summary makes sense; if most pages have the same keywords, a keyword search is going to come up with an awfully large initial set.
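      A rough sketch of that two-stage idea, assuming a toy in-memory index: first pull every page containing all the query words, then order the hits by a precomputed link score. The page texts, the scores, and the names `search`, `pages` and `link_score` are made up for illustration; nothing here is taken from the paper or from Google.

```python
def search(query, pages, link_score):
    """Return URLs whose text contains every query word, best link score first."""
    words = query.lower().split()
    hits = [
        url
        for url, text in pages.items()
        if all(word in text.lower().split() for word in words)
    ]
    # Stage two: order the keyword hits by their (assumed, precomputed) link score.
    return sorted(hits, key=lambda url: link_score[url], reverse=True)


# Hypothetical three-page index and made-up link scores.
pages = {
    "a.example": "the math behind pagerank",
    "b.example": "pagerank and the web graph",
    "c.example": "cooking with the wok",
}
link_score = {"a.example": 0.2, "b.example": 0.5, "c.example": 0.3}

results = search("the pagerank", pages, link_score)
common_word_hits = search("the", pages, link_score)
```

      A very common word such as "the" matches every page in this toy index, which is the large-initial-set problem in miniature.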

      The article might have details, maybe someone who has actually read it can fill in :)

      --
      Tsunami -- You can't bring a good wave down!
    2. Re:PageRank doesn't seem to be based on keywords by Anonymous Coward · · Score: 3, Interesting

      It's not secret.

    3. Re:PageRank doesn't seem to be based on keywords by kimvette · · Score: 4, Funny
      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
  3. Bad summary by Knights+who+say+'INT · · Score: 5, Interesting

    The article specifically says the PageRank eigenvector is only recalculated once a month, approximately. Even though Google uses some clever numerics to calculate the eigenvectors to a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.

    1. Re:Bad summary by The+Zon · · Score: 2, Funny

      Even though Google uses some clever numerics to calculate the eigenvectors to a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.

      Please. I can do that on paper in, like, five minutes.

      --
      Some attitudes replaced or by cgi optimizes
    2. Re:Bad summary by martin-boundary · · Score: 4, Insightful
      It's nowhere near like that. A web matrix is very sparse, so if you did a true 25Bx25B matrix power iteration, you'd be multiplying zero by zero a gazillion times. Optimization is about not doing things you don't need to do, and optimizing PageRank is about figuring out clever ways to not do the full multiplication. Moreover, PageRank is calculated in parallel over a computer farm. Overall, you can expect a single iteration to take on the order of an hour, and you can expect around 50-80 iterations before Google gives up and says it's converged. You can also try and reuse the previous "converged" PageRank vector to cut down on the 50-80 iterations after you've crawled new pages.

      If Google used a single computer to do all the work, and truly did 80*25B^2 operations, they'd be morons.
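      A minimal sketch of the sparse iteration described above, in plain Python on a four-page toy web. The damping factor of 0.85, the even spreading of rank from pages without outgoing links, and every name here are assumptions for illustration, not Google's actual code. Only links that exist are visited, and a previous result can be passed in as the starting vector.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200, start=None):
    """Power iteration over adjacency lists; zero entries are never touched.

    links maps each page to the list of pages it links to.
    start is an optional earlier rank vector to warm-start from.
    Returns (rank dict, number of iterations performed).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    if start is None:
        rank = {p: 1.0 / n for p in pages}
    else:
        # Pages missing from the old vector (newly crawled) get a uniform guess.
        raw = {p: start.get(p, 1.0 / n) for p in pages}
        total = sum(raw.values())
        rank = {p: value / total for p, value in raw.items()}

    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Assumption: rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict.fromkeys(pages, base)
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank, iterations


# Hypothetical toy web: a links to b and c, b to c, c to a, d to c.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
cold, cold_iters = pagerank(web)              # uniform starting vector
warm, warm_iters = pagerank(web, start=cold)  # reuse the converged vector
```

      On an unchanged graph the warm start should stop after far fewer passes than the cold one. After a real crawl the graph has changed, so the saving would be smaller.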

  4. Nouns maybe? by Bryansix · · Score: 3, Insightful

    It seems like it would be the nouns, pronouns, etc. that Google should be paying attention to. Who cares about all the verbs, adjectives, etc. that just muddy the indexing waters?

    1. Re:Nouns maybe? by abshnasko · · Score: 2, Insightful

      Searching for "pill" and "the pill" should yield very different results. Yes, nouns are more important, but articles and other words cannot be disregarded.

  5. I joke a lot on Slashdot, but serious question by CrazyJim1 · · Score: 2, Interesting

    I skimmed the article and didn't find what I wanted to find. If you make a webpage that you want ranked high, what do you do? Do you make 100 geocities accounts and provide links to your main website, or what? I'm just wondering this out of curiosity, not out of need.

    1. Re:I joke a lot on Slashdot, but serious question by Anonymous+Brave+Guy · · Score: 5, Informative

      The underlying idea behind page rank is pretty well-exposed at this point, and is described in TFA. Essentially, it's a big set of simultaneous equations: each incoming link to your page gets a score that is roughly the rank of the source page divided by the number of outgoing links on that page, and then the rank of your page is roughly the sum of the scores of all incoming links.
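      Written out, the rough relation in the paragraph above is the following, where r(p) is the rank of page p, B(p) is the set of pages linking to p, and L(q) is the number of outgoing links on page q. The symbols are chosen here for illustration, and this is only the basic form, before any fudge factors.

```latex
r(p) \;=\; \sum_{q \in B(p)} \frac{r(q)}{L(q)}
```

      There is one such equation per page, which is what makes it a set of simultaneous equations.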

      Various fudge factors are introduced along the way. For example, if you break Google's rules about displaying the same content to bots as to humans, you can get slapped right down. More subtly, newly registered domains take a modest hit for a while. More nobody-knows-ly, Google's handling of redirects is unclear: information about exactly what adjustments are made is pretty scarce, and there's a lot of conjecture around. One thing that's pretty certain is that they penalise for duplicate content, which is why some webmasters do apparently unnecessary things like redirecting http://www.theircompany.com/ to http://theircompany.com/ or vice versa.

      So, if you want to get a page with a high rank yourself, then ideally you would get many established, highly-ranked pages to link to your page and no others. In your example, all those Geocities sites wouldn't help a lot, because (a) they'd have negligible rank themselves, and (b) they'd be penalised for being new and lose some of that negligible rank before they even started. Many times negligible is still negligible, and so would be your target page's rank. OTOH, get a few links from university sites, big news organisations and the like, and your rank will suddenly be way up there. Alternatively, get a grass-roots movement going where a gazillion individuals with small personal sites link to you, and the cumulative effect will kick in.
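      A back-of-the-envelope version of that comparison, using the rough "rank of the source divided by its outgoing links" score from earlier in this comment. All the numbers are made up purely for illustration; real ranks are not on this scale and the new-domain penalty is ignored.

```python
def incoming_score(sources):
    """Sum rank / outgoing-link count over (rank, outlinks) pairs."""
    return sum(rank / outlinks for rank, outlinks in sources)


# Hypothetical: 100 throwaway pages with negligible rank, one outgoing link each.
throwaway = [(0.001, 1)] * 100
# Hypothetical: 3 established pages with high rank, ten outgoing links each.
established = [(5.0, 10)] * 3

throwaway_total = incoming_score(throwaway)
established_total = incoming_score(established)
```

      With these made-up figures the three established links are worth about fifteen times as much as the hundred throwaway ones.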

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    2. Re:I joke a lot on Slashdot, but serious question by TheLink · · Score: 2, Interesting

      "if you break Google's rules about displaying the same content to bots as to humans"

      I notice many sites that do that and don't get slapped down - esp subscription sites. And it seems Google doesn't cache those, so it's probably collusion.

      You see the keywords and paragraphs in the search, but click on it you get a login page.

      They should have to pay a special rate or be marked differently from the other search results. It's a waste of time otherwise.

    3. Re:I joke a lot on Slashdot, but serious question by oni · · Score: 4, Interesting

      I notice many sites that do that and don't get slapped down - esp subscription sites.

      I wonder, if I changed my useragent to be whatever the googlebot reports itself to be - would I get by the registration screen on websites like the NYTimes??

    4. Re:I joke a lot on Slashdot, but serious question by l0cust · · Score: 2, Insightful

      Thanks for the informative post. I have one question though. How does it help find the relevant information unless that information just happens to be on a popular page too? What I mean to say is that the idea behind grading/filtering systems like PageRank is to provide the most relevant information about the thing you are trying to search for on the net. Now suppose Mr. A is looking for some obscure Indian text written in Sanskrit and Mr. B has (recently or not) put up a website with that text as one of the contents, but it's not a popular blogsite, nor a mainstream ebook source site etc. And there are a gazillion hugely popular sites out there which mention that particular text while talking about a totally different book or text.

      So it means the only way that Mr. A will come across Mr. B's website is if he keeps looking through 100s of result pages or if he just chances upon it via something he read about earlier. Doesn't that defeat the purpose of making the searches more relevant, especially since lots of webmasters actively use the PageRank system to get better ranking in the search index? Where does that leave the people with worthwhile content but not much popular backing?

      (I would like to make clear that I am not trying to knock this system or anything, I am just curious about the implications for small-but-good-content website owners)

      --
      Politicians and Pedophiles: Two groups of exploitive bastards who are most dangerous when they're thinking of children.
    5. Re:I joke a lot on Slashdot, but serious question by XorNand · · Score: 2, Informative

      As pointed out, the Times site isn't fooled, but there are a good many out there that are. Sometimes when you do a Google search, one of the results will contain a keyword or two. However, when you click on the link, you'll find yourself redirected to a subscription page. Useragent spoofing can frequently show you the same page that Google indexed.

      If you're a FF user, grab the Useragent Switcher extension and add in a UA of "Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)". You'll then be two clicks away from seeing what was previously registration-only.
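      For anyone curious what the extension is doing under the hood, it just changes one request header. A minimal sketch using Python's standard urllib, assuming a placeholder URL; it only builds the request and never sends it, and `build_request` is a name made up here. As noted above for the Times, a site can check more than this header, so this is no guarantee of anything.

```python
import urllib.request

# The user-agent string quoted in the comment above.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Build (but do not send) a request carrying a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder URL for illustration; nothing is fetched here.
request = build_request("http://www.example.com/")
```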

      --
      Entrepreneur : (noun), French for "unemployed"
    6. Re:I joke a lot on Slashdot, but serious question by suggsjc · · Score: 2, Interesting

      Here is an email, with the associated response, that I received from Google on roughly this topic.

      This is a very general question. I'm creating a website. It is going to be a blogging platform. Obviously, the content of the site(s) is the most important thing. I've already started making the content of my site dynamic in the sense that I tailor it to the requesting agent (via the user-agent header). My intention for doing this is to make sure that the content renders correctly for *any* browser that accesses the site. I've built the site modularly, so tailoring the content to the requesting agent isn't a big deal. However, that leads me to my question(s) and the reason I am emailing you. FYI, I have no ulterior motives for being able to tailor my content, other than making sure that the user gets the most useful information.

      That said, when a "bot" (i.e. your crawler) accesses my site, I'm going to treat it like I would a mobile browser. I'm going to give the minimal markup and the CSS will be very simple. I'm going to make sure that my content comes before my navigation, advertisements, etc. in the source.

      My real question is does the fact that I'm presenting you the content of my site differently from other browsers make a difference? If so, (then again my reasoning is to make sure that my users get the correct content) how do I prevent this from hurting me in your rankings? If not, then how do you protect yourself from other sites taking advantage of this "hole"? Meaning I could make my site appear legit when I knew your bots were crawling me, but give "alternate" content when real users were visiting.

      Last question. Do you have any idea when your ads will work correctly with XHTML?

      Hope you weren't expecting a straightforward answer (like I was), because here is what I got back:

      Hello Jonathon,

      Thanks for your email about the website you're creating.

      First, since you asked when our program will support XHTML, I wanted to
      let you know that we're unable to say if we'll support XHTML pages in the
      future.

      While the AdSense team isn't able to answer your questions about your
      site's ranking in the Google search index, I'd recommend visiting
      http://www.google.com/support/webmasters . I also wanted to let you know
      that our advertising programs are independent of our search results.
      Participation in AdWords and AdSense doesn't affect inclusion or ranking
      in the Google search index.

      I've also included answers to some of the most common questions AdSense
      publishers have asked.

      How can I improve my site's ranking?
      Answer:
      http://www.google.com/support/webmasters/bin/answer.py?answer=34432&hl=en_US

      How do I add my site to Google's search results?
      Answer:
      http://www.google.com/support/webmasters/bin/answer.py?answer=34397&hl=en_US

      My site is no longer included in the search results. What happened?
      Answer:
      http://www.google.com/support/webmasters/bin/answer.py?answer=34443&hl=en_US

      Why doesn't my site show up for a specific keyword?
      Answer:
      http://www.google.com/support/webmasters/bin/answer.py?answer=34434&hl=en_US

      For additional questions, I'd encourage you to visit the AdSense Help
      Center (http://www.google.com/adsense_help), our complete resource center
      for all AdSense topics. Alternatively, feel free to post your question on
      the forum just for AdSense publishers: the AdSense Help Group
      (http://groups.google.com/group/adsense-help).

      Sincerely,

      Jake
      The Google AdSense Team
      --
      When I have a kid, I want to put him in one of those strollers for twins and then run around the mall looking frantic.
  6. Does PageRank count? by matr0x_x · · Score: 2, Interesting

    As a self-proclaimed SEO expert - I honestly don't believe PageRank counts nearly as much as it did a few years ago! You'll find lots of PR5 sites ahead of PR9 sites in the SERPs!

    --
    LINUX ONLINE POKER: Linux Poker
    1. Re:Does PageRank count? by Trieuvan · · Score: 4, Insightful

      The PageRank that's reported by the toolbar is really old. Google never wants to let you know the real number, or it would be easy to spam ...

    2. Re:Does PageRank count? by Anonymous Coward · · Score: 2, Funny

      Concentrate on SERPs, not PR, ASAP for SEO on the WWW

      I searched on Google but I cannot find what "on", "not", "for" and "the" mean...

  7. Pagerank by Skythe · · Score: 5, Funny

    Because about 95% of the text on the 25 billion pages indexed by Google consist of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods.

    They use a set of nested if-else statements
    *ducks*

  8. Re:Pagerank is cool by silentounce · · Score: 5, Interesting

    Interestingly enough, google thinks so, too.

    Of course, yahoo has its own opinion.
     
    Although, altavista seems to almost agree. Check the second non-advertised result.
     
    I do find this amusing though. Third place, how humble.
     
    I didn't expect such interesting results. The site with the search term in its url was tops for av and yahoo, but not google. Yahoo ranked the wiki entry above google, but av reversed that decision, google of course thought itself was more important than the wiki. Google's own reference site was number one in its own search and near the top in the other two, but pagerank.net wasn't even in the top 10 for google's search. I'm not sure what conclusions can be drawn from all that, but it is definitely food for thought.

    --
    There are many tongues to talk, and but few heads to think. -Victor Hugo
  9. you forgot.. by gfody · · Score: 4, Funny

    ORDER BY adcost DESC

    --

    bite my glorious golden ass.
  10. Only three articles about Google on one page? by colourmyeyes · · Score: 3, Funny

    I think we can get four or five tomorrow.

    --
    My grandmother used anecdotal evidence all the time, and she lived to be 120 years old.
  11. It's the World's Largest Matrix Computation by MadMagician · · Score: 2, Informative

    For a different, somewhat more technical, but more succinct discussion, Cleve Moler [of Matlab fame] wrote another view of this topic, about 5 years ago.

    The math is the same, of course, but two points of view may provide a greater sense of perspective. So to speak. And Cleve is always worth listening to.

  12. Pages that don't exist anymore by namco · · Score: 2, Interesting

    I've seen links in Google searches to pages that don't exist anymore but were ranked highly when they DID exist, and they still sit in the top 10 for the query. What happens to those? Do they stay at their ranking till they get overtaken by other, more popular pages on the same search? Or do they get their ranking slowly reduced because they don't exist?