Slashdot Mirror


The Math Behind PageRank

anaesthetica writes "The American Mathematical Society is featuring an article with an in-depth explanation of the type of mathematical operations that power PageRank. Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods. And because the links constituting the web are constantly changing and updating, the relevance of pages needs to be recalculated on a continuous basis."

7 of 131 comments

  1. Nouns maybe? by Bryansix · · Score: 3, Insightful

    It seems like it would be the nouns, pronouns, etc. that Google should be paying attention to. Who cares about all the verbs, adjectives, etc. that just muddy the indexing waters?

    1. Re:Nouns maybe? by abshnasko · · Score: 2, Insightful

      Searching for "pill" and "the pill" should yield very different results. Yes, nouns are more important, but articles and other words cannot be disregarded.

  2. Re:10,000 words by Anonymous Coward · · Score: 0, Insightful

    Dear The Zon,

    You are not funny. Quit your lame endless attempts at humor.

    The Other 1.05 million readers

  3. Re:Does PageRank count? by Trieuvan · · Score: 4, Insightful

    The PageRank reported by the toolbar is really old. Google never wants to let you know the real number, or it would be easy to spam ...

  4. Re:Bad summary by martin-boundary · · Score: 4, Insightful

    It's nowhere near like that. A web matrix is very sparse, so if you did a true 25Bx25B matrix power iteration, you'd be multiplying zero by zero a gazillion times. Optimization is about not doing things you don't need to do, and optimizing PageRank is about figuring out clever ways to not do the full multiplication. Moreover, PageRank is calculated in parallel over a computer farm. Overall, you can expect a single iteration to take on the order of an hour, and you can expect around 50-80 iterations before Google gives up and says it's converged. You can also try and reuse the previous "converged" PageRank vector to cut down on the 50-80 iterations after you've crawled new pages.

    If Google used a single computer to do all the work, and truly did 80*25B^2 operations, they'd be morons.
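
    A minimal sketch of the sparse power iteration described above, in Python. It is an illustration under stated assumptions, not Google's implementation: the `pagerank` function, the 0.85 damping factor, the tolerance, and the four-page `toy_links` graph are all hypothetical choices made for the example. The loop only visits links that actually exist, so the cost per iteration scales with the number of links rather than with the square of the number of pages (a dense 80*25B^2 run would be about 5 x 10^22 operations), and the optional `start` argument reuses a previous rank vector as a warm start.

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iter=200, start=None):
    """Power iteration over an adjacency list (hypothetical helper).

    links maps each page to the list of pages it links to.
    Returns (rank dict, number of iterations performed).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)

    if start is None:
        # Cold start: every page gets the same rank.
        rank = {p: 1.0 / n for p in pages}
    else:
        # Warm start: reuse old ranks, give unseen pages 1/n, renormalize.
        rank = {p: start.get(p, 1.0 / n) for p in pages}
        total = sum(rank.values())
        rank = {p: r / total for p, r in rank.items()}

    for iteration in range(1, max_iter + 1):
        # Rank sitting on pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}

        # Only existing links are visited; the zero entries of the
        # web matrix are never multiplied.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share

        # L1 distance between successive vectors as the stopping test.
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break

    return rank, iteration


# Assumed toy graph: page "d" links out but nothing links to it.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks, iterations = pagerank(toy_links)

# Re-running from the converged vector needs fewer iterations.
warm_ranks, warm_iterations = pagerank(toy_links, start=ranks)
```

    At web scale the same idea would run on sparse matrices split across many machines; the dictionary version here is only meant to show why the zero entries never need to be touched.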

  5. Re:I joke a lot on Slashdot, but serious question by l0cust · · Score: 2, Insightful

    Thanks for the informative post. I have one question though. How does it help find the relevant information unless that information just happens to be on a popular page too? What I mean to say is that the idea behind grading/filtering systems like PageRank is to provide the most relevant information about the thing you are trying to search for on the net. Now suppose Mr. A is looking for some obscure Indian text written in Sanskrit and Mr. B has (recently or not) put up a website with that text as one of the contents, but it's not a popular blog site, nor a mainstream ebook source site, etc. And there are a gazillion hugely popular sites out there which mention that particular text while talking about a totally different book or text.

    So it means the only way that Mr. A will come across Mr. B's website is if he keeps looking through hundreds of result pages or if he just chances upon it via something he read about earlier. Doesn't that defeat the purpose of making the searches more relevant, especially since lots of webmasters actively use the PageRank system to get a better ranking in the search index? Where does that leave the people with worthwhile content but not much popular backing?

    (I would like to make clear that I am not trying to knock this system or anything; I am just curious about the implications for small-but-good-content website owners.)

    --
    Politicians and Pedophiles: Two groups of exploitive bastards who are most dangerous when they're thinking of children.

  6. Re:Shameless plug for my uni by Anonymous Coward · · Score: 1, Insightful

    Blah. Ugly red clothes... Go Bears!