Slashdot Mirror


The Math Behind PageRank

anaesthetica writes "The American Mathematical Society is featuring an article with an in-depth explanation of the type of mathematical operations that power PageRank. Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods. And because the links constituting the web are constantly changing and updating, the relevance of pages needs to be recalculated on a continuous basis."

14 of 131 comments

  1. 10,000 words by ambivalentduck · · Score: 5, Funny

    But 9,000 of those words are slang for parts of the human anatomy.  Go figure.

  2. PageRank doesn't seem to be based on keywords by dada21 · · Score: 4, Informative

    I have sites with a PR of 6, and I can tell you that they got that way because of inbound links from other sites. In fact, when other sites dropped those links, my PR dropped (to 5, and even to 4). Getting more inbound links brought the PR back.

    Think about those links, too. How often do you use common words in an HREF? I don't think there's a lot of weeding out of common words since the link to a site is usually either its name, or a description containing some important keywords.

    I love seeing these technoscientists think they understand PageRank, but just like TimeCube, they're way, way off.

    1. Re:PageRank doesn't seem to be based on keywords by Anonymous Coward · · Score: 3, Interesting

      It's not secret.

    2. Re:PageRank doesn't seem to be based on keywords by kimvette · · Score: 4, Funny
      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
  3. Bad summary by Knights+who+say+'INT · · Score: 5, Interesting

    The article specifically says the PageRank eigenvector is recalculated only about once a month. Even though Google uses some clever numerics to calculate the eigenvector of a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.

    1. Re:Bad summary by martin-boundary · · Score: 4, Insightful
      It's nowhere near like that. A web matrix is very sparse, so if you did a true 25Bx25B matrix power iteration, you'd be multiplying zero by zero a gazillion times. Optimization is about not doing things you don't need to do, and optimizing PageRank is about figuring out clever ways to not do the full multiplication. Moreover, PageRank is calculated in parallel over a computer farm. Overall, you can expect a single iteration to take on the order of an hour, and you can expect around 50-80 iterations before Google gives up and says it's converged. You can also try and reuse the previous "converged" PageRank vector to cut down on the 50-80 iterations after you've crawled new pages.

      If Google used a single computer to do all the work, and truly did 80*25B^2 operations, they'd be morons.
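
      The point about sparsity and reusing the previous vector can be sketched roughly as below. This is a minimal single-machine illustration, not Google's method: the function name `pagerank`, the damping factor of 0.85, the tolerance, and the toy three-page graph are all assumptions made for the example. Links are stored as adjacency lists, so each iteration touches only the links that exist rather than all n^2 matrix entries, and the optional `start` argument warm-starts the iteration from an earlier result.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200, start=None):
    """Sparse power iteration over adjacency lists.

    links maps each page to the list of pages it links to.
    damping, tol and max_iter are illustrative assumptions.
    Returns (rank dict, number of iterations used).
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    if start:
        # Warm start: reuse an earlier vector, filling in any new pages.
        raw = {p: start.get(p, 1.0 / n) for p in pages}
        total = sum(raw.values())
        rank = {p: v / total for p, v in raw.items()}
    else:
        rank = {p: 1.0 / n for p in pages}

    iterations = 0
    for iterations in range(1, max_iter + 1):
        new = {p: (1.0 - damping) / n for p in pages}
        dangling = 0.0
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Only existing links are visited: work is O(number of links).
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                dangling += rank[page]
        # Pages with no outgoing links spread their rank evenly.
        for p in pages:
            new[p] += damping * dangling / n
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank, iterations


# Hypothetical toy graph: a links to b and c, b links to c, c links to a.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
cold_rank, cold_iters = pagerank(toy_links)
warm_rank, warm_iters = pagerank(toy_links, start=cold_rank)
```

      On the toy graph, the warm-started run needs no more iterations than the cold one, which is the effect described above for re-running after a new crawl. A real deployment would partition the link lists across many machines; that part is not shown.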

  4. Nouns maybe? by Bryansix · · Score: 3, Insightful

    It seems like it would be the nouns, pronouns, etc. that Google should be paying attention to. Who cares about all the verbs, adjectives, etc. that just muddy the indexing waters?

  5. Re:Does PageRank count? by Trieuvan · · Score: 4, Insightful

    The PageRank reported by the toolbar is really old. Google never wants to let you know the real number, or it would be easy to spam ...

  6. Pagerank by Skythe · · Score: 5, Funny

    Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods.

    They use a set of nested if-else statements
    *ducks*

  7. Re:I joke a lot on Slashdot, but serious question by Anonymous+Brave+Guy · · Score: 5, Informative

    The underlying idea behind page rank is pretty well-exposed at this point, and is described in TFA. Essentially, it's a big set of simultaneous equations: each incoming link to your page gets a score that is roughly the rank of the source page divided by the number of outgoing links on that page, and then the rank of your page is roughly the sum of the scores of all incoming links.
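
    Written out, the simultaneous equations described above look roughly like this. The symbols are assumptions chosen for the illustration: r(P) is the rank of page P, B_P is the set of pages linking to P, and \ell(Q) is the number of outgoing links on page Q. Any damping or other adjustment is left out.

```latex
\[
  r(P) \;=\; \sum_{Q \in B_P} \frac{r(Q)}{\ell(Q)}
\]
% Made-up worked example: P has inbound links from Q_1 (rank 0.4, 2 outgoing
% links) and Q_2 (rank 0.1, 5 outgoing links).
\[
  r(P) \;=\; \frac{0.4}{2} + \frac{0.1}{5} \;=\; 0.2 + 0.02 \;=\; 0.22
\]
```

    There is one such equation per page, and each r(Q) on the right-hand side is itself defined by its own equation, which is why the whole system has to be solved together.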

    Various fudge factors are introduced along the way. For example, if you break Google's rules about displaying the same content to bots as to humans, you can get slapped right down. More subtly, newly registered domains take a modest hit for a while. More nobody-knows-ly, Google's handling of redirects is unclear: information about exactly what adjustments are made is pretty scarce, and there's a lot of conjecture around. One thing that's pretty certain is that they penalise for duplicate content, which is why some webmasters do apparently unnecessary things like redirecting http://www.theircompany.com/ to http://theircompany.com/ or vice versa.
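
    The www to non-www redirect mentioned above can be sketched as a small helper that decides where a request should be sent. This is only an illustration: the function name `canonical_redirect` is hypothetical, the host names are the ones from the comment, and a real site would normally do this in its web server configuration with a permanent (301) redirect.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_redirect(url, canonical_host="theircompany.com"):
    """Return the URL to redirect to, or None if the host is already canonical.

    Only the bare host and its "www." alias are handled; any port in the
    original URL is dropped. Both choices are simplifying assumptions.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    aliases = {canonical_host, "www." + canonical_host}
    if host in aliases and host != canonical_host:
        # Same scheme, path, query and fragment; only the host changes.
        return urlunsplit(
            (parts.scheme, canonical_host, parts.path, parts.query, parts.fragment)
        )
    return None
```

    With this in place, every alias of the site resolves to a single address, so the same content is never indexed under two host names.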

    So, if you want to get a page with a high rank yourself, then ideally you would get many established, highly-ranked pages to link to your page and no others. In your example, all those Geocities sites wouldn't help a lot, because (a) they'd have negligible rank themselves, and (b) they'd be penalised for being new and lose some of that negligible rank before they even started. Many times negligible is still negligible, and so would be your target page's rank. OTOH, get a few links from university sites, big news organisations and the like, and your rank will suddenly be way up there. Alternatively, get a grass-roots movement going where a gazillion individuals with small personal sites link to you, and the cumulative effect will kick in.

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  8. Re:Pagerank is cool by silentounce · · Score: 5, Interesting

    Interestingly enough, google thinks so, too.

    Of course, yahoo has its own opinion.
     
    Although, altavista seems to almost agree. Check the second non-advertised result.
     
    I do find this amusing though. Third place, how humble.
     
    I didn't expect such interesting results. The site with the search term in its URL was tops for AV and Yahoo, but not Google. Yahoo ranked the wiki entry above Google, but AV reversed that decision; Google, of course, thought it was more important than the wiki. Google's own reference site was number one in its own search and near the top in the other two, but pagerank.net wasn't even in the top 10 for Google's search. I'm not sure what conclusions can be drawn from all that, but it is definitely food for thought.

    --
    There are many tongues to talk, and but few heads to think. -Victor Hugo
  9. you forgot.. by gfody · · Score: 4, Funny

    ORDER BY adcost DESC

    --

    bite my glorious golden ass.
  10. Re:I joke a lot on Slashdot, but serious question by oni · · Score: 4, Interesting

    I notice many sites that do that and don't get slapped down - esp subscription sites.

    I wonder, if I changed my useragent to be whatever the googlebot reports itself to be - would I get by the registration screen on websites like the NYTimes??

  11. Only three articles about Google on one page? by colourmyeyes · · Score: 3, Funny

    I think we can get four or five tomorrow.

    --
    My grandmother used anecdotal evidence all the time, and she lived to be 120 years old.