Slashdot Mirror


Computing PageRank on your PC?

An anonymous reader writes "A group of CS researchers of the University of Milan has found a way to compress web graphs at 3 bits per link, and to access them in compressed form. They provide data sets representing real snapshots of portions of the web with one hundred million nodes and 1 billion links. You just need some bandwidth to download a few hundred megabytes of data, and you can compute PageRank with your PC. All the code involved is GPL'd, and the data are public: everybody can grok PageRank now!"
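
The summary does not say how the compression works. One ingredient commonly used for compact adjacency lists is gap encoding: sort each page's successors, store the differences between neighbours, and write small differences with a short variable-length code. The sketch below assumes an Elias gamma code and a zigzag mapping for the first gap; the function names are hypothetical, and this is not claimed to be the researchers' actual scheme.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma code is defined for n >= 1")
    binary = bin(n)[2:]                      # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary  # unary length prefix + value


def gamma_decode(bits: str, pos: int) -> tuple[int, int]:
    """Decode one gamma-coded integer starting at pos; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


def encode_successors(node: int, successors: list[int]) -> str:
    """Encode a successor list as gamma-coded gaps (assumed scheme)."""
    bits = []
    prev = None
    for s in sorted(set(successors)):
        if prev is None:
            # The first successor may be smaller than the source node,
            # so map the signed difference to a non-negative integer.
            diff = s - node
            zigzag = 2 * diff if diff >= 0 else -2 * diff - 1
            bits.append(gamma_encode(zigzag + 1))
        else:
            bits.append(gamma_encode(s - prev))  # always >= 1
        prev = s
    return "".join(bits)


def decode_successors(node: int, bits: str) -> list[int]:
    """Inverse of encode_successors."""
    out = []
    pos = 0
    prev = None
    while pos < len(bits):
        value, pos = gamma_decode(bits, pos)
        if prev is None:
            zigzag = value - 1
            diff = zigzag // 2 if zigzag % 2 == 0 else -((zigzag + 1) // 2)
            prev = node + diff
        else:
            prev += value
        out.append(prev)
    return out


if __name__ == "__main__":
    succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]
    encoded = encode_successors(15, succ)
    print(len(encoded) / len(succ), "bits per link in this toy list")
```

Links on real pages tend to point at nearby pages in URL order, so most gaps are small and cost only a few bits each; that locality is what makes very low per-link figures plausible.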

34 of 186 comments

  1. The major thing missing from Mozilla by Anonymous Coward · · Score: 5, Interesting

    Is a way to look at Google's pagerank. That's the only real thing the IE Google toolbar has over the Mozilla alternative.

    1. Re:The major thing missing from Mozilla by Anonymous Coward · · Score: 3, Informative

      http://googlebar.mozdev.org

    2. Re:The major thing missing from Mozilla by wherley · · Score: 3, Informative

      Tried it...but it provides no pagerank. They say:
      "We currently have no plans to implement pagerank"

      Still - a cool addition to mozilla.

  2. This sounds cool.. by xchino · · Score: 5, Funny

    Now if I can just think of a reason why I would need this..

    --
    Everyone is entitled to their own opinion. It's just that yours is stupid.
    1. Re:This sounds cool.. by Daniel_Staal · · Score: 5, Funny
      Now if I can just think of a reason why I would need this..

      And you call yourself a geek. *Sigh*.

      It doesn't matter why you need it. It's technical, GPLed, and has to do with Google. That's all the reason you need.

      --
      'Sensible' is a curse word.
    2. Re:This sounds cool.. by (trb001) · · Score: 5, Funny

      It's technical, GPLed, and has to do with Google

      It's a geek hattrick!

  3. Dumb Question: by Xesdeeni · · Score: 5, Interesting

    What's Page Rank? Does this indicate how often my page is visited?

    Xesdeeni

    1. Re:Dumb Question: by Chris_Stankowitz · · Score: 5, Informative

      Do you mods ever stop to wonder if this guy could have been asking a legit question? It's possible he doesn't know. It's also possible that others don't. I know, I know, this is /., how could he not know, right? It is still very possible. I'm not saying he should have been modded up, but by modding him down, someone may miss the chance to read his post and reply to it with an intelligent answer. All of that being said, I would answer his question, but now that I think about it, I'm not sure what it is. I 'think' I know, but I think he and I are in the same boat. I also thought about posting this as an AC, but I won't; then surely someone would just think that it was the original poster posting as an AC. He may be trolling. He may not be. It won't hurt to answer the question.

    2. Re:Dumb Question: by Anonymous Coward · · Score: 5, Funny

      Jesus, you created a second account just to defend yourself!

    3. Re:Dumb Question: by sig+cop · · Score: 3, Funny
      I didn't know what it is either.

      Mod parent and grandparent and great-grandparent up.

      Also, mod parent's children up.

      Also, mod great-great-grandparent's great-great-granddaughters up.

      Also, say up unto them verily, that the mod of the parent will be cast down the generations to be a mod on the children, and on the children's children, and on the children's children's chilluns.

      And also, mod up the nephews of the parents of the siblings of the grandparent, for though they be trolls or flamebait, they are righteous in the eyes of the moderators.

      And thou shalt visit the mods onto the descendants on through the generations, for I, your Mod, have smote upon thee a mod pestilence that shalt not be lifted until the second coming of the JonKats.

      Thanks be to Mod, Amen

  4. Tee Hee by teamhasnoi · · Score: 5, Funny
    I bet the Searchking is steaming right about now...

    "Finally, proof!!"

  5. Some webmasters/SEO's are obsessive by Anonymous Coward · · Score: 5, Interesting

    If Google tweaks one thing, causing result 97 to shift to result 98, they notice. They'd be doing this daily to check on their pages.

  6. Which sites are the Root(s)? by amembleton · · Score: 5, Interesting

    When these Web Graph or Page Rank things are drawn up which sites do they use as the roots?

    I mean they've got to start with some site(s) and then go through each link from there.

    1. Re:Which sites are the Root(s)? by warkda+rrior · · Score: 5, Informative

      It is a graph, not a tree, so there is no one root. Maybe you are looking for the seed site, i.e. the first site added to the webgraph they construct. You can choose any site you prefer, although something well-connected is better. It seems to me that Yahoo! would be a good starting point.
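
To make the seed idea concrete, here is a minimal sketch of a breadth-first walk over a link table held in memory. The `links` dictionary and the page names are made up for illustration; a real crawler would fetch pages over the network instead.

```python
from collections import deque


def crawl(links: dict[str, list[str]], seed: str) -> list[str]:
    """Return pages in the order a breadth-first walk from `seed` discovers them."""
    seen = {seed}
    order = []
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # cycles are fine: each page is visited once
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    # Hypothetical toy web: "d" links in, but nothing links to "d".
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
    print(crawl(links, "a"))
    print(crawl(links, "d"))
```

Starting from "a" never reaches "d", while starting from "d" reaches everything, which is why a well-connected seed matters.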

      --
      You need to install an RTFM interface.
  7. beyond PageRank... by rfischer · · Score: 3, Interesting

    ... I would be interested in how the links change over time. Maybe take a new snapshot every day or week, see the web evolve.

  8. PageRank is part of Google's algo by Anonymous Coward · · Score: 5, Informative

    It's basically a measure of how well linked-to your page is, how well linked-to the pages linking to you are, and so on. It's an advanced form of link popularity. The idea is that the more people that link to something, the more influential/important it is. Some sites have high PageRanks of 10 (like Google), while Slashdot is something like an 8. Many pages are in the 4-6 range. Every link you create is like a "vote" for another web page.
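
The "votes passed along by votes" idea can be sketched as a simple iteration: every page splits its current score among the pages it links to, and the process repeats until the scores settle. The version below is a minimal sketch, not Google's production algorithm; the damping factor of 0.85, the fixed iteration count, and the even spreading of score from pages with no outgoing links are assumptions made for illustration. The raw scores it produces are fractions that sum to 1, not the 0-10 toolbar scale.

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Iteratively compute a PageRank-style score for every page in `links`."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start with equal scores
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links.get(v, [])
            if out:
                share = damping * rank[v] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages.
                share = damping * rank[v] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy web: "a" and "b" both link to "c"; "c" links to "a".
    scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
    for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In the toy example "c" ends up on top because two pages vote for it, and "a" beats "b" because its single vote comes from the highest-scoring page.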

    1. Re:PageRank is part of Google's algo by e2d2 · · Score: 4, Informative

      Google's PageRank was actually named after Larry Page, the creator of their system for ranking pages. Pun was obviously intended.

  9. I can see it now... by AyeRoxor! · · Score: 5, Funny

    "[...] even on a PC with as little as 256 Mbytes of RAM."

    Somewhere in 1980, milk shoots out of Bill Gates' nose for no apparent reason.

  10. Google with feedback by Sanity · · Score: 3, Interesting
    Doesn't Google have a patent on PageRank?

    Anyway, forgive the opportunism, but this is reasonably on-topic. Last weekend I set myself the ambitious task of improving on Google. I came up with a Google front-end which allows you to give feedback on the quality of search results, and thus refine your search. I could really use people's help to test it out - you can find it here. Feedback would really be appreciated.

    1. Re:Google with feedback by YoJ · · Score: 3, Insightful
      The whole point of patents is to encourage inventors to publish their inventions in a safe way. In some respects, PageRank is a good example of how the system is supposed to work. They publish the algorithm, people examine it and experiment further with it, but the inventors still have protection against people ripping off their work.

      The problem is that the GPL does not allow distribution of patent-encumbered technology. The authors of the code in question have every right to release their code with whatever license they want (I believe this is a free-speech issue, especially since the purpose of releasing the code is for doing research). People who receive their code may not use the code in a way that violates the patent, and in addition may not redistribute the code at all (since it would violate the GPL).

      The other issue is that PageRank is really a mathematical formula, and as such is unpatentable. What they actually patented is an algorithm for computing PageRank. If someone finds another way of computing the same formula, I think the patent holders would have a very hard time showing infringement.

  11. This is good, but... by Prince_Ali · · Score: 5, Funny
    This is good, but I'd rather have the google cache compressed to 3 bits per page.

    "I'll be there in a minute! I'm downloading the Internet!"

  12. Google patents? by PaulBu · · Score: 4, Interesting

    All the code involved is GPL'd, and the data are public: everybody can grok PageRank now!

    GPL'd? Hmm, I thought that Google did patent the PageRank algorithm (correct me if I am wrong), so re-implementing THEIR algorithm even more efficiently would be incompatible with the GPL. OTOH, if it is not THEIR algorithm, it cannot be called 'PageRank'.
    Oh, the evils of software patents...
    Paul B.

    1. Re:Google patents? by JoeBuck · · Score: 3, Interesting

      Google hasn't exactly patented the algorithm for all uses, and no court has determined that the code infringes the patent, and software patents aren't valid in most countries, so it's not clear whether or not there is any compatibility.

      It would seem that anyone who uses the code to build a search engine would be infringing, but even that is something that lawyers can argue about.

  13. Doesn't actually calculate PageRank? by Vultan · · Score: 5, Informative

    As best as I can tell from the website, the API is only for storing and interacting with a large graph. Nothing there is actually involved with PageRank. You could use this API presumably to write your own PageRank code, but to say "everybody can grok PageRank now!" is misleading at best.

    Moreover, IANAL, but isn't the PageRank algorithm patented by Google? Wouldn't this prevent anyone from releasing GPL code that computes PageRank?

  14. Re:why is rank/rating necessary? by TopShelf · · Score: 4, Funny

    You get sort of a self reinforcing cycle of wankage...

    For a second there I thought you were just talking like Elmer Fudd! "wating and wanking incwease the welevance of pagewanking..."

    --
    Stop by my site where I write about ERP systems & more
  15. has to be said by madHomer · · Score: 4, Funny

    It's just not the same without the pigeons...

  16. Proof of concept only by Saganaga · · Score: 5, Informative

    I think this project is really just a proof of concept. As another post pointed out, to make this really useful you'd need to regularly update your local data set, which isn't very practical for most people.

    Also, if the downloadable dataset only covers a small portion of the web, how can this system's utility really compare to Google's?

    That said, I think computer science proof-of-concept type projects are very useful and serve a valuable purpose in getting the ideas out there for others to improve upon.

  17. Google's algorithms have changed quite a bit by HiKarma · · Score: 3, Insightful

    Since their original papers, according to all posted reports. So I don't think you're really going to get the exact google number from a basic algorithm and this data set.

    They also use terms that appear in links as a major key in ranking searches.

    (Among other things.)

    Not that it is not interesting to see these rankings, and note the most widely linked to sites on the net.

    Which, by the way, after the obvious winners like Yahoo, include Adobe and Real networks, which have gotten immense numbers of sites to link to them with "Get acrobat reader" style links.

    I've often wondered if the makeashorterlink and tinyurl folks are doing it just for the googlejuice.

    In reverse, many sites now use javascript links in order to preserve their googlejuice.

    Very much a heisenberg phenomenon here.

  18. Re:What a mess by dpbsmith · · Score: 4, Informative

    Just in case this wasn't an implied rhetorical question... the term, as far as I know, was invented by Robert Heinlein in his novel _Stranger in a Strange Land,_ where it is an expression used by Martians. It literally means "to drink," but the Martians use it to mean an understanding that is both very deep and very complete.

  19. I wonder... by crashnbur · · Score: 4, Interesting

    ...how this can be used to discover the percentage of broken links on the web at any given moment in time.

  20. Ask and ye shall receive... by Theaetetus · · Score: 4, Informative
    and I say "Dammit, where are all the pretty pictures."

    Here (for free)

    Here too (for free)

    This one too (for free)

    This one also (free)

    And don't forget this classic ($30 poster)

    -T

  21. Re:can anyone explain what a web graph is? by lordbrain · · Score: 5, Informative

    A graph is made up of two things, edges and vertices.

    In a web graph, vertices are webpages and edges are hyperlinks.

    PageRank starts from how many incoming edges a vertex has (weighted by the rank of the vertices those edges come from). Given the nature of the web, this is a nontrivial problem because a vertex only knows its outgoing edges.

    The assumption for PageRank is that the more incoming edges a vertex has, the more popular it is. So you would use this to figure out how popular a particular vertex is.

    Given this you could do like Google and combine it with a search engine to prioritize the results.
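
The point about a vertex only knowing its outgoing edges can be shown in a few lines: to learn who links to a page, you have to scan every other page's outgoing list and invert it. This is a minimal sketch with a made-up `links` dictionary, and plain in-link counting is only the simplified picture described above.

```python
def transpose(links: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn outgoing-edge lists into incoming-edge lists."""
    incoming: dict[str, list[str]] = {v: [] for v in links}
    for source, targets in links.items():
        for target in targets:
            # A target may never appear as a source, so create it on demand.
            incoming.setdefault(target, []).append(source)
    return incoming


if __name__ == "__main__":
    # Hypothetical toy web graph: vertex -> pages it links to.
    links = {"a": ["b", "c"], "b": ["c"], "c": []}
    incoming = transpose(links)
    for page in sorted(incoming, key=lambda p: -len(incoming[p])):
        print(page, "has", len(incoming[page]), "incoming links")
```

The inversion needs one pass over every edge in the graph, which is why having the whole web graph in a compact, locally accessible form matters.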

    --

    Thank you. Thank you. Please no applause; just throw money
  22. Re:does this mean... by JamesOfTheDesert · · Score: 4, Funny
    ... pear search

    ... to find the fruits of your labor?

    What a grape idea! Orange you glad you thought of it?

    .

    .

    .

    Ok. Groan fest is over.

    --

    Java is the blue pill
    Choose the red pill
  23. Re:why is rank/rating necessary? by baka_boy · · Score: 4, Interesting

    I really shouldn't rise to this bait, but I can't resist: yes, given the choice between those networks, I would choose PBS. Just as I would take a non-profit-driven Internet, public radio over Clear Channel and its ilk, and community mesh wireless networks over 3G mobile phone service.

    Google has been, so far at least, a rare exception in the world of privatized communications utilities, by consistently showing an amazing lack of intention to lock people into their service, using either exclusivity agreements of some sort or the simple expedient of proprietary technology (i.e., "increase your PageRank by 10% if you support new encrypted GoogleML tags on your site!"). Nothing is permanent, though, and as we all know, single points of failure are a no-no.

    So, to bring all this back somewhere in the general neighborhood of the main story: further distributing the capability to build "mini-Googles", or specialized, community-maintained (but still fairly large-scale in terms of number of pages and links indexed) search tools is very interesting, and a useful body of technology to perpetuate.

    Or, even more generally, the technology needed to do large-scale storage, analysis, and manipulation of directed graph structures is a very useful tool. Software analysis often relies heavily on large graphs showing dependencies, caller-callee relationships, variable accesses, etc., as do any number of AI subdomains like knowledge representation and planning systems.