Computing PageRank on your PC?
An anonymous reader writes "A group of CS researchers of the University of Milan has found a way to compress web graphs at 3 bits per link, and to access them in compressed form. They provide data sets representing real snapshots of portions of the web with one hundred million nodes and 1 billion links. You just need some bandwidth to download a few hundred megabytes of data, and you can compute PageRank with your PC. All the code involved is GPL'd, and the data are public: everybody can grok PageRank now!"
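The summary doesn't say how the graphs get down to a few bits per link. One common ingredient in adjacency-list compression is gap encoding: store each sorted list of link targets as the differences between consecutive node numbers, then write those small numbers with a variable-length code. Here is a minimal sketch of that idea, assuming sorted successor lists and the Elias gamma code. The function names are made up for illustration, and this is not necessarily what the Milan researchers' code does.

```python
def gamma_encode(n):
    """Elias gamma code for an integer n >= 1, as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("gamma code is only defined for n >= 1")
    binary = bin(n)[2:]
    # (length - 1) zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_successors(successors):
    """Gap-encode a strictly increasing list of non-negative node ids."""
    previous = -1  # assumption: ids start at 0, so the first gap is id + 1
    pieces = []
    for node in successors:
        pieces.append(gamma_encode(node - previous))
        previous = node
    return "".join(pieces)


def decode_successors(bits):
    """Invert encode_successors."""
    nodes = []
    previous = -1
    for gap in gamma_decode_all(bits):
        previous += gap
        nodes.append(previous)
    return nodes


links = [1000, 1001, 1002, 1005]
encoded = encode_successors(links)
print(len(encoded), "bits for", len(links), "links")
```

Pages tend to link to pages with nearby numbers when URLs are sorted, so most gaps are small and cost only a few bits each. The big first gap dominates this tiny example.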
If these are snapshots, then you'll need to keep downloading them for your PageRank system to stay up to date. The web is constantly changing, and therefore so is PageRank. I can't see a data set on your computer being all that useful, as it'll soon expire.
It would be far better to be able to link to a data set via XML and query it. That way you would have live, up-to-the-minute PageRanks. I know that Google already runs a live PageRank system, but being able to access it and query it would be useful.
It could have receded back into the depths and maintained quality but it put page-ranking first, attempting to attract and contain a particular audience.
I disagree. In case you haven't noticed, the title of the /. front page is "News for Nerds, Stuff that Matters." So, of course /. is attracting a particular audience. That's a Good Thing.
Target audience is one of the most important decisions when designing a web site. "Good info" is a subjective concept. What's good to you is not necessarily good to me. But chances are, if I search for something that I'm looking for, PageRank can provide a sense of the more authoritative pages for that subject.
Also, putting stuff up for popularity's sake is a great reason to put something up. If I didn't want my employer's site to be seen, I wouldn't have put it up there. Attracting eyeballs is the only way to get good info. The more eyeballs, the better the accuracy of information. Why do you think peer review is such a big deal in scientific arenas (and it is, as I know from working for a big-name medical school)? If I was a scientist reviewing another scientist's work, then I would look at the writing aspects of the work. A little bit of style often makes information more credible to others. Don't ask me why, just know that it's human nature.
The problem is that the GPL does not allow distribution of patent-encumbered technology. The authors of the code in question have every right to release their code with whatever license they want (I believe this is a free-speech issue, especially since the purpose of releasing the code is for doing research). People who receive their code may not use the code in a way that violates the patent, and in addition may not redistribute the code at all (since it would violate the GPL).
The other issue is that PageRank is really a mathematical formula, and as such is unpatentable. What they actually patented is an algorithm for computing PageRank. If someone finds another way of computing the same formula, I think the patent holders would have a very hard time showing infringement.
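For anyone who wants to see the formula in action: the commonly cited form is PR(p) = (1 - d)/N + d * (sum over pages q linking to p of PR(q)/outdegree(q)), where N is the number of pages and d is a damping factor. Below is a minimal sketch that solves it by plain repeated substitution (power iteration). The value d = 0.85 is the usual textbook choice, and spreading the rank of pages with no outgoing links evenly over all pages is my assumption. This is an illustration only, not the patented method or whatever Google runs in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages that appear only as link targets are included automatically.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        dangling = 0.0
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                dangling += damping * rank[node]
        for node in nodes:
            new_rank[node] += dangling / n
        rank = new_rank
    return rank


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(toy_web).items()):
    print(page, round(score, 4))
```

On a real snapshot you would stream the link lists from disk rather than hold a dict in memory, but the arithmetic per iteration is the same.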
Google has changed its algorithm since their original papers, according to all posted reports. So I don't think you're really going to get the exact Google number from a basic algorithm and this data set.
They also use terms that appear in links as a major key in ranking searches.
(Among other things.)
Not that it is not interesting to see these rankings, and note the most widely linked to sites on the net.
Which, by the way, after the obvious winners like Yahoo, include Adobe and RealNetworks, which have gotten immense numbers of sites to link to them with "Get Acrobat Reader" style links.
I've often wondered if the makeashorterlink and tinyurl folks are doing it just for the googlejuice.
In reverse, many sites now use JavaScript links in order to preserve their googlejuice.
Very much a Heisenberg phenomenon here.
The Google toolbar for IE has to ask google.com for the PageRank of each page you view, via XML-RPC. One of the fields in the XML-RPC request is a checksum. Without that checksum, google.com rejects the request. So it's just a matter of finding out how the toolbar calculates the checksum based on your URL. Then you could write a standalone (or Mozilla-based) tool for fetching PageRanks.