Computing PageRank on your PC?
An anonymous reader writes "A group of CS researchers of the University of Milan has found a way to compress web graphs at 3 bits per link, and to access them in compressed form. They provide data sets representing real snapshots of portions of the web with one hundred million nodes and 1 billion links. You just need some bandwidth to download a few hundred megabytes of data, and you can compute PageRank with your PC. All the code involved is GPL'd, and the data are public: everybody can grok PageRank now!"
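A quick sanity check on the "few hundred megabytes" figure, using only the numbers in the summary (1 billion links at 3 bits per link). The helper name is made up for illustration, and the result covers link data only, not URLs or any other metadata in the actual downloads.

```python
def compressed_size_mb(links: int, bits_per_link: float) -> float:
    """Size in megabytes (10**6 bytes) of a graph stored at the given bits-per-link rate."""
    total_bits = links * bits_per_link
    total_bytes = total_bits / 8
    return total_bytes / 1_000_000


# 1 billion links at 3 bits each, as stated in the summary.
print(compressed_size_mb(1_000_000_000, 3))  # 375.0
```

That works out to roughly 375 MB of link data, which fits in RAM on a well-equipped desktop.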
Is a way to look at Google's pagerank. That's the only real thing the IE Google toolbar has over the Mozilla alternative.
Now if I can just think of a reason why I would need this..
Everyone is entitled to their own opinion. It's just that yours is stupid.
What's Page Rank? Does this indicate how often my page is visited?
Xesdeeni
"Finally, proof!!"
If Google tweaks one thing, causing result 97 to shift to result 98, they notice. They'd be doing this daily to check on their pages.
I wonder, if this goes as planned, is it the end of search engines and the beginning of peer-to-peer search?
The IT section color scheme sucks.
When these Web Graph or Page Rank things are drawn up which sites do they use as the roots?
I mean they've got to start with some site(s) and then go through each link from there.
There are more links than that just on Microsoft's Support page. Although I don't know if you can call them links if they only send you around in a circle.
... I would be interested in how the links change over time. Maybe take a new snapshot every day or week, see the web evolve.
It's basically how well linked to your page is, and how well linked to the pages linking to you are, and so on. It's an advanced form of link popularity. The idea is that the more people that link to something, the more influential/important it is. Some sites have high PageRanks of 10 (like Google), while Slashdot is something like an 8. Many pages are in the 4-6 range. Every link you create is like a "vote" for another web page.
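For the curious, here is a minimal sketch of the "votes" idea as the textbook power-iteration form of PageRank. It is not Google's production code: the damping factor of 0.85, the iteration count, and the even spreading of rank from pages with no outgoing links are common conventions assumed here. The scores it produces are probabilities that sum to 1, not the 0-10 toolbar values mentioned above, which this sketch does not try to reproduce.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to about 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links votes for everyone equally.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In the toy graph, "c" ends up on top because two pages link to it. "a" comes second because its only inbound link comes from the highly ranked "c", and "b" comes last because nothing links to it.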
"[...] even on a PC with as little as 256 Mbytes of RAM."
Somewhere in 1980, milk shoots out of Bill Gates' nose for no apparent reason.
This is what we need to talk about at our little IRC chat session tonight, commander.
web graphs at 3 bits per link, that's a paddling...
compute PageRank with your PC, that's a paddling....
groking PageRank, you better believe that's a paddling...
Anyway, forgive the opportunism, but this is reasonably on-topic. Last weekend I set myself the ambitious task of improving on Google. I came up with a Google front-end which allows you to give feedback on the quality of search results, and thus refine your search. I could really use people's help to test it out - you can find it here. Feedback would really be appreciated.
"I'll be there in a minute! I'm downloading the Internet!"
Slashdotters are stupid and biased.
If these are snapshots then you'll need to keep downloading them for your Page Rank system to be up to date. The web is constantly changing and therefore so is Page Rank. I can't see having a data set on your computer being all that useful, as it'll soon expire.
It would be far better to be able to link to a data set via XML and query it. That way you would have live, up-to-the-minute Page Ranks. I know that Google already does a live Page Rank system, but being able to access it and query it would be useful.
All the code involved is GPL'd, and the data are public: everybody can grok PageRank now!
GPL'd? Hmm, I thought that Google did patent the PageRank algorithm (correct me if I am wrong), so re-implementing THEIR algorithm even more efficiently would be incompatible with GPL. OTOH, if it is not THEIR algorithm, it can not be called 'PageRank'
Oh, the evils of software patents...
Paul B.
As best as I can tell from the website, the API is only for storing and interacting with a large graph. Nothing there is actually involved with PageRank. You could use this API presumably to write your own PageRank code, but to say "everybody can grok PageRank now!" is misleading at best.
Moreover, IANAL, but isn't the PageRank algorithm patented by Google? Wouldn't this prevent anyone from releasing GPL code that computes PageRank?
PageRank is patented, isn't it?
You get sort of a self reinforcing cycle of wankage...
For a second there I thought you were just talking like Elmer Fudd! "wating and wanking incwease the welevance of pagewanking..."
Stop by my site where I write about ERP systems & more
I don't think this is pagerank, reading the link, this looks more like another rating system that is similar to pagerank. It's great for study, but I don't think reading through the source and finding ways to 'trick' this algorithm will necessarily work on Google. Correct me if I'm wrong someone.
Go away, or I will replace you with a very small shell script.
It's just not the same without the pigeons...
...it isn't on a fat pipe, so please understand if it's slow.
I think this project is really just a proof of concept. As another post pointed out, to make this really useful you'd need to regularly update your local data set, which isn't very practical for most people.
Also, if the downloadable dataset only covers a small portion of the web, how can this system's utility really compare to Google's?
That said, I think computer science proof-of-concept projects are very useful and serve a valuable purpose in getting the ideas out there for others to improve upon.
It could have receded back into the depths and maintained quality but it put page-ranking first, attempting to attract and contain a particular audience.
I disagree. In case you haven't noticed, the title of the /. front page is "News for Nerds, Stuff that Matters." So, of course /. is attracting a particular audience. That's a Good Thing.
Target audience is one of the most important decisions when designing a web site. "Good info" is a subjective concept. What's good to you is not necessarily good to me. But, chances are if I search for something that I'm looking for, PageRank can provide a sense of the more authoritative pages for that subject.
Also, putting stuff up for popularity's sake is a great reason to put something up. If I didn't want my employer's site to be seen, I wouldn't have put it up there. Attracting eyeballs is the only way to get good info. The more eyeballs, the better the accuracy of information. Why do you think peer review is such a big deal in scientific arenas (and it is, as I know from working for a big-name medical school)? If I was a scientist reviewing another scientist's work, then I would look at the writing aspects of the work. A little bit of style often makes information more credible to others. Don't ask me why, just know that it's human nature.
It uses Slashdot as a root, of course. ;)
Seriously, I don't know. Here's a page on how Google works though.
http://www.google.com/technology/index.html
Though after a quick read, I can see why ...
Maybe take a new snapshot every day or week, see the web evolve.
How much time do you think is needed to take a snapshot of the Web? Most certainly much longer than a day or even a week. My bet would be several months at the very least.
-- Repeat with me: "There is no right to profits".
Sure, now everybody can grok PageRank, but I, for the life of me, cannot grok grok.
Cyde Weys Musings - Scrutinizing the inscrutable
It doesn't mean a lot to me when my brother says he is going to double his efforts to find a job. This is especially true if you know my brother.
Since their original papers, according to all posted reports. So I don't think you're really going to get the exact google number from a basic algorithm and this data set.
They also use terms that appear in links as a major key in ranking searches.
(Among other things.)
Not that it is not interesting to see these rankings, and note the most widely linked to sites on the net.
Which, by the way, after the obvious winners like Yahoo, include Adobe and Real networks, which have gotten immense numbers of sites to link to them with "Get acrobat reader" style links.
I've often wondered if the makeashorterlink and tinyurl folks are doing it just for the googlejuice.
In reverse, many sites now use javascript links in order to preserve their googlejuice.
Very much a heisenberg phenomenon here.
...how this can be used to discover the percentage of broken links on the web at any given moment in time.
Your feedback is local to your search, it doesn't affect other people's searches.
and I say "Dammit, where are all the pretty pictures."
I get this from the article: "A set of flat codes, called ζ codes, which are particularly suitable for storing web graphs (or, in general, integers with power-law distribution in a certain exponent range). The fact that these codes work well can be easily tested empirically, but we also try to provide a detailed mathematical analysis."
Maybe it's my ADD. Maybe it's my inherent dumbassedness... but I can't grok that.
So what is a web graph? How is that related to PageRank? If I download all this data, what the hell do I use it for?
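For what it's worth: a web graph is a directed graph with pages as nodes and hyperlinks as edges, and link analysis such as PageRank runs on exactly that structure. The sketch below shows the general idea behind compact integer codes for such graphs. It stores each page's sorted list of link targets as gaps between consecutive ids and writes the gaps with a variable-length code, so small gaps cost few bits. Note the substitution: it uses the Elias gamma code, a well-known universal code, as a stand-in for the ζ codes from the quote, whose definition is not given there. The function names are made up for illustration, and this is not WebGraph's actual on-disk format.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of an integer n >= 1, returned as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma code is only defined for n >= 1")
    binary = bin(n)[2:]
    # (length - 1) zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def encode_successors(successors):
    """Encode a sorted list of distinct non-negative node ids as gamma-coded gaps."""
    bits = []
    previous = -1
    for node in successors:
        gap = node - previous  # always >= 1 for sorted, distinct ids
        bits.append(gamma_encode(gap))
        previous = node
    return "".join(bits)


def decode_successors(bits):
    """Invert encode_successors."""
    successors = []
    previous = -1
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        previous += gap
        successors.append(previous)
    return successors


targets = [1000, 1001, 1003, 1004]
encoded = encode_successors(targets)
print(len(encoded), "bits instead of", 32 * len(targets))
assert decode_successors(encoded) == targets
```

Links on a page tend to point at nearby ids when pages are numbered in URL order, so most gaps are tiny. Here only the first target is expensive, and the other three cost five bits in total.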
even if it was improved upon. Can the idea of ranking based on link popularity be patented? Did Google patent it? If not, how much longer before some asshole lawyer in Menlo Park or Amazon/MS/AOL does and tries to shut down Google?
because I have been enjoined by this Holy Office to abandon the false opinion which maintains that the Sun is the centre
And let's not forget... not all of us even get exposed to page rank regularly.
On my Mac for example, I can't see it at all. On my Wintel I can, thanks to the Google toolbar.
Here (for free)
Here too (for free)
This one too (for free)
This one also (free)
And don't forget this classic ($30 poster)
-T
If WebGraph can inflate their google PR to 10 then I'm a believer. Until then this looks like one of the many tools available to analyse your PR.
The WebGraph tool may be interesting for college students, but webmasters interested in SEO techniques are going to find little use for it.
I don't know what the big deal is. I've always been able to do pagerank on my computer...
Why reinvent the wheel?
--- "1.21 Jigawatts!" -Doc
I wonder if I can use pagerank algorithm for the smaller universe of my harddrive itself?
I have over 6,000 files on my PC, many of which link to each other, and I am adding more links between them as time goes by. The collection is now so big that I can't even revisit my own files and reason out the implications of the links between pages, because of the huge time it would take to spend even a minute on each saved file.
I wonder if something like PageRank would let the important files that are linked by many others on my PC rise "up" like the cream, so to say, so that I can avoid having to use keywords and categories to wade through all the clutter on my hard drive ...
Any other ideas of how to study the relationship between my 6,000+ files?
I also have quite a few articles, e.g. news items saved from the web itself. I wonder if the pagerank of google for those saved articles could also help me flush out the important "external" articles on my harddrive itself.
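A rough sketch of a first step, under assumptions the post does not confirm: that the files are HTML and that they link to each other by plain file name. It pulls `href` values out of each document with the standard-library `html.parser`, keeps only links that point at other files in the collection, and counts inbound links as a crude popularity score. All function names are made up for illustration, and the documents are passed in as a dict so the example runs without touching a disk.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.hrefs


def build_local_graph(documents):
    """documents maps file name -> HTML text; returns file name -> linked file names.

    Assumption: links between local files are written as bare file names.
    Links to anything outside the collection are dropped.
    """
    graph = {}
    for name, text in documents.items():
        graph[name] = [
            href for href in extract_links(text)
            if href in documents and href != name
        ]
    return graph


def inbound_counts(graph):
    """Number of distinct files linking to each file."""
    counts = Counter({name: 0 for name in graph})
    for targets in graph.values():
        counts.update(set(targets))
    return counts


docs = {
    "index.html": '<a href="notes.html">notes</a> <a href="todo.html">todo</a>',
    "notes.html": '<a href="todo.html">todo</a> <a href="http://example.com/">web</a>',
    "todo.html": "<p>no links here</p>",
}
graph = build_local_graph(docs)
print(inbound_counts(graph).most_common())
```

The resulting dictionary of file name to linked file names is an ordinary adjacency list, so any link-analysis routine, PageRank included, could be run over it instead of the raw inbound count.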
To see a world in a grain of sand, and then to step back and see the beach where the sand lies
That reminds me of Asimov's psychohistory and the Second Foundation: the First Foundation had to be unaware of the influence of the Second Foundation for it to work. Maybe that makes Searchking the Mule?
Have you ever developed a website and monitored SE traffic to it? PageRank is not a real time process. If you're running your own version of it, updating it once per month is more than enough.
SearchIRC - Now with live chat directory!
TV without Nielsen ratings would be better too, for similar reason.
We have TV without Nielsen ratings. We call it "PBS."
Is PBS better? Sometimes. Perhaps even often in recent years. Certainly no one has ever referred to PBS' content as mindless drivel, the way we talk about things like Survivor and American Idol.
But let me ask you this: If you could have only one TV station and you had to choose between ABC, CBS, FOX, NBC and PBS would you choose PBS? Didn't think so.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Don't think for a moment that google is not tracking and saving this.
Here's the algorithm:
.000001% of uncompressible pages, store it as is (full page follows).
000: page is spam. Ignore it.
001: page is porn. Porn is all the same, show porn page from disk.
010: page is pop-up ad. Block it.
011: page is a 404.
100: page has javascript. Show random javascript error.
101: page is Slashdot.
110: page is Slashdot.
111: page belongs to the
See charts for twitter trends on Trendistic
Anyone know of anything like that?
|>>?
I really shouldn't rise to this bait, but I can't resist: yes, given the choice between those networks, I would choose PBS. Just as I would take a non-profit-driven Internet, public radio over Clear Channel and its ilk, and community mesh wireless networks over 3G mobile phone service.
Google has been, so far at least, a rare exception in the world of privatized communications utilities, by consistently showing an amazing lack of intention to lock people into their service, using either exclusivity agreements of some sort or the simple expedient of proprietary technology (i.e., "increase your PageRank by 10% if you support new encrypted GoogleML tags on your site!"). Nothing is permanent, though, and as we all know, single points of failure are a no-no.
So, to bring all this back somewhere in the general neighborhood of the main story: further distributing the capability to build "mini-Googles", or specialized, community-maintained (but still fairly large-scale in terms of number of pages and links indexed) search tools is very interesting, and a useful body of technology to perpetuate.
Or, even more generally, the technology needed to do large-scale storage, analysis, and manipulation of directed graph structures is a very useful tool. Software analysis often relies heavily on large graphs showing dependencies, caller-callee relationships, variable accesses, etc., as do any number of AI subdomains like knowledge representation and planning systems.
Of course it's a hattrick. Two reasons just isn't impressive, and four is starting to get cumbersome (like you are droning on). Now 7-boredom are funny again too (depending on how funny the reasons themselves are), but I wasn't sure I could come up with seven on short notice. Three reasons was the optimum length for that post.
'Sensible' is a curse word.
Exactly how was my last post trolling?
Ripping a new rectum in the fabric of spacetime.