Slashdot Mirror


PageRank-Type Algorithm From the 1940s Discovered

KentuckyFC writes "The PageRank algorithm (pdf) behind Google's success was developed by Sergey Brin and Larry Page in 1998. It famously judges a page to be important if it is linked to by other important pages. This circular definition is the basis of an iterative mechanism for ranking pages. Now a paper tracing the history of iterative ranking algorithms describes a number of earlier examples. It discusses the famous HITS algorithm for ranking web pages as hubs and authorities developed by Jon Kleinberg a few years before PageRank. It also discusses various approaches from the 1960s and 70s for ranking individuals and journals based on the importance of those that endorse them. But the real surprise is the discovery of a PageRank-type algorithm for ranking sectors of an economy based on the importance of the sectors that supply them, a technique that was developed by the Harvard economist Wassily Leontief in 1941."

32 of 108 comments (clear)

  1. Good advice for all developers by ls671 · · Score: 4, Insightful

    Well, this is actually pretty good advice for any developer; Don't reinvent the wheel. Look around, search for what's been done before and adapt it to suit your needs. Of course, as a last resort, one can design something new once he has done his homework and made sure nothing that has been done before may be re-used.

    Through my life, I have seen a amazing high level of work that has been done in vain because it yielded poor results and that something doing the same better already existed anyway.

    Don't get me wrong here, once you have made sure that nothing already existing suits your needs or can be reused, it is fine to innovate and create real new stuff. Just don't get caught trying to reinvent the wheel unless you reinvent it better ;-)

    Also, an exception to that principle could be allowed for trivial tasks that are really quick to implement and where searching for an existing solution might cost more than implementing it yourself but be really careful applying that exception rule, it is an open door that leads to trying to reinvent the wheel sometimes ;-))

    --
    Everything I write is lies, read between the lines.
    1. Re:Good advice for all developers by cyphercell · · Score: 5, Funny

      It would have been a pretty exhaustive search without google.

      --
      Under the influence of Post-Cyberpunk Gonzo Journalism
    2. Re:Good advice for all developers by commodore64_love · · Score: 3, Informative

      They could have used one of the other search engines in existence in 1997-98, like Altavista or Lycos or Magellan or WebCrawler or Yahoo or Excite or Hotbot.

      --
      "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    3. Re:Good advice for all developers by pipatron · · Score: 4, Funny

      I can see that you must be young enough never to have used the search engines you list, if you suggest that you would have been able to find anything useful.

      --
      c++; /* this makes c bigger but returns the old value */
    4. Re:Good advice for all developers by Weirsbaski · · Score: 3, Insightful

      Well, this is actually pretty good advice for any developer; Don't reinvent the wheel. Look around, search for what's been done before and adapt it to suit your needs, and patent it.

      --

      I am not a sig.
    5. Re:Good advice for all developers by Junior+J.+Junior+III · · Score: 3, Interesting

      "don't reinvent the wheel" is kindof dumb advice when you think about it.

      If I didn't already have a wheel, it would take me a really long time to traverse the world in search of a wheel to see if it had been invented yet. If it has, and it's got sufficient penetration into the market that I know about them already, then, sure, it's a no brainer not to reinvent it. On the other hand, if no one in my immediate vicinity has ever heard of the wheel, then inventing one -- quickly -- is a lot smarter than traversing the known world until I run into a culture that already has wheels. Especially if they might exploit their superior technology to subjugate and enslave my people. Better to have a home-brewed shitty wheel to start off with, and upgrade quickly if I discover that there are other friendly cultures that have better wheels already, and have at least something if I don't discover anyone else, or discover hostiles who already have them.

      As long as the cart is loosely integrated with the wheels I have, upgrading to better wheels when they are found to be available should be easy. And I might just learn something about wheels while studying them that applies to other problems, or could even possibly improve the existing state of the art with respect to wheels.

      --
      You see? You see? Your stupid minds! Stupid! Stupid!
    6. Re:Good advice for all developers by Junior+J.+Junior+III · · Score: 2, Insightful

      Reinventing the wheel is a phrase that means to duplicate a basic method that has long since been accepted and even taken for granted.

      K, so how is Brin and Page developing PageRank when an obscure economics paper published at Harvard in 1941 and only re-discovered in 2010 reinventing the wheel?

      Would Page and Brin have gotten there faster if they'd rooted through Harvard's economics library in the hopes that the best-yet algorithm for search results ranking would be there, somewhere? Was the Harvard paper "long accepted and taken for granted" or was it "forgotten and ignored"? Is PageRank a "duplication of a basic method"?

      Personally, I think google got there quicker by re-inventing the wheel, if that's what this was. In my opinion, if something is only recognized as reinvention of the wheel after the fact, it ipso facto is not reinvention of the wheel in the sense described in the wikipedia article you cited.

      --
      You see? You see? Your stupid minds! Stupid! Stupid!
    7. Re:Good advice for all developers by ls671 · · Score: 2, Interesting

      > As for ranking pages by links to other important pages, that's rather flawed?

      I hinted that it was OK to innovate or even re-invent. You have to know what you are doing although.

      Actually, I totally agree with you on the quoted phrase but I am still looking for a solution (Holy Grail?) to supplant Google implementation. ;-))

      Seriously, I really spent some time thinking about this...

      --
      Everything I write is lies, read between the lines.
    8. Re:Good advice for all developers by Hurricane78 · · Score: 2, Interesting

      There is a second rule of advice that goes with this, but that unfortunately usually is forgotten:

      Don’t imitate. Innovate!

      Yes, it is a good idea to not reinvent the wheel. But it’s even better to invent an airplane! (You know: Thinking outside the box. “Inside the box” would be a square wheel. ;)

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    9. Re:Good advice for all developers by Midnight+Thunder · · Score: 2, Insightful

      I can see that you must be young enough never to have used the search engines you list, if you suggest that you would have been able to find anything useful.

      Well the results were there, somewhere between all the adverts.

      --
      Jumpstart the tartan drive.
    10. Re:Good advice for all developers by twosat · · Score: 3, Informative

      Reminds me of the invention of Turbo Codes in the early Nineties for forward error correction for communication networks. It was later discovered that in the early Sixties, low density parity check (LDPC) coding was developed that performed a similar function but was not used because of the lack of computer power and memory back then. The LDCP patents had expired by then, so now there are two technologies doing the same thing in a different way but one is patent-free. In a similar vein, I read some years ago of a company in the UK who search through expired and current patents looking for inventions that meet their customers' needs. They would often find solutions in a totally different field to the area being researched and a lot of it was stuff that was ahead of its time and its technology had been abandoned.

    11. Re:Good advice for all developers by commodore64_love · · Score: 2, Insightful

      Remember the internet was a LOT smaller in 1997, so while my preferred engine at that time (altavista) may not have been as good as google today, there was also a lot fewer websites to search through.

      >>>I can see that you must be young enough never to have used the search engines you list

      I'm familiar with the slashdot practice of not RTFA (reading the frakking article), but when did people stop reading user names? You think I'm young? My first computer was a Commodore and my second an Amiga. I was making music, producing primitive videos, and "surfing" the internet before the web even existed.

      --
      "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    12. Re:Good advice for all developers by b4dc0d3r · · Score: 2, Insightful

      Wow, +5 Pedantic.

      Reinventing the wheel implies that they set out to find a measure of importance. So they would have had to decide to make a search engine, have it produce relative results, decide that relative importance is the key, and then go searching for ways to find relative importance. There's a major gap there, a leap in thinking that simply wasn't present at the time. How important a page is indicates what order it should be in results. That makes sense, but it wasn't obvious at the time. Previous results were based on things like the number of times the page mentions your word, or what order the pages were added. People tried to figure out: what is that quality which makes a page more relevant than others? And they failed.

      The key was settling on Importance, or we could call it Charisma. When you mention the name "Brad", do more people think of the guy you work with first or Brad Pitt? Brad Pitt is more relevant (technically relevant to more people), but why? More people know of him, and more people speak of him. More importantly, more important people speak of him.

      Scientific papers have been measured for their influence factor this way (but that might be a false correlation: http://www.physorg.com/news165950992.html "... papers published early in a field receive citations essentially regardless of content because they are the only game in town.") If they had looked to science to see what makes something influential, they would have seen the same concept. But they didn't know they needed to find "influential", just "relevant".

      The breakthrough Google made was deciding on a quality which made things more relevant, which is roughly equivalent to notoriety. Not just the number of references to a page, but the weight of those references in relation to who references them. That's where link farming sprouted, and they had to figure a way to cancel that effect out.

      How many people know this page, as opposed to that page? And then they had to figure a way to find the pages, process the data, calculate notoriety, and serve it up quickly, and create a revenue stream from all of that. It's not about whether you're going to arrive more quickly by re-inventing the wheel. Often times you can, especially if it's in a language where you don't know all of the built-in functions. You can write a linked list with sorting faster than searching for how the language implements it under certain circumstances.

      It's about whether you will arrive better, at a better solution in other words. The point was to look elsewhere for implementation, but the part you missed was you have to have inspiration to know where to look.

      Google wanted to get something to lots of people. They might have had stuff in baskets. They could re-invent the wheel to make the baskets easier to transport, or they could build up a farm, attracting farm hands and their families and gradually build up a town, making the people come to Google instead. Once you know you need a transportation solution, it's easier to copy an existing idea. It just so happens that once you decide the solution to your problem is relative importance, there's a description of how to do it in a book from the 1940's.

    13. Re:Good advice for all developers by MobileTatsu-NJG · · Score: 2, Funny

      I can see that you must be young enough never to have used the search engines you list, if you suggest that you would have been able to find anything useful.

      What are you talking about? I found useful results all the time! I just wish my interest in being a porn connoisseur could be turned into a career.

      --

      "I like to lick butts!" by MobileTatsu-NJG (#32700246) (Score:5, Informative)

    14. Re:Good advice for all developers by Ben1220 · · Score: 2, Funny

      This is exactly why I would rather be a Computer Scientist working at a university or industrial research lab then a software developer. Because I want to create real new stuff.

  2. Just more proof... by pizza_milkshake · · Score: 2, Interesting

    Nil novi sub sole

  3. Re:Patent? by Jorl17 · · Score: 3, Funny

    If you hate Google: Yes. If you don't: No. If you want Bananas: Get them.

    --
    Have you heard about SoylentNews?
  4. Re:Patent? by Meshach · · Score: 5, Informative

    So it could be used as previous art to invalidate Google's patent?

    From my read of the linked article it seems that Sergey and Larry cited the previous art in their publications. So it looks like there was no plagiarism, just building a new idea using the tools provided by an earlier idea.

    --
    "Maybe this world is another planet's hell"
    Aldous Huxley
  5. Re:Patent? by nedlohs · · Score: 3, Informative

    No, since the one from 1941 didn't say "on the internet" or "with a computer".

  6. linearity by bcrowell · · Score: 3, Interesting

    What really shocked me when someone first described page rank to me was that it was linear. I felt that this just had to be wrong, because it didn't seem right for a *million* inbound links to have a *million* times the effect compared to a single inbound link. Maybe this is just the elitist snob in me, but I don't feel that the latest American Idol singer is really a thousand times better than Billie Holliday, just because a thousand times more people listen to him than to her. If it was me, I'd have used some kind of logarithmic scaling. I think people do usually describe page ranks in terms of their logarithms, but that's taking the log on the final outcome. I'm talking about taking logs at each step before going on to the next iteration.

    To me, this has an intuitive connection to the idea that the internet used to be more interesting and quirky, and it was more about individuals expressing themselves, whereas now it's more like another form of TV.

    Of course that's not to say that I want to go back to the days before page rank. God, search engine results were just horrible in those days.

    From an elitist snob point of view, one good thing about page rank is that it doesn't let you just vote in a passive way, as Nielsen ratings do for TV. In order to have a vote, you have to do something active, like making a web page that links to the page you want to vote for.

    1. Re:linearity by Ibiwan · · Score: 5, Insightful

      From a naive, off-the-cuff armchair analysis, it seems to me that PageRank only serves as a way to provide ordering of search results. Funny thing... sorting on positive values will always yield the same ordering as a sort on those values' logs.

      --
      -- //no comment
    2. Re:linearity by martin-boundary · · Score: 4, Informative

      What really shocked me when someone first described page rank to me was that it was linear. I felt that this just had to be wrong, because it didn't seem right for a *million* inbound links to have a *million* times the effect compared to a single inbound link. Maybe this is just the elitist snob in me,

      The algorithm is not at all linear in the effect of inbound links. Two inbound links don't have the same effect, instead their effects are first weighted by the PageRank of each node of origin.

      Now the distribution of PageRank among nodes is approximately power-law distributed on the web. Intuitively, this means that among all inbound links of a particular node, when that number is high, then 99% have practically no effect on the rank of that node, exactly as you probably thought in the first place. More precisely, you can expect a pareto (or similar) distribution for the influence of incoming nodes, which is not unexpected since these sorts of distributions occur a lot in social sciences.

      That said, the PageRank algo is actually linear, but only in the sense of being a linear operator on weight distributions. If you normalize the weights after each iteration, the algo is actually affine (on normalized distributions) rather than linear.

    3. Re:linearity by Shaterri · · Score: 3, Informative

      The reason why PageRank 'has to be' linear is essentially mathematical; treating importance as a linear function of the importance of inbound links means that the core equations that need to be solved to determine importance are linear and the answer can be found with (essentially) one huge matrix inversion. If you make importance nonlinear then the equations being solved become computationally infeasible.

      What's interesting to me is how close the connections are between PageRank and the classic light transfer/heat transfer equations that come up in computer graphics' radiosity (see James Kajiya's Rendering equation); I wonder if there's a reasonable equivalent of 'path tracing' (link tracing?) for computing page ranks that avoids the massive matrix inversions of the basic PageRank algorithm.

    4. Re:linearity by phantomfive · · Score: 2, Insightful

      Maybe this is just the elitist snob in me, but I don't feel that the latest American Idol singer is really a thousand times better than Billie Holliday, just because a thousand times more people listen to him than to her.

      You are measuring the wrong thing. Google isn't measuring who is 'better,' it is trying to measure what page is more interesting to a web surfer, and pages tend to be more popular because they are more interesting to more people. Thus if you do a search for Brittany, you are more likely to find Brittany Spear's fan club than you are an academic study of why her beautiful innocence was so popular (and oh yes was it beautiful!). People who are looking for more specific academic things learn to add extra search terms to their query, like "Brittany analysis" or "Why is Brittany popular?"

      The way to solve this problem better is for Google to get to know you and your preferences: if Google knows that you are mainly interested in academic sorts of things, then it can automatically return that sort of thing when you do a search for Brittany. This is convenient, but drives certain people crazy because of privacy issues.

      --
      Qxe4
    5. Re:linearity by j1m+5n0w · · Score: 2, Informative

      it didn't seem right for a *million* inbound links to have a *million* times the effect compared to a single inbound link

      Pagerank isn't just a citation count; it's defined recursively, such that a link from a page with a high pagerank is worth more than a link from a page with low pagerank. Similarly, a link from a page with many outlinks is worth less than a link from a page with the same rank but few outlinks.

      It does turn out to be more of a popularity contest than a quality metric, though. I think you're absolutely right about that.

    6. Re:linearity by Jerry+Talton · · Score: 4, Insightful

      *sigh*

      You understand neither how the parent post is using the word "linear" nor the PageRank algorithm itself. You can rewrite the eigenproblem at the heart of PageRank as the solution to a linear system, but very few people do. Moreover, this is not the correct intuition to employ to understand what's going on: there are no "massive matrix inversions" here, just a simple iterative algorithm for extracting the dominant eigenvector of a matrix.

      Furthermore, you've got it exactly backwards regarding the "connection" between PageRank and light transfer. Since the Markov process used to model a web surfer in the PageRank paper is operating on a discrete domain with an easily-calculable transition function, the stationary distribution (or ranking) can be determined exactly. In rendering, you have a continuous problem for which Markov chain Monte Carlo techniques provide one of the most efficient ways to approximate the solution...but you have to actually simulate a Markov chain to get it (see, for instance, Veach's seminal paper on Metropolis Light Transport). Computing PageRank is an "easy" problem, by comparison.

  7. Or there's the number ... by PPH · · Score: 5, Funny

    ... of times one Pharaoh's cartouche was chiseled into the obelisks beloning to other Pharaohs.

    --
    Have gnu, will travel.
  8. additional ranking algorithm in the 1940s paper by commodoresloat · · Score: 4, Funny

    allowed pages to be ranked and categorized according to whether it was "insightful," "interesting," "informative," "funny," "flamebait," or "troll."

  9. Vistas in Information Handling, Spartan Press 1962 by ArmchairAstronomer · · Score: 2, Interesting

    The most amazing computer book ever. It has Doug Englebart's first description of “augmenting the human intellect” using computers. It describes what we know now as windows (generic) with pointing devices. It has an early linear document retrieval system using page ranks based on word co-occurrences and it has an early language translation system (Russian to English with examples of translating Soviet missile papers). What a preview of things to come.

    It is worth a read just to get into the heads of some of the computing pioneers.

    Another required reading book for all aspiring CS students should be John Von Neumann’s the “Computer and the Brain.” Dated, but again this is what they were thinking.

    We have a lot to be humble about given the hardware and compilers they had to work with. Not to mention primitive development environments, a.k.a. the card punch.

  10. Re:Patent? by mirix · · Score: 4, Funny

    I'd better get a rush on a patent for "using pagerank on the internet" then. Take that google.

    --
    Sent from my PDP-11
  11. link to the actual paper by phossie · · Score: 2, Interesting

    The summaries were intriguing but lame. Here's the real thing (preprint):

    http://arxiv.org/abs/1002.2858

    Author's page is here:

    http://users.dimi.uniud.it/~massimo.franceschet/

    Interesting stuff.

    --

    [|]
  12. Re:who says it was rediscovered in 2010? by TerranFury · · Score: 2, Informative

    Dude; it's just Jacobi iteration applied to a particular, reasonably intuitive model. This isn't to knock it -- just to say that it was probably easier to reinvent it than to find some obscure paper --- especially one which probably isn't in the electronic databases.