Interesting Concepts in Search Engines
TheMatt writes "A new type of search algorithm is described at NSU. In a way, it is the next generation over Google. It works off the principle that most web pages link to pages that concern the same topic, forming communities of pages. Thus, for academics, this would be great as the engine could find the community of pages related to a certain subject. The article also points out this would be good as an actually useful content filter, compared to today's text-based ones."
Google pioneered the use of links to deduce pages' relevance. Its PageRank technology counts a link from site A to site B as a vote for B from A. But it does not take account of all the other sites to which A has links, as NEC's new technique does.
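For reference, here is a minimal sketch of the "link as a vote" idea, following the PageRank formulation published by Brin and Page rather than whatever Google runs in production (which is not public). In that published version, a page's vote is split evenly among its outgoing links, so A's other links do affect how much B receives. The function name `pagerank`, the `links` dictionary format, and the handling of pages with no outgoing links are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page's vote is divided among all the pages it links to.
                share = damping * rank[p] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over every page.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # A links only to B; C splits its vote between B and D.
    example = {"A": ["B"], "C": ["B", "D"]}
    for page, score in sorted(pagerank(example).items()):
        print(page, round(score, 4))
```

In the example, B collects A's whole vote plus half of C's, while D gets only the other half of C's, so B ends up ranked above D.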
I won't pretend to know all the inner workings of Google's search engine technology. But I believe that Google DOES care about other links from site A. This falls into the hub and authority model, which is defined recursively. A hub is a site that links to a lot of authority sites. An authority site is a site that is linked to by a lot of hubs. Basically, authorities provide the content, and hubs provide links to the content. In this example, B is an authority site, and A is a hub.
The way the ranking works is that if B is linked to by a large number of quality hub sites, then it gets a correspondingly high quality rating. Likewise, if a hub links to a large number of high-quality authority sites, then its quality will also be ranked highly as a result.
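The mutual reinforcement described above matches the hub and authority scheme known as HITS (Kleinberg's algorithm). Here is a rough sketch of it, not a claim about what Google actually runs. The function names `hits` and `normalize`, the `links` dictionary format, and the fixed iteration count are assumptions made for illustration.

```python
import math


def normalize(scores):
    """Scale scores so their squares sum to 1 (keeps values from blowing up)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=50):
    """Toy hub/authority scoring. `links` maps a page to the pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {
            p: sum(new_auth[dst] for dst in links.get(p, []))
            for p in pages
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


if __name__ == "__main__":
    # A and E link to both B and C; D links only to B.
    example = {"A": ["B", "C"], "D": ["B"], "E": ["B", "C"]}
    hub, auth = hits(example)
    print("hubs:", {p: round(s, 3) for p, s in sorted(hub.items())})
    print("authorities:", {p: round(s, 3) for p, s in sorted(auth.items())})
```

In the example, B is linked to by more hubs than C, so it comes out as the stronger authority, and A and E score higher as hubs than D because they point at both authorities.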
This also allows Google to provide links to sites even if the search terms don't match the content of that site. A hub that links to a lot of sites about cars will relate cars to ALL the sites it links to, regardless of whether the word "car" appears on the site that is returned.
Of course, I'm not THAT familiar with Google. It's possible I'm full of bunk. But I'm pretty sure it works this way to some extent and that Google does pay attention to the hub-based links.
-Restil
Clustering pages is what other search engines like Teoma are doing already.
In a recent interview in c't magazine, a Google employee (Urs Hölzle) said, when asked about clustering, that they had tried that a long time ago, but they never got it to work successfully. He mentioned two problems:
- the algorithms they came up with delivered about 20 percent junk links for almost all topics
- it's hard to find the right categories and give them correct names, esp. for very generic queries
Of course, just because Google didn't get it to work properly doesn't mean nobody else can. But it's harder than it looks, and it's been known for quite a while.
Did you read the update on the page, or are you just parroting the previous +5 post on this?
Since this was first brought up a few days ago, the Scientology volunteer editor at the Open Directory Project, an upstream content provider for Google, was fired.