Interesting Concepts in Search Engines
TheMatt writes "A new type of search algorithm is described at NSU. In a way, it is the next generation over Google. It works off the principle that most web pages link to pages that concern the same topic, forming communities of pages. Thus, for academics, this would be great as the engine could find the community of pages related to a certain subject. The article also points out this would be good as an actually useful content filter, compared to today's text-based ones."
Google pioneered the use of links to deduce pages' relevance. Its PageRank technology counts a link from site A to site B as a vote for B from A. But it does not take account of all the other sites to which A has links, as NEC's new technique does.
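For reference, here is a minimal sketch of the PageRank formulation from the published Brin and Page paper, not of whatever Google actually runs in production. The function name `pagerank`, the damping value of 0.85, the iteration count, and the toy link graphs are assumptions chosen for illustration. In the published formula a page's vote is divided among all of its outgoing links, so the number of other sites A links to does affect how much B receives.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page's vote is split evenly across its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graphs: identical except for how many sites A links to.
focused = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A"]}
scattered = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}

focused_rank = pagerank(focused)
scattered_rank = pagerank(scattered)
print("B when A links only to B:", round(focused_rank["B"], 3))
print("B when A links to B, C, D:", round(scattered_rank["B"], 3))
```

In this toy setup B ends up with a lower score when A also links to C and D, because A's vote is shared three ways.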
I won't pretend to know all the inner workings of Google's search engine technology. But I believe that Google DOES care about other links from site A. This falls into the hub and authority model, which is defined recursively. A hub is a site that links to a lot of authority sites. An authority site is a site that is linked to by a lot of hubs. Basically, authorities provide the content, and hubs provide links to the content. In this example, B is an authority site, and A is a hub.
The way the ranking works is that if B is linked to by a large number of quality hub sites, then it gets a correspondingly high quality rating. Likewise, if a hub links to a large number of high-quality authority sites, then its own quality will also be ranked highly as a result.
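The recursive hub/authority definition can be sketched as a simple iteration in the style of Kleinberg's HITS algorithm. This is only an illustration of the model described here, not a claim about what Google runs; the function name `hits`, the iteration count, and the made-up site names are all assumptions.

```python
import math


def hits(links, iterations=50):
    """Toy hub/authority scoring. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score sets so the values do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: two link pages about cars, plus one page with a single link.
links = {
    "hub1": ["cars_a", "cars_b"],
    "hub2": ["cars_a", "cars_b", "cars_c"],
    "lonely": ["cars_c"],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("authority scores:", {p: round(s, 3) for p, s in auth.items() if s > 0})
```

In this toy graph the page that links to the most well-regarded content scores highest as a hub, and the content pages linked from both hubs score higher as authorities than the one linked from only a single strong hub.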
This also allows Google to provide links to sites even if the search terms don't match the content of that site. A hub that links to a lot of sites about cars will relate cars to ALL of its links, regardless of whether the word "car" appears on the site being linked to.
Of course, I'm not THAT familiar with Google. It's possible I'm full of bunk. But I'm pretty sure it works this way to some extent and that Google does pay attention to the hub-based links.
-Restil
Play with my webcams and lights here
Here are a few papers that better describe the rank technology involved:
/ v5 i1p1.html
x tr acting_macrosopic_information_from_web_links.pdf
http://www.cindoc.csic.es/cybermetrics/articles
http://www.scit.wlv.ac.uk/~cm1993/papers/2001_E
Quidquid latine dictum sit altum videtur
Here is the research working paper that goes into detail.
~ fact is not dependent upon your belief therein. ~ ~ Have I therefore become your enemy because I tell you the truth?
Clustering pages is what other search engines like Teoma are doing already.
In a recent interview in c't magazine, a Google employee (Urs Hölzle) said, when asked about clustering, that they had tried that a long time ago, but they never got it to work successfully. He mentioned two problems:
- the algorithms they came up with delivered about 20 percent junk links for almost all topics
- it's hard to find the right categories and give them correct names, esp. for very generic queries
Of course, just because Google didn't get it to work properly doesn't mean nobody else can. But it's harder than it looks, and it's been known for quite a while.
A PostScript document detailing his research.
Also, if you're a member of IEEE Computing, you can see his publication.
The idea predates Google; it probably predates you. They did it in print, way back when.
Did you read the update on the page, or are you just parroting the previous +5 post on this?
Since this was first brought up a few days ago, the Scientology volunteer editor at the Open Directory Project, an upstream content provider for Google, was fired.
For anyone out there who doesn't quite know why this is +5 worthy, here is the joke:
:) This is, of course, a very pop-culture oriented joke that will probably fade even more quickly than AYB did after its behemoth prime of last year and the December before. Long live the meme.
On Super Bowl Sunday a commercial aired, featuring none other than Kevin Bacon at a retail store, trying to use a check to pay for his goods. The man behind the counter asked to see ID, but Bacon didn't have any on him. What now? Bacon runs around town gathering people (an extra he played in a movie with, a doctor, a priest, an attractive girl, and maybe one other guy?), who all had some ties to one another, through the other 6 in the group. The attractive girl once dated the sales clerk in the store, so Kevin explains that they are "practically brothers," hence putting to good use the principle of six degrees of separation.
Therefore, the humor lies within.
Man is born free; and everywhere he is in chains.
Incidentally, Web of Science also indexes Humanities and Social Science publications.
Clever does Google one better by separating the results of searches into "hubs" and content. Hubs are sites with lots of links on a particular subject. Content sites are the highly rated sites linked to by the hubs.
I thought it was a very interesting concept and I am surprised that it was not commercialized. Of course, IBM is in the business of buying banner ads rather than selling them. They could always do like /. and OSDN and mostly run ads for their own stuff though....
Lasers Controlled Games!
This sounds like a subset.
I've seen a page of Google search results where a "Related pages" link was provided below certain search results.
Here are the top 1000. Number 1 is Christopher Lee (Saruman in FotR), probably largely because he's been in 228 films.
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
This *is* a subset of Google. It's well known that a site talking about an art topic that is linked from many sites that rank high on art and link heavily among themselves will rank higher than the exact same site if it is linked by sites that are themselves linked from few art sites, no matter how heavily they are linked in other domains.