Interesting Concepts in Search Engines

Posted by CmdrTaco on Thursday March 7, 2002 @08:21AM from the stuff-to-think-about dept.

TheMatt writes "A new type of search algorithm is described at NSU. In a way, it is the next generation over Google. It works off the principle that most web pages link to pages that concern the same topic, forming communities of pages. Thus, for academics, this would be great as the engine could find the community of pages related to a certain subject. The article also points out this would be good as an actually useful content filter, compared to today's text-based ones."
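
For anyone who wants something concrete to poke at, here is a minimal sketch of one way to read "community": a set of pages in which every member has more links (in either direction) to other members than to outsiders. That reading, the function name `is_community`, and the toy page names are assumptions for illustration, not the algorithm from the article.

```python
def is_community(members, links):
    """Check the (assumed) community property on a link graph.

    members: set of page names being tested as a community.
    links:   dict mapping page -> iterable of pages it links to.
    """
    members = set(members)

    # Treat links as undirected: a link in either direction counts.
    neighbors = {}
    for src, targets in links.items():
        for dst in targets:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    for page in members:
        nbrs = neighbors.get(page, set()) - {page}
        inside = len(nbrs & members)
        outside = len(nbrs - members)
        # Each member needs strictly more ties inside than outside.
        if inside <= outside:
            return False
    return True


# Hypothetical toy graph: a, b and c link among themselves; c also links out to x.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["x"], "x": ["y"]}
print(is_community({"a", "b", "c"}, toy_links))
print(is_community({"c", "x"}, toy_links))
```

This only tests a candidate set; finding such sets efficiently on a real crawl is the hard part and is not attempted here.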

Some issues on linking. by Restil · 2002-03-07 08:40 · Score: 5, Informative

Google pioneered the use of links to deduce pages' relevance. Its PageRank technology counts a link from site A to site B as a vote for B from A. But it does not take account of all the other sites to which A has links, as NEC's new technique does.

I won't pretend to know all the inner workings of Google's search engine technology. But I believe that Google DOES care about other links from site A. This falls into the hub and authority model, which is defined recursively. A hub is a site that links to a lot of authority sites. An authority site is a site that is linked to by a lot of hubs. Basically, authorities provide the content, and hubs provide links to the content. In this example, B is an authority site, and A is a hub.

The way the ranking works is that if B is linked to by a large number of quality hub sites, then it gets a correspondingly high quality rating. Likewise, if a hub links to a large number of high-quality authority sites, then its quality will also be ranked highly as a result.
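
The recursive definition above can be sketched as a simple iteration. This is a minimal illustration of the hub/authority idea on a toy graph, assuming a fixed number of rounds and Euclidean normalization; the function names and the example pages are made up, and nothing here is claimed to be what any real search engine runs.

```python
import math


def normalize(scores):
    """Scale scores so their squares sum to 1 (leave all-zero input alone)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hubs_and_authorities(links, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub score: sum of the authority scores of pages linked to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical example: A and D act as hubs, B and C as content pages.
hub, auth = hubs_and_authorities({"A": ["B", "C"], "D": ["B"]})
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

Using adjacency lists instead of a matrix sidesteps the question of which way the link matrix is oriented.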

This also allows Google to provide links to sites even if the search terms don't match the content of that site. A hub that links to a lot of sites about cars will relate cars to ALL the links, regardless of whether the word "car" appears on the linked site.

Of course, I'm not THAT familiar with Google. It's possible I'm full of bunk. But I'm pretty sure it works this way to some extent and that Google does pay attention to the hub-based links.

-Restil

--
Play with my webcams and lights here
1. Re:Some issues on linking. by Anonymous Coward · 2002-03-07 11:32 · Score: 1, Informative
  
  minor nitpick :
  
  Hits: init h,a
  while ( not convergence)
  {
  a=Lh;
  h=L^(-1)a;
  ^^^^^^^^^^^^^
  }
  
  should really be
  h = L^T a;
  
  Another name for this is the
  Kleinberg algorithm. I hope the parent gets
  modded up, since I have seen many people
  mix up the PageRank algorithm with
  Kleinberg's.
  The hub and authority model is more elegant,
  IMO, than the PageRank algorithm, which
  doesn't have as great an intuitive justification.
More Info on Extracting Macroscopic Information by LuxuryYacht · 2002-03-07 08:41 · Score: 2, Informative

Here are a few papers that better describe the rank technology involved:

http://www.cindoc.csic.es/cybermetrics/articles/v5i1p1.html

http://www.scit.wlv.ac.uk/~cm1993/papers/2001_Extracting_macrosopic_information_from_web_links.pdf

--
Quidquid latine dictum sit altum videtur
Efficient Identification of Web Communities by headwick · 2002-03-07 08:42 · Score: 2, Informative

Here is the research working paper that goes into detail.

--
~ fact is not dependent upon your belief therein. ~ ~ Have I therefore become your enemy because I tell you the truth?
Clustering by harmonica · 2002-03-07 08:44 · Score: 5, Informative

Clustering pages is what other search engines like Teoma are doing already.

In a recent interview in c't magazine, a Google employee (Urs Hölzle) said, when asked about clustering, that they had tried that a long time ago, but they never got it to work successfully. He mentioned two problems:
- the algorithms they came up with delivered about 20 percent junk links for almost all topics
- it's hard to find the right categories and give them correct names, esp. for very generic queries

Of course, just because Google didn't get it to work properly doesn't mean nobody else can. But it's harder than it looks, and it's been known for quite a while.
Re:Sparse on details and a working demo by jsprat · 2002-03-07 08:47 · Score: 3, Informative

His homepage

A postscript document detailing his research.

Also, if you're a member of IEEE Computing, you can see his publication.
No, this is not the shiny new thing... by Anonymous Coward · 2002-03-07 09:08 · Score: 2, Informative

ISI has been doing this for years with their databases. You look at a research paper, and jump around by what it cites and what cites it. It's good stuff, helps you find research that's related to what you're doing that you'd have never thought to actually search for.
The idea predates Google, it probably predates you. They did it in print, way back when.
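
The "what it cites and what cites it" navigation comes down to keeping a reverse index next to the forward one. A minimal sketch, assuming citations arrive as a plain dict; the function names and paper IDs are hypothetical and have nothing to do with how ISI actually stores its data.

```python
from collections import defaultdict


def build_cited_by(cites):
    """Invert a citation map.

    cites: dict mapping paper -> list of papers it cites.
    Returns a dict mapping paper -> set of papers that cite it.
    """
    cited_by = defaultdict(set)
    for paper, refs in cites.items():
        for ref in refs:
            cited_by[ref].add(paper)
    return cited_by


def one_hop(paper, cites, cited_by):
    """Papers reachable in one jump, in either direction."""
    return set(cites.get(paper, ())) | cited_by.get(paper, set())


# Hypothetical paper IDs.
cites = {"P1": ["P0"], "P2": ["P1", "P0"]}
cited_by = build_cited_by(cites)
print(sorted(one_hop("P1", cites, cited_by)))
```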
Re:Exploiting search engines that rank popularity by tiltowait · 2002-03-07 09:08 · Score: 5, Informative

Did you read the update on the page, or are you just parroting the previous +5 post on this?

Since this was first brought up a few days ago, the Scientology volunteer editor at the Open Directory Project, an upstream content provider for Google, was fired.
Explanation of the joke by Wire+Tap · 2002-03-07 09:22 · Score: 3, Informative

For anyone out there who doesn't quite know why this is +5 worthy, here is the joke:

On Super Bowl Sunday a commercial aired featuring none other than Kevin Bacon at a retail store, trying to pay for his goods with a check. The man behind the counter asked to see ID, but Bacon didn't have any on him. What now? Bacon runs around town gathering people (an extra he played in a movie with, a doctor, a priest, an attractive girl, and maybe one other guy?), who all had some ties to one another, through the other 6 in the group. The attractive girl once dated the sales clerk in the store, so Kevin explains that they are "practically brothers," hence putting to good use the principle of 7 degrees of separation.

Therefore, the humor lies within. :) This is, of course, a very pop-culture oriented joke that will probably fade even more quickly than AYB did after its behemoth prime of last year and the December before. Long live the meme.

--
Man is born free; and everywhere he is in chains.
Re:Bad Idea - What Happens to Science? by Anonymous Coward · 2002-03-07 09:34 · Score: 1, Informative

If your research institution has a subscription, you can always use something like Web of Science, formerly known as the Science Citation Index. This is a much better tool for finding papers in refereed journals about a particular topic than just searching the web, whatever engine you use. Alternatively, you can search the web on a particular topic, find out who some of the important researchers are, and search Web of Science for their papers.
Incidentally, Web of Science also indexes Humanities and Social Science publications.
This is not a new idea by John+Harrison · 2002-03-07 09:45 · Score: 3, Informative

I will refer you to the Clever project at IBM. I first read about this years ago when Google was still a project at google.stanford.edu.
Clever does Google one better by separating the results of searches into "hubs" and content. Hubs are sites with lots of links on a particular subject. Content sites are the highly rated sites linked to by the hubs.
I thought it was a very interesting concept and I am surprised that it was not commercialized. Of course, IBM is in the business of buying banner ads rather than selling them. They could always do like /. and OSDN and mostly run ads for their own stuff though....

--
Lasers Controlled Games!
1. Re:This is not a new idea by John+Harrison · 2002-03-07 11:57 · Score: 3, Informative
  
  How do you know this is not how Google creates its search results? What you've described sounds exactly like how Google describes their technology:
  I know because I have read about both technologies. I discussed the merits of Clever v. Google a few years ago with classmates that were taking the class at Stanford that spawned Google. That is how I know.
  End of Rant
  There is an excellent article on Clever that appeared in Scientific American a few years ago. It was linked to from the page I originally posted. You should check it out. Clever returns results divided into the categories of "hubs" and "authorities". I have never noticed Google doing that.
  Here is an excellent summary from the article on the differences between Clever and Google:
  Google and Clever have two main differences. First, the former assigns initial rankings and retains them independently of any queries, whereas the latter assembles a different root set for each search term and then prioritizes those pages in the context of that particular query. Consequently, Google's approach enables faster response. Second, Google's basic philosophy is to look only in the forward direction, from link to link. In contrast, Clever also looks backward from an authoritative page to see what locations are pointing there. In this sense, Clever takes advantage of the sociological phenomenon that humans are innately motivated to create hublike content expressing their expertise on specific topics.
  Of course, Google has tweaked their method since this article was written; however, it has not become Clever.
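
To make the "query-independent" side of that comparison concrete, here is a minimal sketch of a link-only ranking computed once for the whole graph, in the style of the published PageRank descriptions. The damping value of 0.85, the handling of pages with no out-links, and the toy graph are assumptions; this is not a claim about what Google actually runs.

```python
def link_rank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page's vote is split evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Assumption: a page with no out-links spreads its vote everywhere.
                share = damping * rank[page] / n
                for dst in pages:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A and C both link to B, and B links back to A.
ranks = link_rank({"A": ["B"], "C": ["B"], "B": ["A"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

No search term appears anywhere in the computation, which is why such scores can be computed ahead of time, whereas a per-query hub/authority pass has to start from a root set built for each search.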
  
  --
  Lasers Controlled Games!
Re:Isn't this just a subset of Google by Anonymous Coward · 2002-03-07 10:26 · Score: 1, Informative

This sounds like a subset.

I've seen a page of Google search results where a "Related pages" link was provided below certain search results.
Re:Oracle of Bacon by swillden · 2002-03-07 11:14 · Score: 3, Informative

Here are the top 1000. Number 1 is Christopher Lee (Saruman in FotR), probably largely because he's been in 228 films.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:Isn't this just a subset of Google by Anonymous Coward · 2002-03-07 12:27 · Score: 1, Informative

This *is* a subset of Google. It's well known that a site talking about an art topic that is linked from many sites that rank high on art and link heavily among themselves will rank higher than the exact same site if it is linked by sites that are themselves linked from few art sites, no matter how heavily they are linked in other domains.