Slashdot Mirror


Interesting Concepts in Search Engines

TheMatt writes "A new type of search algorithm is described at NSU. In a way, it is the next generation over Google. It works off the principle that most web pages link to pages that concern the same topic, forming communities of pages. Thus, for academics, this would be great as the engine could find the community of pages related to a certain subject. The article also points out this would be good as an actually useful content filter, compared to today's text-based ones."

28 of 230 comments

  1. But.... by ElDuque · · Score: 3, Interesting


    Where would Slashdot fit in to this? There's links to everywhere!

    1. Re:But.... by eet23 · · Score: 2, Interesting

      But /. generally links to things that /.ers find interesting. I imagine that sites like this would be a good way of linking wider subject areas (computers + popular science) together.

    2. Re:But.... by Dudio · · Score: 3, Interesting

      It's not just Slashdot either. Blogs by their very nature link to sites/pages about anything and everything. If they could manage to programmatically identify blogs with high accuracy, maybe they could develop a hybrid with Google's algorithm so that focused sites are crawled for links to related sites while blogs and such are used to cast popularity/usefulness votes.

  2. Problem. by DohDamit · · Score: 3, Interesting

    So...the engine crawls through, looking at links, goes to those sites, and looks at more links. So on and so on, until it has a web of links defined. The problem with this approach is that they have to have a VALID starting point OR a valid ending point in order for this method to be of use. In other words, either they have to manually start from a good site for physics, such as Stephen Hawking's homepage, or wind up at a good site for physics, such as oh, Stephen Hawking's website, in order to determine what's a good physics site. In the end, the content still has to be managed, or a porn site manager can still get around all this by linking to all kinds of sites, rather than stuffing their text/metatags. In the end, this solves nothing.

  3. What you say?! by Timmeh · · Score: 0, Interesting
    Better than Google? Blasphemy.

    Knowing Google, they're probably already working on such a thing. Also, I question how well it would work. I'm not saying it doesn't, as they've already shown it to work, but completely ignoring the website's text is like taking two steps forward and three steps back. Wouldn't the ideal search engine combine the best of both worlds, checking both text and links?

  4. Bad Idea - What Happens to Science? by Ieshan · · Score: 3, Interesting

    What happens to journal articles relating to specific content? How do I find information for biology class?

    Currently, I can search Google and find things on the destruction of Balsam Fir in Newfoundland by Alces alces (moose). With this type of search engine, the journals wouldn't be listed because they themselves don't have links to anywhere (most of them are straight magazine-to-HTML conversions or PDFs).

    It'd be difficult as hell to find pertinent information above the level of "3y3 4m Johnny, And Dis 1s Mai W3bsite, 4nd H3r3 Ar3 Mai LinkZorz!"

    1. Re:Bad Idea - What Happens to Science? by TheMatt · · Score: 3, Interesting

      Ah, but there might be links. Most research group pages currently have links to their latest research in the journals. For example, my group has links to J.Chem.Phys.Online or the like, directly to the journals. This type of search could lead you to the journals that are in the area you searched (JCP for me, TetLett for an OrgChemist, etc.)

      Plus the fact that groups mainly link to others doing the same work. So, I can start at one page and soon get an idea of the cluster science community, for example.

      --

      Fortran programmer...oh yeah. Array math for life!

  5. DO NOT MESS WITH GOOGLE! :-) by inerte · · Score: 2, Interesting

    http://www.google.com/search?hl=en&num=10&q=related:slashdot.org/

    http://www.google.com/search?as_lq=www.slashdot.org&btnG=Search

  6. Browser integration by ZaneMcAuley · · Score: 2, Interesting

    Would be nice to have a sidebar for IE or any other browser that could be used as a filter for relevant topics from that site, configurable with N depth for searching. Highlighting of relevant topic links and maybe even a graph view.

    --
    ----- Whats wrong with this picture? http://www.revoh.org:1234/whatswrong
  7. The demise of another search engine? by indole · · Score: 3, Interesting


    As much as I (and all of you) love Google, I wonder whether their moral-high-ground approach to search results would exist if they did not already have the world's traffic searching through their site.
    Search engines come and go. When Google has to struggle for its existence against the Next Big Thing, how many of you really believe they won't sell out in order to keep themselves running, in effect putting the last nail in their own coffin?

    We shall see.

    --
    (2,3-Benzopyrrole)
  8. I wish! by dasmegabyte · · Score: 5, Interesting

    Problem with this: "most" websites do not link to sites with similar content. Most websites link to "partner" sites that have nothing in common with them -- after all, who links to a competitor?

    Good websites link to similar sites -- academic websites link to similar sites and sources. This type of search engine would be killer on Internet 2. But on our wonderful, chaotic, porn and paid link filled Internet 1, it's useless. Spider MSN and you'll get a circular web leading to homestore, ms.com, Freedom To Innovate, ZDNet and Slate. Spider Sun and never find a single page in common with their close competitors like IBM.

    What happens when sites get associated with their ads? Search on Microsoft Windows and grab a lot of casino and porn links...because a "security" site covered in porn banners was spidered and came up with top relevancy.

    Now, combined with a click-to-rate-usefulness engine like Google, this could be an interesting novelty. But it'll never be the simple hands off site hunter the big Goo has become.

    --
    Hey freaks: now you're ju
    1. Re:I wish! by Anonymous Coward · · Score: 2, Interesting

      "after all, who links to a competitor?"

      Well, when you are not dealing with commercial sites (or sometimes even when you are), a lot of people.

      Google (and most other search engines) link to tons of other search engines, Art Galleries link to other Art Galleries, and gaming sites link to other gaming sites.

      A lot of other areas are inter-linking too. Sometimes when I am trying to find something I get caught in a loop and have to start over again from a different starting point.

      For instance when finding out information on LED lights.

      I exhausted most of the results that came in for my original search term, so am going to change it to {fibre,fiber} optics LED

      Tada, whole new batch of sites to read through. :)

    2. Re:I wish! by rapid+prototype · · Score: 2, Interesting

      Spider Sun and never find a single page in common with their close competitors like IBM.

      I guess this page doesn't list a bunch of Sun competitors, like IBM, BEA, and CA, then. Even competitors thrive off of partnering with each other.

      -rp

  9. Difficulty of Classification by gnugnugnu · · Score: 3, Interesting

    This sounds like a useful idea and could give us better directory systems, but its utility would be limited. I'm sure there will be people poking holes in this algorithm in no time. Slashdot has an odd mix of subjects loosely tied together. News for Nerds is not a very strict group. Classification and grouping is a hard problem. There is no clear black and white; there are many shades of gray.

    Interesting, but not as interesting as Google Bombing.

    --
    Guilty

  10. Routers by Perdo · · Score: 3, Interesting

    The internet is the world's best source of information and while transport of that information is built in, organization of that information is not. We have only half an internet.

    We will really know what is out there on the net when Cisco includes a search function in their routers. Distributed searching. Access to over 90% of the world's data. Anonymous usage statistics. Person X searched for data (a) and spent the most time at www.example.xyz. Cross-reference it all and include hooks into TCP/IP V.x for cataloging search, usage and content statistics.

    A website might contain information about leftover wiffle-waffles. That website sends that same data 1000 times an hour to end users. I want the router to pipe up "1000 unique page views for leftover wiffle-waffles". Somewhere else, a router says "500 unique page views for leftover wiffle-waffles". So when I do my search, I get 2 hits, most popular and second most popular.

    Why incorporate it into TCP/IP? What good is moving all that data if it is just a morass of chaos? Let that which transports it also serve to catalog it. Currently, the content of user data is transparent to TCP/IP. But if I wanted my data to be found, I could enclose tags that would allow the router to sniff my data, ensuring my data was included in the next real-time search.

    --

    If voting were effective, it would be illegal by now.

  11. Exploiting search engines that rank popularity by Violet+Null · · Score: 5, Interesting

    Interesting article here at http://www.operatingthetan.com/google/ about how the Church of Scientology exploits Google's ranking system.

    The basic gist is that Google flags pages as more important (or higher relevance) if they have more links pointing to them...so the CoS makes thousands of spam pages that point at its main pages. Google sees the thousands of links, assigns the main CoS pages a high relevance, and thus they're the first to come up in any scientology-related search.

    The moral being, for any new cool search technique devised to help fetch more relevant content, there'll be someone out there looking for a way to defeat it.
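
    A toy sketch of that exploit, assuming the engine ranks purely by raw in-link count (a deliberate simplification, not how Google actually scores pages). The page names and the `inlink_counts` helper are made up for illustration.

```python
from collections import Counter

def inlink_counts(links):
    """Count how many pages link to each target.

    `links` maps a page name to the set of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in targets:
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

# Hypothetical mini-web: two honest pages, both linking to a critic's site.
web = {
    "fan-page": {"critic-site"},
    "news-page": {"critic-site", "main-site"},
}
before = inlink_counts(web)

# A link farm adds many throwaway pages that all point at main-site.
for i in range(1000):
    web[f"spam-{i}"] = {"main-site"}
after = inlink_counts(web)

print(before["critic-site"], before["main-site"])
print(after["critic-site"], after["main-site"])
```

    Nothing about the honest pages changed, yet main-site now outranks critic-site by a wide margin. Any ranking that trusts link counts alone has this weakness.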

  12. Degrees of separation? by delphin42 · · Score: 2, Interesting

    I don't really buy into the idea that linked pages will necessarily be related to the same subject. Look at sites like Slashdot or CNN, which link to a variety of pages on totally disparate subjects. If you applied transitivity, then you'd end up with every connected page on the web being on the same subject. Page A links to B, so B is probably on the same subject as A. Page B links to page C, so C is probably on the same subject as B, and therefore A as well... Oops. The task of categorizing pages, unfortunately, may always have to be done by humans to be done well. This type of software might help categorize websites, but it won't replace wetware anytime soon.
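
    The transitivity problem can be sketched in a few lines, assuming the naive rule "any link means same subject" and following links in both directions. The page names and the `same_subject_groups` helper are hypothetical. The article's algorithm is presumably smarter than this; the sketch only shows what goes wrong with the naive rule.

```python
from collections import deque

def same_subject_groups(links):
    """Group pages by naively treating every link as 'same subject'.

    `links` maps a page to the pages it links to. Links are followed in
    both directions, so each group is a connected component.
    """
    neighbours = {}
    for source, targets in links.items():
        neighbours.setdefault(source, set())
        for target in targets:
            neighbours[source].add(target)
            neighbours.setdefault(target, set()).add(source)

    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            page = queue.popleft()
            group.add(page)
            for nxt in neighbours[page]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups

topical = {
    "physics-a": ["physics-b"],
    "cooking-a": ["cooking-b"],
}
# One general-interest hub that links into both topics.
with_hub = {**topical, "news-hub": ["physics-a", "cooking-a"]}

print(len(same_subject_groups(topical)))   # physics and cooking stay apart
print(len(same_subject_groups(with_hub)))  # the hub merges them
```

    A single hub page is enough to collapse two unrelated topics into one group, which is the "Oops" above.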

    --
    -- Adam
  13. Hmmm, so it's a "Web"? by Fencepost · · Score: 3, Interesting

    So it's actually working on the basis of webs of related sites - not a novel concept, but useful.

    I suspect that some of the commercial knowledge management tools have been doing something much like this for some time, and TheBrain.com has had a product to manually build this kind of network of clusters for some time. The key thing about this is that with web indexing/cataloging the information needed to do the automatic linking is available.

    TheBrain.com seems to have a working demo of using it for the Web at WebBrain.com based on the Open Directory Project. It's not a great example because of display limitations that don't really let you see more than one cluster of information at the time, but it's one example of the general concept. Once you dig down in an area you can see how it shows links between related categories as well.

    (Note: the demo above says it requires Java 1.1 and IE 4.01 or Netscape 4.07+, to bypass that test try here. Seems to work fine in Netscape 6.2, and will probably be OK in Mozilla if the JRE is available.)

    --
    fencepost
    just a little off
  14. why not a 3d search engine? by Lumpy · · Score: 5, Interesting

    How about displaying the hits as dots in a 3D image, with the zero point of xyz being the absolute most significant, and then spreading out the hits from there? That way we can zoom in on what interests us in the search.

    E.g., search for linux apache router

    linux is one axis, apache is another and router is a 3rd. The more relevant a page is in that context, the closer to zero it will show up, while a page about linux apache donuts will resolve close to zero on the X and Y axes but be way out on the Z axis.
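
    A rough sketch of how pages could be placed on those axes. The scoring rule here is an assumption, not something from the post: each coordinate is 1 / (1 + number of times the term appears), so frequent mentions pull a page toward zero and a missing term leaves it at 1.0. The `term_coordinates` helper and the sample pages are made up.

```python
def term_coordinates(text, terms):
    """Place a page on one axis per search term.

    Assumed scoring: coordinate = 1 / (1 + occurrences), so a page that
    mentions a term often sits near 0 on that axis and a page that never
    mentions it sits at 1.0, the far end.
    """
    words = text.lower().split()
    return tuple(1.0 / (1 + words.count(term.lower())) for term in terms)

terms = ["linux", "apache", "router"]
pages = {
    "howto": "linux apache router setup linux router apache",
    "donuts": "linux apache donuts linux apache",
}
coords = {name: term_coordinates(text, terms) for name, text in pages.items()}
for name, xyz in coords.items():
    print(name, xyz)
```

    The donuts page matches the howto page on the linux and apache axes but lands at the far end of the router axis, as described above. A real engine would need a better relevance measure than word counts.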

    --
    Do not look at laser with remaining good eye.
  15. This is not new work by Anonymous Coward · · Score: 5, Interesting

    This work is not new. In fact I submitted my undergraduate thesis on this topic. The roots of this work are really in citation analysis, where the idea is that references that are highly cited are high-quality references (this is the idea that Google is built around). Extending this to the web, a "reference" can be thought of as a "link", and you can generalize the hypothesis to: "similar works link to each other". Therefore you should be able to find communities of similar documents by following links within documents.

    Intuitively this seems reasonable, and in practice this is often the case when there is no conflict of interest for a document to link to another document (as in the case of researchers linking to other works in their field). Yet often this is not the case when there *are* conflicts of interest (a pro-life site will probably not link to a pro-choice site; BMW will probably not link to Honda or any of its other competitors). Therefore, since the truth of the hypothesis that "similar documents link to each other" is not clear, I worked to test this very idea.

    To do this I used Fish Search, Shark Search, and other more advanced "targeted crawling algorithms" that take the connectivity of documents into account (as discussed in the Nature article). These algorithms often go further than the link relationship alone: they also use the text of the link itself and the text surrounding it when choosing which links are the "best ones" to follow, so that a community of related documents can be discovered in a reasonable amount of time. (You'd have to crawl through a lot of documents even if pages have as few as, say, 6 links on average! Choosing good links to follow is crucial for timely discovery of communities.)

    The conclusion of my thesis was that it is (unfortunately) still not clear whether the hypothesis holds. I only did this work on a small subset of web documents (about 1/4 million pages), so perhaps a better conclusion would be reached by using a larger set of documents (adding more documents can potentially add more links between documents in a collection).

    What I did discover, however, was that if document communities do exist, you have a statistically good chance of discovering a large subset of the documents in the community by starting from any document within the community and crawling to a depth of no more than 6 links away from the starting point. (This is useful to know because it gives your crawler a bound on the depth it has to crawl from any starting point.) Moreover, if you have a mechanism for obtaining backlinks (i.e., the documents that link to the current document), you can discover even more of the community...
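
    The depth bound can be illustrated with a minimal sketch. This is plain breadth-first search over an in-memory link graph, not Fish Search or Shark Search: it does not score links by their text. The `crawl_to_depth` helper and the ten-page chain are made up for the example.

```python
from collections import deque

def crawl_to_depth(links, start, max_depth=6):
    """Breadth-first walk of an in-memory link graph, bounded by depth.

    `links` maps a page to the pages it links to.
    Returns {page: depth at which it was first reached}.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links past the bound
        for target in links.get(page, ()):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Hypothetical chain of ten pages: p0 -> p1 -> ... -> p9
chain = {f"p{i}": [f"p{i + 1}"] for i in range(9)}
found = crawl_to_depth(chain, "p0")
print(sorted(found, key=found.get))
```

    With the default bound of 6 the crawl stops at p6 and never fetches p7 through p9. Backlinks could be handled the same way by passing a reversed copy of the graph, assuming some source of backlink data exists.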

  16. Many web-marketing businesses based on this by SimplyCosmic · · Score: 3, Interesting

    The CoS isn't the only one that tries to use this technique to make its sites rise in search engine ratings.

    There are a number of those "Get More Hits For Your Website Cheap!" sites which try to do so by getting member sites to download an html file which contains links to most of their members, and then have you link this from your own site.

    Much like a pyramid scheme, as new members join they get the same file with links to your site, thereby increasing the number of sites with links to you and possibly raising your position in search engines.

  17. This Could Actually Help Enhance Accuracy by FreeUser · · Score: 5, Interesting

    You make a very interesting point:

    Problem with this: "most" websites do not link to sites with similar content. Most websites link to "partner" sites that have nothing in common with them -- after all, who links to a competitor?

    Good websites link to similar sites -- academic websites link to similar sites and sources.


    Combine the algorithm described in this article with Google's approach (or some other contextual approach to determining relevance) and you not only have a way of identifying "communities," you have a way of easily identifying "marketdroid mazes of worthless links" as well.

    Since the content of most marketdroid sites is usually next to worthless, the hits for a given search could be ordered accordingly: sites, and groups of sites, that clearly form communities related to the topic you're interested in at the top, single websites not yet linked to somewhere in the middle, and marketdroid "partner" sites at the very bottom.

    This would actually produce better, more useful results than either approach alone.

    --
    The Future of Human Evolution: Autonomy
    1. Re:This Could Actually Help Enhance Accuracy by Shotgun · · Score: 3, Interesting

      A decent implementation of the algorithm would search for and rank pages as it currently does, but then 'communitize' the results. If result 5 is in the same community as result 2, don't put 5 on the page. Instead, change result 2 to point to another page that will list most of the community that 2 and 5 share, and then increase the ranking of 2 or just list the 'communities' first in the result. This will greatly shorten the results that must manually be looked at by categorizing them.
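
      A minimal sketch of that 'communitize' step, assuming the engine already has a best-first result list and a page-to-community mapping from some earlier analysis. The `communitize` helper, the result names and the community label are all hypothetical.

```python
def communitize(ranked, community_of):
    """Collapse ranked results so each community appears once.

    `ranked` is a best-first list of page names; `community_of` maps a page
    to a community label (pages missing from it are kept as singletons).
    Each community takes the position of its best-ranked member.
    """
    order, members = [], {}
    for page in ranked:
        key = community_of.get(page, page)
        if key not in members:
            members[key] = []
            order.append(key)
        members[key].append(page)
    return [(key, members[key]) for key in order]

ranked = ["r1", "r2", "r3", "r4", "r5"]
community_of = {"r2": "cluster-science", "r5": "cluster-science"}
collapsed = communitize(ranked, community_of)
for key, pages in collapsed:
    print(key, pages)
```

      Result 5 disappears from the main list and shows up under the community entry that took result 2's slot. This sketch keeps the original ordering; boosting communities to the top would be a separate sort step.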

      --
      Aah, change is good. -- Rafiki
      Yeah, but it ain't easy. -- Simba
  18. Oracle of Bacon by harmonica · · Score: 3, Interesting

    In addition to the other replies, here's the link to the Oracle of Bacon that lets you find out the degree of separation between Kevin Bacon and any other person who is featured at IMDB.

    There is also a generic search that lets you combine any actor with any other actor. Unfortunately I have forgotten who the best-connected actor was (average to all other actors is smallest). Anyone?

  19. Unfavorable to E-commerce by version5 · · Score: 2, Interesting
    This model doesn't favor businesses at all. If there's one thing a commerce-oriented website won't do, it's put a lot of links to their competitors. Depending on your point of view, this is a great thing. Unfortunately, many businesses believe that advertising means getting in your face as much as possible, there's no such thing as bad press, etc.

    Amazon.com is an example of this: I bought a pair of speakers from them several months ago, and yet every time I go there, they helpfully inform me that they have these great new speakers on sale! Buy now! I suppose it works to recommend similar books and CDs, but when someone buys speakers, they usually stop being in the market for speakers after that.

    Anyway, I don't know why no-one has thought of making an e-commerce-only search engine. I think there's a clear distinction between those two types of searching that warrants a separate engine. Sometimes you want to buy stuff, and sometimes you just want information. When you are doing one, the other just gets in the way, and disguising advertising as content like AOL/MSN/AltaVista do just discourages you from using their services. Obviously, web-based businesses have a long way to go before they actually realize, "Oh, Internet users don't like to be tricked! Maybe if we were straight-forward with consumers they'd be more trusting of us."

    --

    "It's Dot Com!"

  20. What about non-Western cultures? by Com2Kid · · Score: 3, Interesting

    Anybody else here (well, obviously, likely quite a few people) ever browse around foreign (to me at least. :P ) language sites?

    The culture that exists therein is definitely quite different.

    Japanese sites are even MORE self-referencing than American sites. This trend has taken off on American sites too, in the form of Cliques, which themselves tend to lie outside of the sites that many of us /.'ers typically frequent.

    Seriously though, in Japan it seems that sites actually have others ask permission to link to them! (As an aside, whenever that topic is brought up here on /. people tend to get all freaked out about it. ^_^ )

    This obviously creates a VERY different social structure that heavily alters the dissemination of information, not to mention the way that sites are linked together.

    Here in the states (or any other culture that has pretty much a free linking policy) it is common to say "oh yah, and for more info go to this site over here and also this site here has some good information and and and . . . . "

    Anybody who reads www.dansdata.com knows how he (uh, Dan obviously. :) ) likes to sprinkle relevant to kind-of-relevant links throughout his articles and reviews. Almost all of the links are VERY interesting and much can be learned from them (he does link to e2.com quite a bit though. :P ), and many of the links are to further outside resources on the same topic.

    (such as a LED light review having links to the Online LED museum)

    In a culture where linking is not so free, I would think that there would be more of a trend towards keeping a lot of the information in-house so to speak, and thus at the very least the biases that the search engine uses to judge relevance of links would have to be altered a bit.

    Links would have to be given a higher individual weight, since there would be a larger chance of them being on topic.

  21. Credit where credit is due by Quixote · · Score: 3, Interesting

    Prof. Jon Kleinberg of Cornell did this work many years ago. IIRC, he was the first to come up with the idea (first published in 1997). Check out his list of publications for the work (and related stuff).
    Disclaimer: I happen to know him, but this is not biased.

  22. *BZZT* please try again... the real origin: by Technodummy · · Score: 5, Interesting

    "Before all the ruckus of living in a "global village" where we are all connected via the internet, there was the idea of "six degrees of separation," or the "small world theory." The theory posits the idea that everyone in the world is separated from everyone else by only an average of six people. That is to say, the only thing which separates you from the President, the Pope, a farmer in China, and Kevin Bacon is six people.

    It's a strange and beautiful concept. It is fascinating to think that we are all in some way interrelated by only six people or that we have some connection to people even in the remotest part of the world.

    The "small world" theory was first proposed by the eminent psychologist, Stanley Milgram. In 1967 he conducted a study where he gave 150 random people from Omaha, Nebraska and Wichita, Kansas a folder which contained a name and some personal data of a target person across the country. They were instructed to pass the document folder on to one friend that they felt would most likely know the target person.

    To his surprise, the number of intermediary steps ranged from 2 to 10, with 5 being the most common number (where 6 came from is anyone's guess). What the study proved was how closely we are connected to seemingly disparate parts of the world. It also provided an explanation for why gossip, jokes, forwards, and even diseases could rapidly spread through a population.

    Of course, the six people that connect you and the President aren't just any six people. The study showed that some people are more connected than others and act as "short cuts," or hubs which connect you to other people.

    Take, for example, your connection with a doctor in Africa. Chances are your six childhood friends who you've grown up with aren't going to connect you to someone across the country, much less across the ocean. But let's say you meet someone in college who travels often, or is involved in the military or the Peace Corps. That one person who has traveled and has had contact with a myriad of other people will be your "short cut" to that doctor in Africa.

    Likewise, say that you want to figure out your connection to a favorite Hollywood socialite. If you have a friend who is well connected in the Industry, that person will act as a bridge between your sphere of existence and the Hollywood circuit.

    The Proof

    Mathematicians have created models proving the validity of the "small world" theory.

    First, there is the Regular Network model where people are linked to only their closest neighbors. Imagine growing up in a cave and the only people you have contact with for the rest of your life are in that cave with you.

    Then there is the Random Network model where people are randomly connected to other people regardless of distance, space, etc..

    In the real world, human interconnectedness is a synthesis of these two models. We are intimately connected to the people in our immediate vicinity (Regular Network), but we are also connected to people from distant random places (Random Network) through such means as travel, college, and work. It is by our intermingling with different people that our connections increase.

    You may meet someone in class that is from a different country, or whose father works in Hollywood, or whose mother owns a magazine. By this mingling and constant interaction your potential contact with the rest of the world increases exponentially.
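
    The mix of the two models can be sketched numerically. What follows is an illustration under stated assumptions, not the mathematicians' actual model: a ring of 100 people who each know only their two nearest neighbours stands in for the Regular Network, and ten randomly chosen long-range links are added on top. The helper names, sizes and seed are all made up.

```python
import random
from collections import deque

def ring(n, k=1):
    """Regular network: each node knows its k nearest neighbours on each side."""
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0} for i in range(n)}

def add_shortcuts(graph, count, seed=0):
    """Return a copy of graph with `count` random long-range links added."""
    rng = random.Random(seed)
    g = {node: set(nbrs) for node, nbrs in graph.items()}
    nodes = list(g)
    for _ in range(count):
        a, b = rng.sample(nodes, 2)
        g[a].add(b)
        g[b].add(a)
    return g

def average_separation(graph):
    """Mean shortest-path length over all reachable pairs, via BFS from each node."""
    total = pairs = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

regular = ring(100)
mixed = add_shortcuts(regular, 10)
print(average_separation(regular), average_separation(mixed))
```

    On the plain ring the average separation is about 25 steps. Adding links can never lengthen a shortest path, so the mixed network's figure can only be equal or lower, and a handful of well-placed "short cuts" usually lowers it a lot.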

    The Internet

    The Small World theory is interesting in light of recent advances in communication technology--namely, the internet.

    You can now instantly make contact with someone across the world through a chat room, email, or through ICQ. In all of human history, it has never been easier to get in touch with someone across the globe.

    The great irony, of course, is that although we are making contact with such a vast number of people, the quality of the contact is becoming terribly depersonalized. Our email, chat, and ICQ friends may number in the hundreds, but for the most part we'll only know them as a line of text skittering across the screen and a computer beep.

    That's not to say that there is never a cross over from the virtual world of the internet to the "real" world. But a majority of the time, the closest you'll get to actually meeting your fellow e-buddies in the flesh are the pictures they email you (notice how everyone oddly looks like Pam Lee or Tom Cruise), or a series of smilies (meet my friend Sandra :), Jenny :P, Bill :{, and Chrissy 23).

    Never in the history of mankind has there been so much technology to keep us connected, and yet so little true connection. Everything from cellular phones, pagers, voice mail, and email was designed so that we would never be alone again. Human contact would only be a few convenient buttons away. But what seems to be happening is that the convenient buttons are superseding real people. Despite the appearance of all this technology, we're still pretty much where we started, with the exception of a motley crew of digital displays, flashing lights, and cutesy computer alerts to keep us company.

    Don't get me wrong. The Internet Revolution is great and is making our lives easier. But as with ice cream, money, and sex -- too much of a good thing can be bad (money and sex are sometimes exceptions). What good are all the conveniences and promises of instant material gratification if you don't really live? The virtual world is good, but we shouldn't forsake the real world for it. The macabre image in the Matrix where we are all plugged into computers unbeknownst to us is a parable of what could be our future: a future where people never leave their homes and where we're all so dependent on computers that we wouldn't be able to walk outside without a pang of separation anxiety.

    As we enter the new millennium, there is no doubt that we will be living increasingly wired existences. Perhaps Milgram's study will be annotated, and perhaps we will find that we're only separated by three degrees of email. But what good is that if the only "handshakes" going on are between our computers??

    Russ