Interesting Concepts in Search Engines

Anything new ? by Anonymous Coward · 2002-03-07 08:24 · Score: 0

Is this new stuff? Isn't Google already doing this and more?

google contest... by edrugtrader · 2002-03-07 08:24 · Score: 1

this was suggested a month ago when google announced the contest... looks like someone over there reads /.

--
MARIJUANA, SHROOMS, X: ONLINE?! - E

Re:google contest... by Agent137 · 2002-03-07 08:26 · Score: 1

Looks more like the same kind of 'expert links' algo that http://www.teoma.com/ uses
Re:google contest... by -brazil- · 2002-03-07 23:32 · Score: 1

I saw a presentation about this method (or one with the same result) about half a year ago at a CS conference.

--
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger

But.... by ElDuque · 2002-03-07 08:24 · Score: 3, Interesting

Where would Slashdot fit into this? There are links to everywhere!

Re:But.... by bourne · 2002-03-07 08:33 · Score: 5, Funny

Slashdot must be the Kevin Bacon of the online world...
Re:But.... by eet23 · 2002-03-07 08:38 · Score: 2, Interesting

But /. generally links to things that /.ers find interesting. I imagine that sites like this would be a good way of linking wider subject areas (computers + popular science) together.
Re:But.... by Dudio · 2002-03-07 08:53 · Score: 3, Interesting

It's not just Slashdot either. Blogs by their very nature link to sites/pages about anything and everything. If they could manage to programmatically identify blogs with high accuracy, maybe they could develop a hybrid with Google's algorithm so that focused sites are crawled for links to related sites while blogs and such are used to cast popularity/usefulness votes.
Re:But.... by kent_eh · 2002-03-07 09:23 · Score: 1

But /. generally links to things that /.ers find interesting

Like Goatse.cx ?
The number of links to that here (which I intentionally didn't add to) would indicate that it's highly relevant to /.

--

---
"I can't complain, but sometimes still do..." Joe Walsh
Re:But.... by neuroticia · 2002-03-07 09:26 · Score: 1

I think it would be fairly easy to differentiate between blogs and "pages of a feather". Blogs would have a highly diverse set of links: one moment linking to Slashdot, then linking to a page with the definition of the word "pink", a porn site, a friend's site, etc. "Pages of a feather" would (usually) link to all one type of site, and based on that they could lump the pages into two rough categories.

-Sara
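
One way to put a number on "highly diverse set of links" is the entropy of the topics a page links out to. This is only a minimal sketch: it assumes each outgoing link has already been given a topic label somehow, and the labels and sample pages below are made up for illustration, not taken from the article.

```python
import math
from collections import Counter

def link_entropy(target_topics):
    """Shannon entropy (in bits) of the topics a page links to.

    0.0 means every link goes to the same topic; higher means more diverse.
    """
    counts = Counter(target_topics)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical topic labels for the outgoing links of two pages.
blog_links = ["news", "dictionary", "porn", "personal"]
focused_links = ["physics", "physics", "physics", "physics"]

blog_score = link_entropy(blog_links)
focused_score = link_entropy(focused_links)
print(blog_score, focused_score)
```

A page scoring near zero would land in the "pages of a feather" bucket, and a high score would suggest a blog. Where to draw the line between the two is a judgment call this sketch does not settle.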
Re:But.... by susano_otter · 2002-03-07 09:36 · Score: 2

Well, maybe it is highly relevant to /. After all, it does get linked a lot. I remember when it started, too. The article about the new .cx TLD domain was featured, and /. promptly predicted that it would only be a matter of time before someone registered "goatse.cx". I don't think a week passed before the first goatse.cx link appeared on the site.

--
Any sufficiently well-organized community is indistinguishable from Government.
Re:But.... by bigWebb · 2002-03-07 10:32 · Score: 2, Insightful

Slashdot fills a different niche, one more similar to that of newspapers than to research sites. The purpose of Slashdot is to provide an overview of many different areas. The new search idea would, as I understand it, work best on very narrow (esoteric perhaps) fields, such as the research into a certain area of a certain discipline. Because Slashdot is more general, it won't have the same community-type setup and will likely not be affected by this new way of searching.
Re:But.... by -brazil- · 2002-03-07 23:34 · Score: 1

Sites that have too many links, or are linked to too often, are discarded before running the algorithm for this reason.

--
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger

NSA, anyone? by MAJ+Rantage · 2002-03-07 08:24 · Score: 1

Thus, for academics, this would be great as the engine could find the community of pages related to a certain subject.

And for intelligence services, a great way to more quickly compile open source intelligence.

Re:NSA, anyone? by Anonymous Coward · 2002-03-07 11:26 · Score: 0

Hmmm....where's this intelligence that you speak of?

Content filter by anon757 · 2002-03-07 08:25 · Score: 1

The idea of it being used as a content filter is interesting. Presumably, you would only be able to get to pages that were part of the 'community' of that information. Of course, there will be problems with this too, but it may end up being better than just content filtering by text strings.

--
The (Hopefully) Great Slashdot Blackout Apr 21-27

Re:Content filter by Drachemorder · 2002-03-07 08:58 · Score: 2

As long as we have the option to not use the filter if we don't want it, I think it's probably a good thing. Anything that has the potential to increase the relevance of my search results is good.
Of course, it could also be used to keep you from seeing things they don't want you to see. Then again, most technologies carry that risk, I think.
Re:Content filter by neuroticia · 2002-03-07 09:45 · Score: 1

It would be cool if a search engine had a series of checkboxes for different filters to apply. "Search by community via links", "search for community via text", "search by community via metatags", etc. Then you could make the search broader/more narrow both by the checkboxes you choose and by the search terms.

Or maybe an "extensible" search engine where you can enter your own algorithms in some sort of scripting language. Of course, that would only be useful for /.'ers.

-Sara
Re:Content filter by cliche · 2002-03-07 09:52 · Score: 1

There already is a filter that works on that principle, and it also blocks, I think, two or three steps of links from the pages on their list.

Isn't that by Anonymous Coward · 2002-03-07 08:25 · Score: 0

how Google works now?

Re:Isn't that by swagr · 2002-03-07 08:30 · Score: 1

You're right. Google rates pages according to how many references they get from similar pages.
The only thing that this "revolutionary" engine does is show the references.
Note that this may not even be useful. I can link to a good site, but that doesn't make my site automatically good.

--

-... --- .-. . -.. ..--..

Just like people surf by Jafa · 2002-03-07 08:25 · Score: 3, Insightful

This seems pretty cool. The interesting part is that it mimics how people surf anyway. When you find a link from a search engine now, what's your usual routine? Go to the page, look around, find another interesting link, go to that page, maybe go back one and link away again... So this can pre-define that 'island' that you would have manually browsed anyway, but hopefully with better results.

Jason

Re:Just like people surf by fogof · 2002-03-07 08:33 · Score: 1

People usually surf in a depth-first manner. I would think that to define the islands better you would need a breadth-first crawler. So I guess (in my opinion) it would mimic how people surf only to a certain extent.

--
--=.=-- www.cyber2000.qc.ca
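
The difference between the two visiting orders can be seen on a toy link graph. This is a rough sketch only: the graph, the page names, and the `max_depth` cutoff are all invented for illustration, and a real crawler would be fetching pages rather than reading a dictionary.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
LINKS = {
    "start": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [], "d": [], "e": [],
}

def breadth_first(graph, seed, max_depth):
    """Visit pages level by level, up to max_depth links away from the seed."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order

def depth_first(graph, seed):
    """Follow the first unvisited link as far as it goes before backing up."""
    seen = set()
    order = []
    def visit(page):
        seen.add(page)
        order.append(page)
        for target in graph.get(page, []):
            if target not in seen:
                visit(target)
    visit(seed)
    return order

bfs_order = breadth_first(LINKS, "start", max_depth=2)
dfs_order = depth_first(LINKS, "start")
print(bfs_order)  # ['start', 'a', 'b', 'c', 'd', 'e']
print(dfs_order)  # ['start', 'a', 'c', 'd', 'b', 'e']
```

The breadth-first order sees every page one link away before any page two links away, which is closer to mapping out an "island" than a surfer wandering down one chain of links.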
Re:Just like people surf by T3kno · 2002-03-07 08:50 · Score: 1

Hmm, this just made me realize that I must not follow the "normal" browsing patterns (if there really are such things). A typical search for me starts with going to Google and entering my, usually very specific, query. I then look at about the first 10 hits, read the blurbs, and go to the ones that I think will answer my question. Once there, I will give the site a really quick look-over for the information that I am looking for. If I do not see the information, I will quickly hit the back button and go to the next on the list. Unless a site gives me the information I am looking for very quickly, I will not stay long, and I very, very rarely even bother to look for links from that site. If a once-through of the top fifty doesn't do any good, I will usually refine my query. I say usually because I am never consistent. I must say that when I do find a good site, and the links are good, I will bookmark that site and use it often. I really don't think this type of searching would interest me that much because I don't like "prepackaged" search results, mostly because my ideas of categories don't match someone else's.

--
(B) + (D) + (B) + (D) = (K) + (&)
Re:Just like people surf by John+Hasler · 2002-03-07 08:50 · Score: 1

"Go to the page, look around, find another interesting link, go to that page, maybe go back one and link away again..."

I rarely do anything remotely resembling that. My usual routine is: go to a link, find nothing interesting, go back to Google, and repeat until I either find what I want or give up and reformulate my search. It is rare that a page that does not have what I want has links that seem likely to.

--
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
Re:Just like people surf by rjamestaylor · 2002-03-07 10:15 · Score: 1

It's March - time to change your .sig ...

--
-- @rjamestaylor on Ello
Re:Just like people surf by jaju · 2002-03-07 20:30 · Score: 1

When you surf, you look at the surrounding text too.
What's interesting here is that they claim that they completely ignore the text of the page.

--
People will do tomorrow what they did today because that is what they did yesterday.
Re:Just like people surf by Where's+my+towel · 2002-03-08 00:13 · Score: 1

You and half the rest of the net's users: http://www.useit.com/alertbox/9707b.html
Quote: "half of all users are search-dominant, about a fifth of the users are link-dominant, and the rest exhibit mixed behavior."

not very revolutionary if you ask me by Anonymous Coward · 2002-03-07 08:27 · Score: 0

wow, now search engines can return webrings........how not revolutionary is that?

stupid timelimit on posts crap....no way I'm going to get fp now ;)

Problem. by DohDamit · 2002-03-07 08:27 · Score: 3, Interesting

So...the engine crawls through, looking at links, goes to those sites, and looks at more links. So on and so on, until it has a web of links defined. The problem with this approach is that they have to have a VALID starting point OR a valid ending point in order for this method to be of use. In other words, either they have to manually start from a good site for physics, such as Stephen Hawking's homepage, or wind up at a good site for physics, such as oh, Stephen Hawking's website, in order to determine what's a good physics site. In the end, the content still has to be managed, or a porn site manager can still get around all this by linking to all kinds of sites, rather than stuffing their text/metatags. In the end, this solves nothing.

Re:Problem. by Pussy+Is+Money · 2002-03-07 09:02 · Score: 1

Yer even dumber than your handle suggests. Linking to someone doesn't add as much value as being linked by someone. Linking to someone is not as expensive as being linked by someone. In this way you can establish the "cost" or "value" of each page as you crawl the tree and have it hardly matter where you start.

--
Pushin' 'n dealin', shovin' 'n stealin'
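
The idea that inbound links are worth more than outbound ones can be sketched as a PageRank-style iteration. To be clear, this is an illustration of that general idea, not the algorithm from the article: the damping factor of 0.85, the iteration count, and the four-page graph are all assumptions chosen for the example.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """PageRank-style scores: a page's value comes from the pages linking TO it."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own score among the pages it links to.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * score[page] / n
                for p in pages:
                    new[p] += share
        score = new
    return score

# Hypothetical graph: "spam" links to everything, but nothing links back to it.
LINKS = {
    "spam": ["physics", "hub1", "hub2"],
    "hub1": ["physics"],
    "hub2": ["physics"],
    "physics": ["hub1"],
}

scores = rank_pages(LINKS)
for page, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this toy graph "spam" ends up with the lowest score however many good pages it links to, because its outgoing links only give score away. The starting scores are uniform, which is one reading of "hardly matter where you start".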
Re:Problem. by Cheeko · 2002-03-07 09:04 · Score: 1

From some classwork I did a few years back on the algorithm that runs Google, I recall something about links TO a page, or this may have just been a theoretical idea covered at the same time; my memory is fuzzy. The point is that perhaps it doesn't use links FROM a site, but rather links TO a site, or perhaps a combination of the two. So a good physics site would be one that Hawking's page links to, as do lots of other sites that, when indexed, return good results for physics. I'm also assuming that the algorithm doesn't rely purely on links, but uses links as well as some traditional indexing methods to achieve better results. I mean, how good can a physics page be if the term physics doesn't appear in the page in any way, shape, or form?
Re:Problem. by DohDamit · 2002-03-07 09:21 · Score: 1

Good to see we can keep it on a moderately intelligent level. Fine, you can't take the next step, so let me lay it out for you. All it takes is one faked physics/education/sports/religion site to be linked into for all the spammy sites to be brought into the web. Granted, most of the faked sites won't get into the web. But all it takes is one. Hell, are you telling me, you arrogant AND stupid fuck, that you can't see a computer science student or physics student linking to a semi-obnoxious site?

Oh, one more facet. It may not add as much to link as it does to be linked to, but if I link to enough high quality sites, what my site lacks in quality or relevancy will be more than made up for in the quantity of the links, buried on the site behind an image.

Try behaving like a human, rather than a bastard troll who has no friends. Yes, I know, it's a stretch.
Re:Problem. by Pussy+Is+Money · 2002-03-07 09:46 · Score: 1

If you link to high-quality sites then you are adding more value to the high-quality sites than to your own. Sure there are ways to hack the system. You can pay people to link to your site for instance. If that is not an option, then it has always been possible to hack SciAm into publishing a hoax that you concocted.
So the system is hackable. So what.

--
Pushin' 'n dealin', shovin' 'n stealin'
Re:Problem. by Paul+Komarek · 2002-03-07 10:31 · Score: 2

It's pretty easy to discover a group of porn sites heavily interlinking in order to increase their inbound link count. These parts of the web are abnormally cliqueish, by which I mean they approach a fully interconnected set of nodes. Most of the web doesn't work like that.

It's not hard to find popular sites using this methodology, and in this case "popular" is probably as close as you'll ever come to defining a metric for what makes a website a good website. It all depends on what physics (or whatever) sites people link to, which hopefully will be related to how good those sites are. Note that all of the link counts need to take into account some sense of "community": e.g., the magazines Popular Science and Science serve very different communities. So link counts need to be taken relative to other sites "around" them, or some such.

And in the end, this solves a lot of things. For instance, the algorithms will be independent of written human language. They'll also be more robust when classifying pages that use graphics for scientific typesetting (LaTeX) constructs that aren't available in HTML (yet). This is important.

-Paul Komarek
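
"Abnormally cliqueish" can be made concrete as link density: the fraction of possible links inside a group of pages that actually exist. The sketch below is one simple way to measure that, not a method stated in the article. The page names and the two sample graphs are hypothetical, and what density counts as "abnormal" is left open.

```python
from itertools import permutations

def link_density(links, group):
    """Fraction of possible directed links inside `group` that are present.

    1.0 means every page links to every other page (a full clique).
    """
    group = list(group)
    if len(group) < 2:
        return 0.0
    possible = len(group) * (len(group) - 1)
    present = sum(1 for a, b in permutations(group, 2) if b in links.get(a, ()))
    return present / possible

# Hypothetical link ring: every page links to every other page.
RING = {"x1": ["x2", "x3"], "x2": ["x1", "x3"], "x3": ["x1", "x2"]}
# Hypothetical ordinary pages: only a couple of links among them.
ORDINARY = {"p1": ["p2"], "p2": [], "p3": ["p1"]}

ring_density = link_density(RING, ["x1", "x2", "x3"])
ordinary_density = link_density(ORDINARY, ["p1", "p2", "p3"])
print(ring_density, ordinary_density)
```

A group whose density sits near 1.0 looks like the interlinked rings described above, while most ordinary groups of pages should come out far lower.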
Re:Problem. by Alsee · 2002-03-07 14:51 · Score: 2

It may not add as much to link as it does to be linked to, but if I link to enough high quality sites

While I can't say for certain without looking at the exact algorithm, it sounds like good outgoing links would add zero weight to your site. I don't think it would be fooled at all by "links, buried...behind an image"

All it takes is one faked physics/education/sports/religion site to be linked into for all the spammy sites to be brought into the web.

Again, based on my understanding this would also fail. I believe it is measuring multiple inbound routes. While someone might be able to get a "faked" page ranked highly, it would act as a choke point and only add one "point" spread across the "spammy" sites, and wouldn't bring them in.

In order to fool this thing you would have to create several high ranked pages and point them into the "spammy" cluster. It would be quite a bit of work to make enough sites good enough to draw in the needed valuable links. It would also be adding valued content to the web - a public service.

If someone does enough work and contributes enough public service to pull it off, you could say he earned whatever he gets out of it.

-

--
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
Re:Problem. by -brazil- · 2002-03-07 23:39 · Score: 1

It's probably a misrepresentation of the method. The way I heard it, you first take the result of a general webcrawl and then run an algorithm on the resulting graph that discovers the communities. Then you classify the resulting communities using keyword methods.

--
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger

I feel sorry for... by nigelthellama · 2002-03-07 08:28 · Score: 1, Funny

...the poor bastard that searches for pr0n on this search engine. Holy Jesus, can you imagine the links?!?

Re:I feel sorry for... by sharkey · 2002-03-07 09:16 · Score: 2

can you imagine the links?!?

Give me a sec...Ohhh...Ahhh...Mmmmm...
Excuse me, I need a cigarette and a tissue.

--

--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.

What you say?! by Timmeh · 2002-03-07 08:28 · Score: 0, Interesting

Better than Google? Blasphemy.

Knowing google, they're probably already working on such a thing. Also, I question how well it would work, I'm not saying it doesn't, as they've already shown it to work, but completely ignoring the website's text is like taking two steps forward and three steps back. Wouldn't the ideal search engine combine the best of both worlds? Checking text and links?

Interesting by matp · 2002-03-07 08:29 · Score: 1

I get the feeling that there's much more to be done with link analysis. If you think about it, it's very similar to sociology. There are elements of population and community as well as migration and economics. Do you think Tim Berners-Lee could have imagined what he was letting the world in for back in 1990?

Re:Interesting by Anonymous Coward · 2002-03-07 08:51 · Score: 0

"I get the feeling that there's much more to be done with link analysis. If you think about it, it's very similar to Sociology. There's elements of population and community as well as migration and economics. Do you think Tim Berners Lee could have imagined what he was letting the world in for back in 1990."

I do think he did. I also think he wanted even more. Stuff like people creating rich content in their browsers and then hosting it from them, etc. We may have hit the potential of the consumer aspect, but I think we are far from his ideas/aspirations on producing and actually communicating. Especially still in the areas of cross-platform compatibility.

But that would mean... by Anonymous Coward · 2002-03-07 08:29 · Score: 0, Funny

Slashdot would wind up in the same "community" as goatse.cx. Hmm...

Bad Idea - What Happens to Science? by Ieshan · 2002-03-07 08:30 · Score: 3, Interesting

What happens to journal articles relating to specific content? How do I find information for biology class?

Currently, I can search Google and find things on the destruction of Balsam Fir in Newfoundland by Alces alces (moose). With this type of search engine, the journals wouldn't be listed because they themselves don't have links to anywhere (most of them are straight magazine-to-HTML conversions or PDF).

It'd be difficult as hell to find pertinent information above the level of "3y3 4m Johnny, And Dis 1s Mai W3bsite, 4nd H3r3 Ar3 Mai LinkZorz!"

Re:Bad Idea - What Happens to Science? by TheMatt · 2002-03-07 08:55 · Score: 3, Interesting

Ah, but there might be links. Most research group pages currently have links to their latest research in the journals. For example, my group has links to J.Chem.Phys.Online or the like, directly to the journals. This type of search could lead you to the journals that are in the area you searched (JCP for me, TetLett for an OrgChemist, etc.)

Plus the fact that groups mainly link to others doing the same work. So, I can start at one page and soon get an idea of the cluster science community, for example.

--
Fortran programmer...oh yeah. Array math for life!
Re:Bad Idea - What Happens to Science? by Anonymous Coward · 2002-03-07 09:34 · Score: 1, Informative

If your research institution has a subscription, you can always use something like Web of Science, formerly known as the Science Citation index. This is a much better tool for finding papers in refereed journals about a particular topic than just searching the web, whatever engine you use. Alternatively, you can search the web on a particular topic, find out who some of the important researchers are, and search Web of Science for their papers.
Incidentally, Web of Science also indexes Humanities and Social Science publications.
Re:Bad Idea - What Happens to Science? by Alsee · 2002-03-07 15:49 · Score: 2

Bad Idea - What Happens to Science?

As I understand it, the exact opposite would be true.

If you type "Balsam" into google all the top links are related to balsam products, Balsam Lake and Balsam Beach. Instead imagine you could click on the science community, and then perhaps on the biologist sub-community. Now when you type in "Balsam" you get a list of sites that are the most referenced by biologists.

This can be particularly useful when different groups use a word in very different ways. Inflation means something very different to physicists than it does to the general public.

-

--
- - You can't take something off the Internet! That's like trying to take pee out of a swimming pool.

DO NOT MESS WITH GOOGLE! :-) by inerte · 2002-03-07 08:30 · Score: 2, Interesting

http://www.google.com/search?hl=en&num=10&q=related:slashdot.org/

http://www.google.com/search?as_lq=www.slashdot.org&btnG=Search

--
Buy a Nintendo DS Lite

Meaningful category information by pinkUZI · 2002-03-07 08:30 · Score: 1

This is great! Does this mean that we will finally have a search engine with categories that actually mean something? Until now, categories have mostly been self-proclaimed by the submitter to the search engine or the meta-tags on the site itself.
This could mean that browsing by category will become more and more useful in the future.

--
You are receiving this message because your browser supports Slashdot Sigs and you have Slashdot Sigs enabled.

Re:Meaningful category information by Saige · 2002-03-07 09:51 · Score: 1

Does this mean that we will finally have a search engine with categories that actually mean something?

The categories used by Google are just the actual content of the Open Directory Project, which is all done by human volunteers. Google doesn't assign categories itself based on page content; it uses what the DMOZ editors chose.

If you have a problem with the categorization found on Google, then go to the ODP, sign up, and help out yourself.

--
"You know your god is man-made when he hates all the same people you do."

Browser integration by ZaneMcAuley · 2002-03-07 08:30 · Score: 2, Interesting

Would be nice to have a sidebar for IE or any other browser that could be used as a filter for relevant topics from that site, configurable with a search depth of N. Highlighting of relevant topic links and maybe even a graph view.

--
----- Whats wrong with this picture? http://www.revoh.org:1234/whatswrong

Sparse on details and a working demo by wdavies · 2002-03-07 08:31 · Score: 2

I've just searched Google for links to this author and his system, and can't find anything other than a citeseer reference to a project called Deadliner.

Anyway, this would be a much more interesting submission if there had been links to how the algorithm dealt with the computational complexity, or had a site we could Slashdot :-)

Winton

Re:Sparse on details and a working demo by Anonymous Coward · 2002-03-07 08:41 · Score: 0

Deadliner isn't being worked on anymore, I don't think. This project is by Gary Flake & Steve Lawrence at NEC (Steve made Citeseer/Researchindex, Gary does work with web communities) and Frans Coetzee, who used to be at NEC.
http://www.neci.nj.nec.com/homepages/flake/ and
http://www.neci.nj.nec.com/homepages/lawrence/
Re:Sparse on details and a working demo by jsprat · 2002-03-07 08:47 · Score: 3, Informative

His homepage

A postscript document detailing his research.

Also, if you're a member of IEEE Computing, you can see his publication.
Re:Sparse on details and a working demo by persaud · 2002-03-07 17:16 · Score: 1

Details in IEEE Computer March 2002, preprint here.

Working demo of 6000 sites related to 9-11-01.

Rich

Online bookmarks. by fogof · 2002-03-07 08:31 · Score: 2, Insightful

I remember my first page:
My fav site on the internet.
A list of unrelated pages all linked from one spot.
I'm wondering if there are any of those left. And how the search engine would cope with them.
And another point. The article states that new categories can be found. How is the "crawler" going to define the name of the new categories? I feel that the article was too short on details. I mean as a concept it's great. But more information would be cool.

--
--=.=-- www.cyber2000.qc.ca

Re:Online bookmarks. by winjer · 2002-03-07 22:24 · Score: 1

I remember my first page:
My fav site on the internet.
A list of unrelated pages all linked from one spot.
I'm wondering if there are any of those left. And how the search engine would cope with them.

Cope? It does cope. The links aren't all unrelated, they have one major thing in common: they all were chosen by you. What the engine will "perceive" to be a "relation" or a "category" will not necessarily be limited by currently existing official collections of such.

And another point. The article states that new categories can be found. How is the "crawler" going to define the name of the new categories?

Well, once it identifies a "clear and distinct" category, a cursory glance by an intelligent, unbiased librarian should be sufficient to produce a label. Or just use linkwords or meta tags or whatever we use now to come up with a few contenders...big deal.

The demise of another search engine? by indole · 2002-03-07 08:31 · Score: 3, Interesting

As much as I (and all of you) love Google, I wonder whether their moral high ground approach to search results would exist if they did not already have the world's traffic searching through their site.
Search engines come and go. When Google has to struggle for its existence against the Next Big Thing, how many of you really believe they won't sell out in order to keep themselves running, in effect putting the last nail in their own coffin?

We shall see.

--
(2,3-Benzopyrrole)

Re:The demise of another search engine? by greenfly · 2002-03-07 08:55 · Score: 2

Google had this "moral high ground" from the beginning though. It was something that they built everything on top of.

I'm not saying they wouldn't "sell out" (however a business selling a product could sell out, that is), but it seems that their text-based ads work well there, and that they also get a good amount of revenue selling their search tech.
Re:The demise of another search engine? by JoeBuck · 2002-03-07 10:37 · Score: 2

Google's competitive advantage is their reputation. At this stage, any attempt at sellout would backfire badly: anyone willing to pay them money for a better listing will want to stop paying when no one visits Google any more.
Re:The demise of another search engine? by Anonymous Coward · 2002-03-07 18:03 · Score: 0

Google had this "moral high ground" from the beginning though. It was something that they built everything on top of.
That's easy, if no one pays you any attention. How about my homepage? I guarantee I will never accept any money for links on my links page. Well, gee, that was tough. I wonder if when someone comes along when I've filed for bankruptcy and offers me a bunch of $$$ for a link... I'm sure I'll stick to my morals then. NOT!

Isn't this just a subset of Google by nrosier · 2002-03-07 08:31 · Score: 2, Insightful

AFAIK, Google uses several criteria:
It looks, of course, for the words from your search, but also at how close together they are (so if your search string is 3 words and it finds them next to each other, the page gets a higher score than if the words are scattered randomly through the text). It also looks at the links. Pages about the same topic that link to your page give a "vote" for your page. This looks a lot like the "new" search algorithm. Or is the new one the inverse? Instead of giving a vote, a page receives votes if it links to pages about the same topic.
The one thing I'm thinking is that they miss a lot of pages just because they do not contain links.
Anyways, there isn't a lot I haven't found on Google yet (thanks to all its search engines: regular, open directory, images, news...)
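
The "words next to each other score higher" idea can be sketched very roughly as below. This is purely an illustration of the principle: Google's actual scoring is not described in this thread, and the function name, the weights, and the sample texts are all made up.

```python
def proximity_score(text, query):
    """Toy score: one point per query word found, plus a bonus if the
    query words appear next to each other in the same order."""
    words = text.lower().split()
    terms = query.lower().split()
    if not terms:
        return 0
    score = sum(1 for term in terms if term in words)
    n = len(terms)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == terms:
            score += n  # adjacency bonus, awarded at most once
            break
    return score

adjacent = proximity_score("cheap led lights for sale", "led lights")
scattered = proximity_score("lights that use an led", "led lights")
print(adjacent, scattered)  # 4 2
```

Both sample texts contain both query words, but only the first has them side by side, so it gets the higher score.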

Re:Isn't this just a subset of Google by Anonymous Coward · 2002-03-07 10:26 · Score: 1, Informative

This sounds like a subset.

I've seen a page of Google search results where a "Related pages" link was provided below certain search results.
Re:Isn't this just a subset of Google by Anonymous Coward · 2002-03-07 12:27 · Score: 1, Informative

This *is* a subset of Google. It's well known that a site talking about an art topic that is linked from many sites that rank high on art and link heavily among themselves will rank higher than the exact same site if it is linked by sites that are themselves linked from few art sites, no matter how heavily they are linked in other domains.

Look out! by legLess · 2002-03-07 08:36 · Score: 2

I hope he patents this before Google steals his idea!

Joke

--
This isn't as much "normalization" as it is "don't take so many drugs when you're designing tables."

Re:Look out! by Anonymous Coward · 2002-03-07 09:22 · Score: 0

I'm a little confused here - isn't this just another one of those "Well, duh!" advancements like Amazon's flash-bang gee-whiz amazing "One-click" technology? Granted, that stuff's all the rage to patent these days but ...

It's basically one obvious step beyond the obvious - I'm a little surprised you guys think it's so cool.

I wish! by dasmegabyte · 2002-03-07 08:37 · Score: 5, Interesting

Problem with this: "most" websites do not link to sites with similar content. Most websites link to "partner" sites that have nothing in common with them -- after all, who links to a competitor?

Good websites link to similar sites -- academic websites link to similar sites and sources. This type of search engine would be killer on Internet 2. But on our wonderful, chaotic, porn and paid link filled Internet 1, it's useless. Spider MSN and you'll get a circular web leading to homestore, ms.com, Freedom To Innovate, ZDNet and Slate. Spider Sun and never find a single page in common with their close competitors like IBM.

What happens when sites get associated with their ads? Search on Microsoft Windows and grab a lot of casino and porn links...because a "security" site covered in porn banners was spidered and came up with top relevancy.

Now, combined with a click-to-rate-usefulness engine like Google, this could be an interesting novelty. But it'll never be the simple hands off site hunter the big Goo has become.

--
Hey freaks: now you're ju

Re:I wish! by Anonymous Coward · 2002-03-07 08:49 · Score: 2, Interesting

"after all, who links to a competitor?"

Well, when you are not dealing with commercial sites, or even when you are sometimes, a lot of people.

Google (and most other search engines) link to tons of other search engines, Art Galleries link to other Art Galleries, and gaming sites link to other gaming sites.

A lot of other areas are inter-linking too. Sometimes when I am trying to find something I get caught in a loop and have to start over again from a different starting point.

For instance when finding out information on LED lights.

I exhausted most of the results that came in for my original search term, so am going to change it to {fibre,fiber} optics LED

Tada, whole new batch of sites to read through. :)
Re:I wish! by singularity · 2002-03-07 09:19 · Score: 2

Your argument holds true for most commercial sites, but not always for small to medium sized sites. Slashdot, for example, links to several dozen related web sites, a lot of them competition.

In addition, when I search on Google, more often than not I am not looking for huge commercial sites. I am looking for smaller pages, sometimes written and hosted by individuals, that contain information on the subject I am searching for.

These types of pages completely fall under your argument. They are not big enough to warrant ideas like "competition" and "sponsors." It is just some Joe Public, writing a web page about something he is interested in and housing it in the 5 megs of web space his ISP gives him.

--
- (c) 2018 Hank Zimmerman
Re:I wish! by rapid+prototype · 2002-03-07 09:35 · Score: 2, Interesting

Spider Sun and never find a single page in common with their close competitors like IBM.

i guess this page doesn't list a bunch of Sun competitors, like IBM, BEA, and CA, then. even competitors thrive off of partnering with each other.

-rp
Re:I wish! by evil_one · 2002-03-07 09:45 · Score: 1

Uh, slashdot links to everything that Rob and Timmy think is 'cool' - we had a link the other day to a hub-in-a-teddy-bear, and today a link to a PGP press release - the two are as far as can be from each other.

--
Desperation is a stinky cologne
Re:I wish! by briosa · 2002-03-07 09:49 · Score: 1, Redundant

Interesting article here at http://www.operatingthetan.com/google/ about how the Church of Scientology exploits google's ranking system. The basic gist is that google flags pages as more important (or higher relevance) if they have more links pointing to them...so the CoS makes thousands of spam pages that point at its main pages. Google sees the thousands of links, assigns the main CoS pages a high relevance, and thus they're the first to come up in any scientology-related search. The moral being, for any new cool search technique devised to help fetch more relevant content, there'll be someone out there looking for a way to defeat it.
Re:I wish! by jspaleta · 2002-03-07 11:46 · Score: 2

I've seen a version of this algorithm before, at a SIAM conference held at WPI while I was an undergraduate. You can get around the "partner" effect by creating a feedback loop between two types of ranked sites:
group A) Authoritative sites...sites other sites link to but don't necessarily link to a lot of other sites

group B) Link sites...sites that cobble together links to useful authoritative sites based on subject matter.

In your algorithm you keep track of a particular ranking in both groups A and B iteratively.
Yer ranking in group A gets weighted by the quality of the sites in group B that point to you.
Yer ranking in group B gets weighted by how many quality sites you link to in group A.

Iterate the process...and you know what you have...you have an eigenvector problem....and what you get in the end is an eigenvector of highly ranked group B sites which span subsets of group A based on subject matter.

The canonical example is the word jaguar.
Run this algorithm on a search engine and you will get at least 3 very distinct collections....
the animal, the car, and the game system..primarily.

The problem is you have to ITERATE for it to be particularly useful...and that costs cpu time....I don't know if a search engine is gonna want to really invest that time.

Frankly I'm surprised any of this is patentable since I saw this at an academic talk like 6 years ago.
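
For anyone who wants to try the group A / group B feedback loop described above, here is a minimal sketch in Python. The toy link graph, the page names and the fixed iteration count are all made up for illustration, and this is only one plausible reading of the talk, not the patented method. It finds the dominant ranking only; separating the jaguar-style collections would need the further eigenvectors, which this sketch does not compute.

```python
from math import sqrt

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to (hypothetical toy data)."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Group A score: add up the group B scores of the pages pointing at you.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Group B score: add up the group A scores of the pages you point at.
        hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
        # Normalise so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

toy_links = {"list1": ["a", "b"], "list2": ["a", "c"], "list3": ["a"]}
hub, auth = hits(toy_links)
print(sorted(auth, key=auth.get, reverse=True)[:1])  # the page every list points at
```

Each pass costs one sweep over all the links, which is the cpu time the post is worried about.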

-jef
Re:I wish! by Anonymous Coward · 2002-03-08 00:11 · Score: 0

after all, who links to a competitor?
Pornographers.
<duck!>
Re:I wish! by ncstockguy · 2002-03-08 04:00 · Score: 1

Just because this search approach would not work with lousy commercial websites, does not mean it wouldn't be great for academic searches and the types of pages that DO link to related subjects. This could be a MAJOR step forward in the cause of online knowledge base building and serious research.
Re:I wish! by dasmegabyte · 2002-03-11 09:13 · Score: 2

Or it could be a worthless dead end.

At Internet World 2000 there were close to thirty companies offering new search engines, everything from voice controlled searching to variations on a miningco theme.

They're all dead. Not necessarily because their ideas weren't good...the best of these were eaten by google and altavista and lycos and are still around. They're dead because they offered nothing new to the searching public -- no better results, no improved searching. Nothing but good ideas listing no pages with a buggy interface. The searching public has no tolerance for buggy code or crummy results. This tool WILL be a nest of crummy links until they figure out a clever way to omit them...and by that time, we'll have already given up

--
Hey freaks: now you're ju

Useful as second try by FurryFeet · 2002-03-07 08:38 · Score: 1

The way I see it, this would work extremely well to find the most obscure references and complete information about a subject, but would be pretty bad for general purpose searching (a.k.a. googling). A subject like "cosplay" certainly tends to create a community of pages, but search for something like "distributed computing" or "operating systems".
If a search engine comes out of this, I think I'll first google for whatever I want, and if I can't find it/come out with too little info, I'd expand my search into this "communi-search"

But if I'm out searching by Anonymous Coward · 2002-03-07 08:39 · Score: 0

for Britney's Breasts (aka Brit's Tits)
I don't want to get stuck in an infinite loop of teeny-bopper Britney webrings.

Community/cross linked pages or not, they aren't relevant to my search.

Web Rings by DCram · 2002-03-07 08:39 · Score: 1

It seems to me that this is why web rings were started. The other thing that concerns me is what about the result speed. First you have to find a "good" page to start from and then follow its links (and we all know they can get off topic really fast). Then if that one doesn't link to good pages start from another "good" page. Nasty cycle.

I do believe it is a good idea but the person that thinks that all relevant pages link to more relevant pages has been taking more than harmless smoke breaks.

--
If I were only smart enough to accomplish the things I dream about.. Or maybe too dumb to care.

Some issues on linking. by Restil · 2002-03-07 08:40 · Score: 5, Informative

Google pioneered the use of links to deduce pages' relevance. Its PageRank technology counts a link from site A to site B as a vote for B from A. But it does not take account of all the other sites to which A has links, as NEC's new technique does.

I won't pretend to know all the inner workings of google's search engine technology. But I believe that google DOES care about other links from site A. This falls into the hub and authority model, which is defined recursively. A hub is a site that links to a lot of authority sites. An authority site is a site that is linked to by a lot of hubs. Basically, authorities provide the content, and hubs provide links to the content. In this example, B is an authority site, and A is a hub.

The way the ranking works, is that if B is linked to by a large number of quality hub sites, then it has a respectively large quality rating. Likewise, if a hub links to a large quantity of high quality authority sites, then its quality will also be ranked highly as a result.

This also allows Google to provide links to sites even if the search terms don't match the content of that site. A hub that links to a lot of sites about cars will relate cars to ALL the links regardless if the word "car" is included on the site that is provided.

Of course, I'm not THAT familiar with google. It's possible I'm full of bunk. But I'm pretty sure it works this way to some extent and that google does pay attention to the hub based links.

-Restil

--
Play with my webcams and lights here

Re:Some issues on linking. by AeiwiMaster · 2002-03-07 10:54 · Score: 1

No, what you describe is the HITS algorithm.
Which is more sensitive to than the PageRank algorithm google uses.

Let h, a , p be the hub, authority and page vector.

Let L be the link matrix.

Hits: init h,a
while ( not convergence)
{
a=Lh;
h=L^(-1)a;
}

Pagerank:
init p;
while ( not convergence)
{p=Lp
}

this converges real fast to the primary
eigenvector of L;
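
A runnable sketch of the p=Lp loop, for the curious. It assumes the usual published form of PageRank rather than anything known about Google's production code: each page splits its rank evenly over its outgoing links, a damping factor (0.85 here) mixes in a uniform jump, and pages with no outgoing links spread their rank over everybody. The toy graph and the names are hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links maps each page to the list of pages it links to (hypothetical toy data)."""
    pages = sorted(set(links) | {dst for targets in links.values() for dst in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Every page starts with the uniform "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            # A page without outgoing links hands its rank to all pages (assumption).
            targets = links.get(src) or pages
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # "while not convergence" from the pseudocode
            break
    return rank

toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
print(max(ranks, key=ranks.get))
```

The ranks always add up to 1 here, so the loop needs no separate normalisation step.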

Knud
Re:Some issues on linking. by AeiwiMaster · 2002-03-07 10:56 · Score: 1

It should be: more sensitive to spam.
Re:Some issues on linking. by Anonymous Coward · 2002-03-07 11:32 · Score: 1, Informative

minor nitpick :

Hits: init h,a
while ( not convergence)
{
a=Lh;
h=L^(-1)a;
^^^^^^^^^^^^^
}

should really be
h = L^T a;

Another name for this is the
Kleinberg algorithm. i hope the parent gets
modded up since i have seen many people
mix up the page rank algorithm with
kleinberg's.
the hub and authority model is more elegant
IMO than the page rank algorithm which
doesn't have as great an intuitive justification
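
A short worked version of why the transpose is the right fix. It assumes the convention implied by a=Lh above, namely that L_{ij} = 1 when page j links to page i; the posts never state the convention, so treat that as an assumption.

```latex
\begin{aligned}
a^{(k+1)} &= L\,h^{(k)}, \qquad h^{(k+1)} = L^{T} a^{(k+1)} \\
\Rightarrow\quad a^{(k+2)} &= L\,L^{T} a^{(k+1)}, \qquad h^{(k+1)} = L^{T} L\, h^{(k)}
\end{aligned}
```

With a normalisation after each step, a and h head toward the principal eigenvectors of L L^T and L^T L respectively. Unlike L^{-1}, the transpose always exists, even when L is singular.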
Re:Some issues on linking. by halflinger_n · 2002-03-07 12:25 · Score: 1

Google essentially provides what librarians and researchers call a citations index.
These things have been around for ages in the print world. I'm sure anyone who's ever done serious research can tell us how useful they are. (certainly anyone who has ever used Google can)
The description of the "new" search technology in the main story sounds like a "dumbed down", blunted or blurred version of the citations index concept to me - but maybe I'm missing something.
Re:Some issues on linking. by AeiwiMaster · 2002-03-08 02:40 · Score: 1

Thanks for correcting me.
I posted late at night.

I do not agree that Kleinberg/HITS is
more elegant.

It is easier to spam than PageRank.

Say you need your website to be a good
authority.

Then you make a page which links to all the
good authorities + your own page.
That it links to all the good authorities makes
it a good hub but that it also links to your website makes your website a good authority.
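
The trick above can be checked on a toy graph. This is a minimal sketch using a plain hub/authority iteration; the page names, the tiny graph and the iteration count are invented for the demonstration, and a real engine would presumably have defences that this ignores.

```python
from math import sqrt

def authority_scores(links, iterations=50):
    """Plain hub/authority iteration; links maps page -> pages it links to."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 0.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
        norm_a = sqrt(sum(v * v for v in auth.values())) or 1.0
        norm_h = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / norm_a for p, v in auth.items()}
        hub = {p: v / norm_h for p, v in hub.items()}
    return auth

# Hypothetical web: two honest hubs, two good authorities, one unlinked site.
honest = {"hub1": ["auth1", "auth2"], "hub2": ["auth1", "auth2"], "mysite": []}
# The spammer adds one page linking to the good authorities plus their own site.
spammed = dict(honest, spamhub=["auth1", "auth2", "mysite"])

before = authority_scores(honest)["mysite"]
after = authority_scores(spammed)["mysite"]
print(before, after)
```

Before the spam page exists, nothing points at mysite and its authority score stays at zero; afterwards it inherits the hub score the spam page earned by linking to the good authorities.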

Knud

More Info on Extracting Macroscopic Information by LuxuryYacht · 2002-03-07 08:41 · Score: 2, Informative

Here are a few papers that better describe the rank technology involved:

http://www.cindoc.csic.es/cybermetrics/articles/v5i1p1.html

http://www.scit.wlv.ac.uk/~cm1993/papers/2001_Extracting_macrosopic_information_from_web_links.pdf

.

--
Quidquid latine dictum sit altum viditur

Efficient Identification of Web Communities by headwick · 2002-03-07 08:42 · Score: 2, Informative

Here is the research working paper that goes into detail.

--
~ fact is not dependant upon your belief therein. ~ ~ Have I therefore become your enemy because I tell you the truth?

I feel bad for Disney... by Anonymous Coward · 2002-03-07 08:43 · Score: 5, Funny

Since so many of the Adult sites seem to have their "Please leave now..." links pointed at disney.com or nickelodeon.com or something.... will they end up in the adult communities? :)

Not that I would know...

Re:I feel bad for Disney... by wayne · 2002-03-07 09:11 · Score: 3, Funny

In fact, doing a google search on "please leave now" returns Disney as their second hit.

--
SPF support for most open source mail servers can be found at libspf2.
Re:I feel bad for Disney... by Coward,+Anonymous · 2002-03-07 09:26 · Score: 1

In fact, doing a google search on "please leave now" returns Disney as their second hit

So does leave it in me
Re:I feel bad for Disney... by Jaycatt · 2002-03-07 09:29 · Score: 1

Interesting indeed that "get the hell out" returns Disney as their FIRST hit.

--
"Shared pain is lessened; shared joy is increased. Thus we refute entropy" - Spider Robinson
Re:I feel bad for Disney... by Galvatron · 2002-03-07 10:31 · Score: 1

"it" and "in" are ignored by google, so it's just "leave" and "me," not necessarily in any particular order.

--
"The question of whether a computer can think is no more interesting than that of whether a submarine can swim" -EWD
Re:I feel bad for Disney... by Coward,+Anonymous · 2002-03-07 18:24 · Score: 1

"it" and "in" are ignored by google, so it's just "leave" and "me," not necessarily in any particular order.

It still affects ranking since leave it in me returns different results from leave me and if you insist on forcing the words in you can search for +leave +it +in +me which still returns yahoo number one and disney number two (+leave +it +in +me +baby returns an interesting number one site)

Clustering by harmonica · 2002-03-07 08:44 · Score: 5, Informative

Clustering pages is what other search engines like Teoma are doing already.

In a recent interview in c't magazine, a Google employee (Urs Hölzle) said, when asked about clustering, that they had tried that a long time ago, but they never got it to work successfully. He mentioned two problems:
- the algorithms they came up with delivered about 20 percent junk links for almost all topics
- it's hard to find the right categories and give them correct names, esp. for very generic queries

Of course, just because Google didn't get it to work properly doesn't mean nobody else can. But it's harder than it looks, and it's been known for quite a while.

Difficulty of Classification by gnugnugnu · 2002-03-07 08:47 · Score: 3, Interesting

This sounds like a useful idea and could give us better directory systems, but its utility would be limited. I'm sure there will be people poking holes in this algorithm in no time. Slashdot has an odd mix of subjects loosely tied together. News for Nerds is not a very strict group. Classification and grouping is a hard problem. There is no clear black and white, there are many shades of gray.

Interesting, but not as interesting as Google Bombing

--
Guilty

Target Marketing by DeadBugs · 2002-03-07 08:49 · Score: 0, Troll

Many sites also have advertisements that are related to the web site content usually in the form of pop-up ads or banners. Some are now trying to implement an insane subscription fee to avoid the advertisements

--
http://www.kubuntu.org/

Actually good filter by Teh+Grammar+Patroll · 2002-03-07 08:52 · Score: 0

A better way to state this might be:

"The article also points out, this kind of filtering can provide more useful content, as compared to today's text-based filtering."

Routers by Perdo · 2002-03-07 08:53 · Score: 3, Interesting

The internet is the world's best source of information and while transport of that information is built in, organization of that information is not. We have only half an internet.

We will really know what is out there on the net when Cisco includes a search function in their routers. Distributed searching. Access to over 90% of the world's data. Anonymous usage statistics. Person X searched for data (a) and spent the most time at www.example.xyz. Cross reference it all and include hooks into TCP/IP V.x for cataloging search, usage and content statistics.

A website might contain information about leftover wiffle-waffles. That website sends that same data 1000 times an hour to end users. I want the router to pipe up "1000 unique page views for leftover wiffle-waffles" while somewhere else a router says "500 unique page views for leftover wiffle-waffles". So when I do my search, I get 2 hits, most popular and second most popular.

Why incorporate it into TCP/IP? What good is moving all that data if it is just a morass of chaos? Let that which transports it also serve to catalog it. Currently, user data's content is transparent to TCP/IP. But if I wanted my data to be found, I could enclose tags that would allow the Router to sniff my data, ensuring my data was included in the next real time search.

--

If voting were effective, it would be illegal by now.

Exploiting search engines that rank popularity by Violet+Null · 2002-03-07 08:54 · Score: 5, Interesting

Interesting article here at http://www.operatingthetan.com/google/ about how the Church of Scientology exploits google's ranking system.

The basic gist is that google flags pages as more important (or higher relevance) if they have more links pointing to them...so the CoS makes thousands of spam pages that point at its main pages. Google sees the thousands of links, assigns the main CoS pages a high relevance, and thus they're the first to come up in any scientology-related search.

The moral being, for any new cool search technique devised to help fetch more relevant content, there'll be someone out there looking for a way to defeat it.

Re:Exploiting search engines that rank popularity by tiltowait · 2002-03-07 09:08 · Score: 5, Informative

Did you read the update on the page, or are you just parroting the previous +5 post on this?

Since this was first brought up a few days ago, the Scientology volunteer editor at the Open Directory Project, an upstream content provider for Google, was fired.
Re:Exploiting search engines that rank popularity by ari{Dal} · 2002-03-07 09:49 · Score: 1

the CoS makes thousands of spam pages that point at its main pages

Fascinating. CoS is taking its clues from pr0n spammers in its search for higher rankings... gee, wonder how they stumbled onto THAT little idea?

I bet browsing through the images of CoS browser caches with ACDSee would produce some entertaining (and disturbing) results

--
Moral indignation is jealousy with a halo - H. G. Wells
Re:Exploiting search engines that rank popularity by Anonymous Coward · 2002-03-07 10:42 · Score: 0

FYI his post wasn't redundant. He posted at 3:54 pm. The other person posted at 4:49 pm.
Re:Exploiting search engines that rank popularity by Agent137 · 2002-03-07 11:43 · Score: 1

Google only counts links from pages that have a PR of 3 or higher (see google toolbar to get page rank - PR). Also, pages from the same web site, same IP address or even an IP address very close do not count much for page rank if at all. Spammers have a hard time exploiting a PR-based linking system
Re:Exploiting search engines that rank popularity by PureFiction · 2002-03-07 12:59 · Score: 2

There is a Google Experiment underway to see if search results for Scientology can be reclaimed.

Only a few thousand more links needed... :-)
Re:Exploiting search engines that rank popularity by Mattygfunk · 2002-03-07 14:50 · Score: 2

The page linked to has a lot of links to the offending sites. You have to wonder whether it is part of the scam.
Re:Exploiting search engines that rank popularity by Anonymous Coward · 2002-03-07 21:37 · Score: 0

That is entirely the point with this "new" algorithm. Which I've previously posted about on /. over a year ago at least. If I can come up with something like that, it's nothing special. It just takes time to implement properly.

Say you don't want Scientology pages to turn up first. You rank their main pages to -1 or whatever low ratio you want. Then this ratio will affect hub-pages whichever way you want: subtle or destructive (ie, one link and you're toast). Unless these hub-pages link to more pages with very good ratio.

The REAL trick is to balance all this, hopefully in a mathematically and statistically proven way.

Btw, one simple ratio is also very bad. It would be better to categorize it into several different content-specific ratios. Community-search can help you find related pages quickly, but as always, NEVER 100% correctly. That way, a user might choose to rate up or down all CoS pages, porn etc.

Of course, any scheme can be defeated by the ego-filled mind and lots of time.
Re:Exploiting search engines that rank popularity by tiltowait · 2002-03-08 01:55 · Score: 2

>FYI his post wasn't redundant

Yes it was, coward. If you're going to whore with this method, at least stay signed in to defend your tactics.
Re:Exploiting search engines that rank popularity by Violet+Null · 2002-03-08 05:12 · Score: 2

Yes it was, coward. If you're going to whore with this method, at least stay signed in to defend your tactics.

Hrmm. What to say?

1) I have no need to whore, being at 50 karma for, oh, I dunno, the past half year or so.

2) The AC post wasn't mine, which is more or less impossible to prove, but I thought I'd state it for the record.

3) I hadn't read the comment you referenced, but had been referred to it from a newsgroup I visit elsewhere.

Any other unbiased assumptions you made that I should address?

Not always true. by www.sorehands.com · 2002-03-07 08:55 · Score: 2

A majority of links may go to related information depending on the type of page. One of my sites contains links to mostly related information. But there are links to humor (because we all need a laugh) and some links to Amazon. My other site contains my resume, which has links to companies I worked for. I also have some links to articles that I have written. Since this site is more than one subject area, it may hold more true. Porn sites link to cyberfilters (Surf Patrol, Cyber Patrol, etc.), so you will not find many offsite links related to porn.

So, this only holds true with focused sites. Using links, but then checking the links based on text would be useful, but not just links alone.

--
Fight Spammers!

Degrees of separation? by delphin42 · 2002-03-07 08:58 · Score: 2, Interesting

I don't really buy into the idea that linked pages will necessarily be related to the same subject. Look at sites like slashdot or cnn, which link to a variety of pages in totally disparate subjects. If you applied transitivity, then you'd end up with every connected page on the web being on the same subject. Page A links to B, so B is probably on the same subject as A. Page B links to page C, so C is probably on the same subject as B, and therefore A as well...Oops. The task of categorizing pages, unfortunately, may always have to be done by humans to be done well. This type of software might help categorizing websites, but it won't replace wetware anytime soon.
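
The transitivity worry is easy to see on a toy graph. A minimal sketch, with invented page names: if "linked to" is treated as "same subject" and followed all the way, the subject of the starting page swallows everything reachable from it.

```python
from collections import deque

def same_subject_closure(links, start):
    """Naively treat every link as 'same subject' and follow it transitively."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical pages: one general-interest hub is enough to merge the topics.
toy_links = {
    "linux_howto": ["slashdot"],
    "slashdot": ["teddy_bear_hub", "pgp_press_release"],
    "pgp_press_release": ["crypto_faq"],
}
print(sorted(same_subject_closure(toy_links, "linux_howto")))
```

Any scheme along these lines presumably has to stop the closure somewhere, for example by requiring many links inside a group and few leaving it, which is a stronger condition than plain reachability.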

--
-- Adam

Hmmm, so it's a "Web"? by Fencepost · 2002-03-07 08:58 · Score: 3, Interesting

So it's actually working on the basis of webs of related sites - not a novel concept, but useful.

I suspect that some of the commercial knowledge management tools have been doing something much like this for some time, and TheBrain.com has had a product to manually build this kind of network of clusters for some time. The key thing about this is that with web indexing/cataloging the information needed to do the automatic linking is available.

TheBrain.com seems to have a working demo of using it for the Web at WebBrain.com based on the Open Directory Project. It's not a great example because of display limitations that don't really let you see more than one cluster of information at the time, but it's one example of the general concept. Once you dig down in an area you can see how it shows links between related categories as well.

(Note: the demo above says it requires Java 1.1 and IE 4.01 or Netscape 4.07+, to bypass that test try here. Seems to work fine in Netscape 6.2, and will probably be OK in Mozilla if the JRE is available.)

--
fencepost
just a little off

why not a 3d search engine? by Lumpy · 2002-03-07 09:02 · Score: 5, Interesting

How about displaying the hits as dots in a 3D image, with the zero point of xyz being the absolute most significant, and then spreading out the hits from there? That way we can zoom in on what interests us in the search.

I.E. search for linux apache router

linux is one axis, apache is another and router is a 3rd. The more relevant the pages are in that context, the closer to zero they will show up, while linux apache donuts will resolve close to zero on the XY but be way out on the Z axis..

--
Do not look at laser with remaining good eye.

Re:why not a 3d search engine? by John+Harrison · 2002-03-07 09:22 · Score: 2

Your idea is really interesting. You would have to specify which three parameters (or group the parameters) to use for the graphing. You would also need a way to visualize and rotate the data.
One of the nice things about Google and other current search engines is that you can easily look at the context in which the search term occurs and determine if the link is relevant. I think this would be harder to do in 3D. It would be nice if you were able to weight your search terms (scale of 1-10?) on Google. That might accomplish the same goal as what you want without the 3d niftyness.

--
Lasers Controlled Games!
Re:why not a 3d search engine? by swillden · 2002-03-07 11:26 · Score: 2

How is the graphical display better than just calculating the distance each hit is from the origin, and using that distance as a ranking metric? That would give you exactly the set of points closest to the origin. If one of your search terms is less important to you (meaning you'd be willing to look at pages further out on that term's axis), then maybe what you need is a way to specify that that particular term should have less effect on the ranking. That approach would be extensible beyond three search terms, as well.
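
The distance-from-origin ranking, with a per-term weight, fits in a few lines. This is only a sketch: the coordinates stand for "how far from zero a page sits on each search term's axis", and both those numbers and the page names are invented for the example.

```python
from math import sqrt

def rank_by_distance(hits, weights=None):
    """hits maps url -> per-term distances from zero; a smaller total ranks higher."""
    def distance(coords):
        w = weights or [1.0] * len(coords)
        return sqrt(sum(wi * c * c for wi, c in zip(w, coords)))
    return sorted(hits, key=lambda url: distance(hits[url]))

# Axes: (linux, apache, router). All values are hypothetical.
hits = {
    "linux-apache-router-howto": (0.1, 0.1, 0.2),
    "linux-apache-donuts": (0.1, 0.1, 0.9),
    "router-only": (0.8, 0.9, 0.1),
}
print(rank_by_distance(hits))                     # all terms equally important
print(rank_by_distance(hits, (0.1, 0.1, 1.0)))   # linux and apache matter less
```

Nothing in this limits it to three terms, which is the extensibility point made above.
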
What I'd really like to have is the ability to specify that terms need to be near one another within the document. Sometimes using quotes to delimit a phrase works, but often the words can appear in various orders and various ways, but if they're all close together there's a very high chance that the text is discussing the stuff I'm interested in.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:why not a 3d search engine? by ender81b · 2002-03-07 11:30 · Score: 2

There is quite a bit of research out there that suggests that most people find 3d navigation impossibly hard to understand/use. Here are some relevant links:

Summary of book on Web Usability

Why 3-D navigation is bad (People aren't frogs)

Not to mention how do you display such things to a reader who is blind or has any other type of disability... the list goes on. Beautiful idea, but not very good in practice.
Re:why not a 3d search engine? by Lumpy · 2002-03-08 05:00 · Score: 2

and we can add another dimension to the fray..

color to make a 4th,5th and 6th dimensions in it. R,G,and B components can also be added... giving you a 6d search engine... not only are the ones closest to center what you are after, but the ones that are white. (invert it for the politically correct people)

--
Do not look at laser with remaining good eye.

We Need A Search Engine For Breaking News by Anonymous Coward · 2002-03-07 09:04 · Score: 0

What I would like is for search engines to update more often so that you can find breaking news with the search engine.

This is not new work by Anonymous Coward · 2002-03-07 09:05 · Score: 5, Interesting

This work is not new. In fact I submitted my undergraduate thesis on this topic. The roots of this work are really in citation analysis, where the idea is that references that are highly cited are high quality references (this is the idea that Google is built around). Extending this to the web, a "reference" can be thought of as a "link" and you can generalize the hypothesis to the idea: "similar works link to each other" and therefore you should be able to find communities of similar documents by following links within documents.

Intuitively this seems reasonable and in practice this is often the case when there is no conflict of interest for a document to link to another document (as in the case of researchers linking to other works in their field). Yet, often this is not the case when there *are* conflicts of interest (a pro-life site will probably not link to a pro-choice site; BMW will probably not link to Honda or any of its other competitors). Therefore, since the truth of the hypothesis that "similar documents link to each other" is not clear, I worked to test this very idea.

To do this I used The Fish Search, Shark Search, and other more advanced "targeted crawling algorithms" that take connectivity of documents into account (as is discussed in the Nature article). These algorithms often go further than just using the link relationship: they also take the text of the link itself and the text surrounding the link into account when choosing which links are the "best ones" to follow, so that a community of related documents can be discovered in a reasonable amount of time (you'd have to crawl through a lot of documents if documents have as few as, say, 6 links per page on average! Choosing good links to follow is crucial for timely discovery of communities).

The conclusion of my thesis was that it is (unfortunately) still not clear whether the hypothesis holds. I only did this work on a small subset of web documents (about 1/4 million pages) so perhaps a better conclusion would be reached by using a larger set of documents (adding more documents can potentially add more links between documents in a collection).

What I did discover, however, was that if document communities do exist, you have a statistically good chance of discovering a large subset of the documents in the community by starting from any document within the community and crawling to a depth of no more than 6 links away from the starting point. (This turns out to be useful to know so that your crawler knows a bound on the depth it has to crawl from any starting point). Moreover, if you have a mechanism for obtaining backlinks (ie. the documents that link to the current document) you can discover even more of the community...
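
A minimal sketch of the depth bound and the backlink idea, run on an in-memory link graph instead of the live web. It is a plain breadth-first crawl, not Fish Search or Shark Search (those also score link text, which is left out here), and the function names and the toy chain of pages are invented for illustration.

```python
from collections import deque

def crawl_community(links, start, max_depth=6):
    """Breadth-first crawl that stops max_depth links away from the start page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # bound reached, do not follow this page's links
        for nxt in links.get(page, ()):
            if nxt not in depth:
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return depth

def with_backlinks(links):
    """Add a reverse edge for every link, as if a backlink lookup were available."""
    both = {page: set(targets) for page, targets in links.items()}
    for src, targets in links.items():
        for dst in targets:
            both.setdefault(dst, set()).add(src)
    return both

# Hypothetical chain p0 -> p1 -> ... -> p8.
chain = {f"p{i}": [f"p{i + 1}"] for i in range(8)}
print(sorted(crawl_community(chain, "p0")))                  # stops at p6
print(sorted(crawl_community(with_backlinks(chain), "p4")))  # reaches both ends
```

Starting from p4 without backlinks only the pages downstream of it are found; with backlinks the pages that link toward it turn up as well.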

No, this is not the shiny new thing... by Anonymous Coward · 2002-03-07 09:08 · Score: 2, Informative

ISI has been doing this for years with their databases. You look at a research paper, and jump around by what it cites and what cites it. It's good stuff, helps you find research that's related to what you're doing that you'd have never thought to actually search for.

The idea predates Google, it probably predates you. They did it in print, way back when.

Um, because your monitor isn't 3D? by tiltowait · 2002-03-07 09:12 · Score: 2

Ug, a 2D web directory is confusing enough.

Re:Um, because your monitor isn't 3D? by Anonymous Coward · 2002-03-07 09:25 · Score: 0

So this is why nobody plays Q3A?

I've used 3-D data models before when modeling relationship networks, and they're really pretty effective, after the user has a burn-in period to get used to them. But there is a core difference; in a relationship model, you're interested in the structure of the network itself, whereas in this case, it would just be a way of getting to the data beyond it. Its effectiveness would probably be dimmed by that alone.

-Baka!

Many web-marketing businesses based on this by SimplyCosmic · 2002-03-07 09:13 · Score: 3, Interesting

The CoS isn't the only one that tries to use this technique in order to make its sites rise in search engine ratings.

There are a number of those "Get More Hits For Your Website Cheap!" sites which try to do so by getting member sites to download an html file which contains links to most of their members, and then have you link this from your own site.

Much like a pyramid scheme, as new members join they get the same file with links to your site, thereby increasing the number of sites with links to you and possibly raising your position in search engines.

yeah but by Anonymous Coward · 2002-03-07 09:13 · Score: 0

It'll suck when you do a search for slashdot and it returns links for goatsecx

This Could Actually Help Enhance Accuracy by FreeUser · 2002-03-07 09:15 · Score: 5, Interesting

You make a very interesting point:

Problem with this: "most" websites do not link to sites with similar content. Most websites link to "partner" sites that have nothing in common with them -- after all, who links to a competitor?

Good websites link to similar sites -- academic websites link to similar sites and sources.

Combine the algorithm described in this article with google's approach (or some other contextual approach to determining relevance) and you not only have a way of identifying "communities," you have a way of easily identifying "marketdroid mazes of worthless links" as well.

Since the content of most marketdroid sites is usually next to worthless, the hits for a given search could be ordered accordingly. Sites, and groups of sites, that clearly form communities related to the topic you're interested in at the top, single websites as yet to be linked to somewhere in the middle, and marketdroid "partner" sites at the very bottom.

This would actually produce better, more useful results than either approach alone.

--
The Future of Human Evolution: Autonomy

Re:This Could Actually Help Enhance Accuracy by Shotgun · 2002-03-07 09:58 · Score: 3, Interesting

A decent implementation of the algorithm would search for and rank pages as it currently does, but then 'communitize' the results. If result 5 is in the same community as result 2, don't put 5 on the page. Instead, change result 2 to point to another page that will list most of the community that 2 and 5 share, and then increase the ranking of 2 or just list the 'communities' first in the result. This will greatly shorten the results that must manually be looked at by categorizing them.
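A rough sketch of that 'communitize' step, assuming the engine already has a ranked result list and some page-to-community mapping (the `community_of` dict and the page ids here are made up for illustration, not anything from the article):

```python
from collections import OrderedDict

def communitize(ranked_results, community_of):
    """Collapse a ranked result list so each community shows up once.

    ranked_results: page ids, best first.
    community_of:   hypothetical dict mapping page id -> community label;
                    pages missing from it are treated as their own group.
    Returns a list of (best_page, other_members_found_in_results).
    """
    groups = OrderedDict()
    for page in ranked_results:
        label = community_of.get(page, ("solo", page))
        groups.setdefault(label, []).append(page)
    # The first page seen in each group is the highest ranked one.
    return [(members[0], members[1:]) for members in groups.values()]

results = ["p1", "p2", "p3", "p4", "p5"]
communities = {"p2": "perl-dbi", "p5": "perl-dbi", "p3": "mysql-docs"}
collapsed = communitize(results, communities)
print(collapsed)
```

Here result 5 is folded under result 2 because both carry the same community label, which is the behaviour described above. Boosting the rank of collapsed entries is left out.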

--
Aah, change is good. -- Rafiki
Yeah, but it ain't easy. -- Simba

A goal. by CrazyBrett · 2002-03-07 09:15 · Score: 2

What I'd really like to see is a search engine that can differentiate (reasonably well) between sites with information and sites that are trying to sell you something. It seems that whenever I'm looking for a good info page about X, all I can find is someone trying to sell me X, and vice versa.

I would imagine that using modern search engine techniques, one would be able to determine what commercial pages "generally" look like, and what informational pages "generally" look like, and categorize appropriately. If you used a learning neural network, you could even accept user ratings on specific search results and use that to fine-tune the algorithm.

Re:But.... what about ad servers? by redcup · 2002-03-07 09:16 · Score: 2, Insightful

What about doubleclick? Their servers link to anything and everything that nobody finds interesting!

I think it's a great concept that will make lesser known content accessible to the average user. Instead of spending almost all their online time on a few huge sites (AOL, MSN, CNN, and a few other media giants), we can jump to a page with the same topic but no advertising budget. But how do you rank and order the list of members? Traditional text search? Even if a community has only a few hundred members, few users will go to page five in the list to find a site. Admittedly, it's only a matter of time before you can pay to be listed at the top of the community membership, instead of a random listing.

And like all good ideas, this system wouldn't be free of abusers. People could always spam their page with links to major sites using single pixel clear gifs, thus making their page a part of any community I wanted. So it becomes a process of "give me sites with links like this page, but not links like the following black hole listed pages." Useful for filtering content (for good or bad reasons).

--

RC

Damn! by guinsu · 2002-03-07 09:16 · Score: 2

I actually came up with a similar idea 3 years ago. Not that anyone will believe me, but kudos to the people who actually put some work into it instead of me sitting on half an idea.

More information by Anonymous Coward · 2002-03-07 09:18 · Score: 0

See http://webselforganization.com/ for more details and examples.

assume = ass of u & me by Anonymous Coward · 2002-03-07 09:18 · Score: 0

So it's safe to assume that webmasters never link to other interesting pages, pages that don't necessarily have anything to do with that particular topic.

I don't think so.

Just what we need by ab0mb88 · 2002-03-07 09:18 · Score: 1

Instead of getting a porn link in every search, now you get counter links and online casino links, this is a great improvement.

I guess when we get bored though we can search for search engines and watch the system grind to a halt.

Did you see by Anonymous Coward · 2002-03-07 09:20 · Score: 0

the ad (Visa checkcard?) where Kevin Bacon forgets his ID and brings in six people who know each other until the last one knows the store clerk? I need an MPG of that one.

DON'T feel bad for Disney!!! by PaxTech · 2002-03-07 09:21 · Score: 2

I believe feeling bad for Disney constitutes a thoughtcrime on Slashdot right now.. ;)

--
All movements for social change begin as missions, evolve into businesses, and end up as rackets.

Privacy Violation by LarryRiedel · 2002-03-07 09:21 · Score: 1

How long before someone claims that organizing and using information which relates various sites this way is some sort of privacy violation?

N2H2 by Anonymous Coward · 2002-03-07 09:21 · Score: 0

"The first application of community searching may be to fence off areas of the web such as pornography or hate-speech communities, says Flake." N2H2 has been doing exactly that for a couple of years now.

More information by Anonymous Coward · 2002-03-07 09:22 · Score: 0

See http://webselforganization.com/ for the research paper, more details, and examples.

Explanation of the joke by Wire+Tap · 2002-03-07 09:22 · Score: 3, Informative

For anyone out there who doesn't quite know why this is +5 worthy, here is the joke:

On Super Bowl Sunday a commercial aired, featuring none other than Kevin Bacon at a retail store, trying to use a check to pay for his goods. The man behind the counter asked to see ID, but Bacon didn't have any on him. What now? Bacon runs around town gathering people (an extra he played in a movie with, a doctor, a priest, an attractive girl, and maybe one other guy?), who all had some ties to one another, through the other 6 in the group. The attractive girl once dated the sales clerk in the store, so Kevin explains that they are "practically brothers," hence putting to good use the principle of 7 degrees of separation.

Therefore, the humor lies within. :) This is, of course, a very pop-culture oriented joke that will probably fade even more quickly than AYB did after its behemoth prime of last year and the December before. Long live the meme.

--

Man is born free; and everywhere he is in chains.

Re:Explanation of the joke by jonnythan · 2002-03-07 09:28 · Score: 2

"Six Degrees of Kevin Bacon" has been around, and in pop culture, for many many years.

Methinks you just didn't know about the inspiration for the commercial, which does surprise me ;)
Re:Explanation of the joke by LoveShack · 2002-03-07 09:29 · Score: 1

Actually, the commercial itself plays off of the well-known "Six Degrees of Kevin Bacon" which states, in a paraphrased manner, that any actor (or other member of Hollywood, I suppose) can be linked back to Kevin Bacon in six degrees or less. This is, of course, just a specific example of the degrees of separation. But, the Bacon slant didn't "just show up"...I recall seeing a website that would actually show you any actor's link to Bacon several years ago. I guess it's still around...I dunno.
Re:Explanation of the joke by Anonymous Coward · 2002-03-07 09:29 · Score: 0

Since Six Degrees of Kevin Bacon has had a web site for a few years now, it obviously hasn't faded quicker than AYBABTU...
Re:Explanation of the joke by dan133 · 2002-03-07 09:32 · Score: 1

I don't think the poster was referring to the Super Bowl Commercial. 6 degrees to Kevin Bacon is a popular game that's been around a while. Basically, an actor/actress is picked and then in 6 steps or less a path has to be found to Kevin Bacon. This is of course, the reason behind the Super Bowl Commercial, however.

My girlfriend has been playing this every once in a while for years.
Re:Explanation of the joke by Wire+Tap · 2002-03-07 09:38 · Score: 1

Ahhh, you are indeed correct. I actually had heard of that a while back, but it didn't strike me as the source of that commercial... strangely enough. Thanks for reminding me. I think. :) I've never been particularly big on pop-culture. hehe

--
Man is born free; and everywhere he is in chains.
Re:Explanation of the joke by br0ck · 2002-03-07 10:13 · Score: 1

I think you mean the Oracle of Bacon.
Re:Explanation of the joke by Anonymous Coward · 2002-03-07 21:09 · Score: 0

What you say!!!
Re:Explanation of the joke by Anonymous Coward · 2002-03-08 00:35 · Score: 0

Someone set up us the bomb !!!

A great Website by Anonymous Coward · 2002-03-07 09:24 · Score: 0

http://www.genxius.com

no ad's
no banners
just geek content

I could mess this up badly... by SteamedGeek · 2002-03-07 09:27 · Score: 0, Troll

Just put up a dozen or two web pages on a dozen or two different free websites with links that point wildly all over the net, especially to pr0n. MUHUHAAAAAA

--
Life Sucks... Have a Beer and a Smoke then Smile Damnit!!!

"next generation over Google" my foot by augustz · 2002-03-07 09:27 · Score: 4, Insightful

Why do all the fanboys who swallow stuff that has yet to actually prove worthwhile in the real world mod comments lauding this stuff up?

It reminds me of all the graphics chip makers, computer chip makers, heck, even zeosync with their incredible breakthroughs. 90% of the time, when anyone takes a hard look at it it turns out to be a waste of time and money.

So, before proclaiming this the "next generation over Google" why not check to make sure google hasn't already thought of it and discarded the idea. Or that it won't lead to stupid circular clusters; 90% of the time I'm not interested in partner sites, but competitor sites. Is slashdot in the Microsoft cluster?

And above all, stop the judgement calls like "this is the next generation" unless you've got some special insight and qualification to make that call.

Re:But.... what about ad servers? by neuroticia · 2002-03-07 09:38 · Score: 2, Insightful

Hm. That's not the problem I see right off the bat-- yeah we'll have a bunch of people trying to scam their way to the top, but we have that now as it is. The problem that I see is that sites that link to a large number of sites, and that have a large number of sites that link to them will be considered part of more communities while sites with a lot of relevant interesting text but who have few links and few linkers will be considered to be part of fewer communities.

The issue of people creating mass pages of links could be resolved by "teaching" the engine to ignore sites that link to too many different threads, thus cutting out search engine directories, blogs, and other "topic-non specific" pages, or lumping them together as another category.

Sort of "If a page has x number of links to y number of topics then it can be considered for category z but if y is higher than the allowed number..."
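Something like this, maybe. A minimal sketch that assumes each outgoing link has already been tagged with a topic somehow (that tagging is the hard part and is not shown); the threshold and the labels are invented for the example:

```python
from collections import Counter

def classify_page(link_topics, max_topics=3):
    """Decide which category a page belongs to from its outgoing links.

    link_topics: one topic label per outgoing link (assumed to be given).
    max_topics:  hypothetical cutoff; pages linking into more distinct
                 topics than this are lumped together as directories.
    """
    topics = set(link_topics)
    if not topics:
        return "no-links"
    if len(topics) > max_topics:
        return "general-directory"
    # Otherwise file the page under its most frequently linked topic.
    return Counter(link_topics).most_common(1)[0][0]

print(classify_page(["perl", "perl", "mysql"]))
print(classify_page(["perl", "porn", "cars", "news", "games"]))
```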

Or something... Oh God. I need my caffeine.

-Sara

This is not a new idea by John+Harrison · 2002-03-07 09:45 · Score: 3, Informative

I will refer you to the Clever project at IBM. I first read about this years ago when Google was still a project at google.stanford.edu.

Clever does Google one better by separating the results of searches into "hubs" and content. Hubs are sites with lots of links on a particular subject. Content sites are the highly rated sites linked to by the hubs.

I thought it was a very interesting concept and I am surprised that it was not commercialized. Of course, IBM is in the business of buying banner ads rather than selling them. They could always do like /. and OSDN and mostly run ads for their own stuff though....

--
Lasers Controlled Games!

Re:This is not a new idea by 3am · 2002-03-07 10:21 · Score: 1

How do you know this is not how Google creates its search results? What you've described sounds exactly like how Google describes their technology:

"PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search. Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query."

http://google.com/technology/index.html
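For anyone who wants to see the "links as votes" idea in code, here is a small sketch of the textbook power-iteration form of PageRank. It is not Google's actual implementation; the damping factor of 0.85, the iteration count and the three-page toy graph are all assumptions made for the example:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links: dict mapping page -> list of pages it links to (toy data).
    Returns a dict page -> score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote everywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
scores = pagerank(toy_links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In the toy graph, C collects votes from both A and B, and A's only vote comes from the already "important" C, so a vote from an important page counts for more than one from an unlinked page like B.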

--

A: None. The Universe spins the bulb, and the Zen master merely stays out of the way.
Re:This is not a new idea by John+Harrison · 2002-03-07 11:57 · Score: 3, Informative

How do you know this is not how Google creates its search results? What you've described sounds exactly like how Google describes their technology:
I know because I have read about both technologies. I discussed the merits of Clever v. Google a few years ago with classmates that were taking the class at Stanford that spawned Google. That is how I know.
End of Rant
There is an excellent article on Clever that appeared in Scientific American a few years ago. It was linked to from the page I originally posted. You should check it out. Clever returns results divided into the categories of "hubs" and "authorities". I have never noticed Google doing that.
Here is an excellent summary from the article on the differences between Clever and Google:
Google and Clever have two main differences. First, the former assigns initial rankings and retains them independently of any queries, whereas the latter assembles a different root set for each search term and then prioritizes those pages in the context of that particular query. Consequently, Google's approach enables faster response. Second, Google's basic philosophy is to look only in the forward direction, from link to link. In contrast, Clever also looks backward from an authoritative page to see what locations are pointing there. In this sense, Clever takes advantage of the sociological phenomenon that humans are innately motivated to create hublike content expressing their expertise on specific topics.
Of course Google has tweaked their method since this article was written, however it has not become Clever.
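To make the hub/authority split concrete, here is a minimal sketch of the hubs-and-authorities iteration (Kleinberg's HITS, which as far as I know is what Clever builds on). The toy graph stands in for the per-query root set mentioned above; the page names and iteration count are assumptions:

```python
import math

def hits(links, iterations=50):
    """Hubs-and-authorities scores for a small link graph.

    links: dict mapping page -> list of pages it links to (toy root set).
    Returns (hub_scores, authority_scores), each normalized to unit length.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Looking backward: a good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Looking forward: a good hub points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

toy_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(toy_links)
print("best authority:", max(auth, key=auth.get))
print("best hub:", max(hub, key=hub.get))
```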

--
Lasers Controlled Games!
Re:This is not a new idea by 3am · 2002-03-07 13:18 · Score: 2

I have heard Jon M. Kleinberg give a technical talk about the search algorithm (or variant of) that I assume Clever is using while I was at Cornell. Let's not get in a pissing contest about credentials, we both obviously have exposure to this subject.

I simply assumed that Google used a similar algorithm, based on their description of it. Thank you for the link, it was informative.

--

A: None. The Universe spins the bulb, and the Zen master merely stays out of the way.

wouldn't work for .com sites by NathanBos · 2002-03-07 09:50 · Score: 1

I'd think this would be a lousy way to index commercial sites, which would avoid linking to each other. e.g. the one link you're sure not to find on Ford.com is Generalmotors.com!

--
Democracy is the worst form of government ever devised, except for all the others. -Winston Churchill

Too easy to "fool" the search engine. by Tigris666 · 2002-03-07 10:00 · Score: 1

I would see this as a bad thing, because I could easily confuse this "community" of related pages theory.

In fact I see a lot of pages on the web that would break this. A lot of people make their own sort of "bookmarks" page that they can get to from anywhere on the web, then use that to click links and go to their favorite web pages. These pages may not be related at all. E.g. I could have slashdot and my favorite porn page on the same bookmark page, not really related, but doing a search for slashdot would find my page, then see a porn page as being related?

Sounds about right :)

--
Kids, you tried your best and you failed miserably. The lesson is, never try. -- Homer J. Simpson

Oh no, I can just see it now by Skevin · 2002-03-07 10:09 · Score: 1

Your search for "mysql perl dbi" turned up 2.527e8 results.
Showing 1-10 of 2.527e8:

X10: the smallest spy camera for your computer.
Need a better mortgage quote? Click here.
Women: Lose 4 inches by next week.
Men: Gain 4 inches by next week.
Look closer. Not that close. Sorority Boys opening this Friday.
This link requires your browser to support the ISO-2022-CN Chinese Characterset to display properly. BlahBlahPorkShoulderBoneGroundUpAndStuffedIntoRectangularCansForMontyPythonToWriteSongsAboutBlahBlah.
Reliaquote: Save up to 70% on Life Insurance.
Take this short survey and win $100.
This link requires your browser to support the ISO-2022-KR Korean Characterset to display properly. BlahBlahPorkShoulderBoneGroundUpAndStuffedIntoRectangularCansForMontyPythonToWriteSongsAboutBlahBlah.
Learn the Secrets of Financial Independence in 30 days!

Solomon Chang

Office 2K/XP is the only MS product with accurate warnings during Setup: most of the packages tell me to "Run from My Computer".

--
"Twice half-assed makes an ass whole." --Solomon K. Chang

This is new? Hardly. by Anonymous Coward · 2002-03-07 10:09 · Score: 0

The principle of pages linking to similar pages was known aeons (well, as long as there has been hyperlinks) ago.

My site is within 6 degrees of Slashdot! by DeadVulcan · 2002-03-07 10:11 · Score: 1, Offtopic

Now that I've wasted WAY too much company time doing this, I must report that I was able to get from Slashdot to one of my personal web sites with only 6 jumps.

It wasn't six clicks, because I didn't count things like splash screens, click-through ads, and drilling into a site's "Links" page.

But I'm happy to report that I was able to get to my site with only five intervening web sites. I won't post actual links here, but you're welcome to try and follow along (if you're so inclined)...

Slashdot Article: "Nintendo GameCube Clone Out In Japan" (go to "Related Links")
IGN: GameCube DVD (go to "Affiliates" page)
Drew's Script-O-Rama (go to "Links" page, under "Friends of the Rama")
The Shishi-Odoshi Homepage of Ari-Matti Saren (go to "Some Links")
The International Shinkendo Federation (go to their "Links" page)
Nihonto Page of Alan Quinn (see his recommended links)
Japanese Castles - This is my page!

Of course, now I've posted, you can just hit the URL beside my name and get there in two clicks...

I should also say that this is not a silly attempt at advertising. I don't make any money off my web site. It's just a silly attempt at demonstrating the interconnectedness of the web, and the oneness of human knowledge (yeah, that's it).

--
Accountability on the heads of the powerful.
Power in the hands of the accountable.

Oracle of Bacon by harmonica · 2002-03-07 10:15 · Score: 3, Interesting

In addition to the other replies, here's the link to the Oracle of Bacon that lets you find out the degree of separation between Kevin and any other person who is featured at IMDB.

There is also a generic search that lets you combine any actor with any other actor. Unfortunately I have forgotten who the best-connected actor was (average to all other actors is smallest). Anyone?
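The lookup itself is presumably just a shortest-path search over "appeared in the same film" links. A minimal sketch with a breadth-first search follows; the cast lists are placeholder data, not anything from IMDB, and how the real Oracle does it is an assumption on my part:

```python
from collections import deque

def degrees_of_separation(cast_lists, start, goal):
    """Fewest co-star hops between two actors, or None if unconnected.

    cast_lists: dict mapping film title -> list of actors (placeholder data).
    """
    # Two actors are neighbours if they share at least one film.
    costars = {}
    for cast in cast_lists.values():
        for actor in cast:
            costars.setdefault(actor, set()).update(a for a in cast if a != actor)
    if start not in costars or goal not in costars:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, hops = queue.popleft()
        if actor == goal:
            return hops
        for other in costars[actor]:
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return None

films = {
    "Film 1": ["Actor A", "Actor B"],
    "Film 2": ["Actor B", "Actor C"],
    "Film 3": ["Actor C", "Actor D"],
    "Film 4": ["Actor E"],
}
print(degrees_of_separation(films, "Actor A", "Actor D"))
print(degrees_of_separation(films, "Actor A", "Actor E"))
```

Finding the "best-connected" actor would mean running this from every actor and averaging the hop counts, which is expensive but straightforward.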

Re:Oracle of Bacon by swillden · 2002-03-07 11:14 · Score: 3, Informative

Here are the top 1000. Number 1 is Christopher Lee (Saruman in FotR), probably largely because he's been in 228 films.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:Oracle of Bacon by Caraig · 2002-03-07 14:31 · Score: 1

Terrifyingly enough, Ron Jeremy is up there. In fact, he tends to beat out Kevin Bacon.

It's because Ron was in Reindeer Games and Ronin. I think those two movies had more "big name" actors than any other (recent) movie.

--
"I am an Adept of Tantric VAX."

Re:But.... what about ad servers? by Com2Kid · 2002-03-07 10:30 · Score: 1

Blogs are VERY specific.

How?

Simple, until a year or so ago I did not even know that such a thing existed. (well, unless you count /. , which I do not).

Blogs themselves exist in a sort of quasi-weird community that many users never stumble on to, they are very self referencing to one another and tend to link within themselves a lot.

Heck look at Keenspace (which many people use as a sort of weird graphical Blog. . . . ), comics on there always link from one keenspace comic to another, and those other comics then typically link back to the origin comic and ....

WOH, new idea, hold on, anyways. Toodles.

--
Need help treating your acne? Come here!

Unfavorable to E-commerce by version5 · 2002-03-07 10:37 · Score: 2, Interesting

This model doesn't favor businesses at all. If there's one thing a commerce-oriented website won't do, it's put a lot of links to their competitors. Depending on your point of view, this is a great thing. Unfortunately, many businesses believe that advertising means getting in your face as much as possible, there's no such thing as bad press, etc.

Amazon.com is an example of this: I bought a pair of speakers from them several months ago, and yet every time I go there, they helpfully inform me that they have these great new speakers on sale! Buy now! I suppose it works to recommend similar books and CDs, but when someone buys speakers, they usually stop being in the market for speakers after that.

Anyway, I don't know why no-one has thought of making an e-commerce-only search engine. I think there's a clear distinction between those two types of searching that warrants a separate engine. Sometimes you want to buy stuff, and sometimes you just want information. When you are doing one, the other just gets in the way, and disguising advertising as content like AOL/MSN/AltaVista do just discourages you from using their services. Obviously, web-based businesses have a long way to go before they actually realize, "Oh, Internet users don't like to be tricked! Maybe if we were straight-forward with consumers they'd be more trusting of us."

--

"It's Dot Com!"

What about for non-western cultures? by Com2Kid · 2002-03-07 10:41 · Score: 3, Interesting

Anybody else here (well, obviously, likely quite a few people) ever browse around foreign (to me at least. :P ) language sites?

The culture that exists there-in is definitely quite different.

Japanese sites are even MORE self referencing than American sites. This trend has taken off onto American sites though in the form of Cliques, which themselves tend to lie outside of the sites that many of us /.'ers typically frequent.

Seriously though, in Japan it seems that sites actually have others ask permission to link to them! (As an aside, whenever that topic is brought up here on /. people tend to get all freaked out about it. ^_^ )

This obviously creates a VERY different social structure that heavily alters the dissemination of information, not to mention the way that sites are linked together.

Here in the states (or any other culture that has pretty much a free linking policy) it is common to say "oh yah, and for more info go to this site over here and also this site here has some good information and and and . . . . "

Anybody who reads www.dansdata.com knows how he (uh, Dan obviously. :) ) likes to sprinkle relevant to kind of relevant links throughout his articles and reviews. Almost all of the links are VERY interesting and much can be learned from them (he does link to e2.com quite a bit though. :P ), and many of the links are to further outside resources on the same topic.

(such as a LED light review having links to the Online LED museum)

In a culture where linking is not so free, I would think that there would be more of a trend towards keeping a lot of the information in-house so to speak, and thus at the very least the biases that the search engine uses to judge relevance of links would have to be altered a bit.

Links would have to be given a higher individual weight, since there would be a larger chance of them being on topic.

--
Need help treating your acne? Come here!

Credit where credit is due by Quixote · 2002-03-07 11:22 · Score: 3, Interesting

Prof. Jon Kleinberg of Cornell did this work many years ago. IIRC, he was the first to come up with the idea (first published in 1997). Check out his list of publications for the work (and related stuff).
Disclaimer: I happen to know him, but this is not biased.

Question: by MbM · 2002-03-07 11:34 · Score: 2

Does this mean that doubleclick must be a wealth of information? There certainly is a number of sites linking to it...

--
- MbM

Q3A levels are 2 1/2 D by yerricde · 2002-03-07 11:38 · Score: 1

So [difficulty in navigating a 3D UI] is why nobody plays Q3A?

Quake III: Arena (original levels, no mods) is mostly 2 1/2 dimensional. In terms of actual gameplay, it's really no more 3D than Doom. Descent and friends, on the other hand, are 3D, but they're quite a bit less popular.

Besides, 3D is for games, which are supposed to be a challenge. Finding information on the Internet is not supposed to be a challenge. You really shouldn't be using 3D unless you're trying to represent a solid object, and even then use it sparingly. Even Q3A uses a 2D menu interface.

--
Will I retire or break 10K?

Does this really help. by The+MoMo+King · 2002-03-07 11:51 · Score: 0

I usually search for help with problems dealing with work. Most of these searches are how to do something in Linux/C++/STL/etc. My problem isn't finding good hits, but spending all the time to read those good hits to find usable information. I'm not sure an algorithm will help me out there. I would much rather have a better interface to a search engine than a better search engine algorithm.

Question: How many dimensions for the web?? by EccentricAnomaly · 2002-03-07 12:01 · Score: 2

In high school (10 years ago) we had an assignment to think about how to map cyberspace. My partner and I decided that the best way was via a Venn diagram as opposed to a physical map of where machines actually were. That is, map the information (then mostly on ftp, gopher, and non-internet BBS's) by subject. You would have a "sports" circle and a "medicine" circle and the overlap would be sports medicine. As a hobby, I've been trying to make such maps ever since.

Later, in college, I tried to model a website as a physical system with the links acting like springs. You basically make up some formula causing different web pages to repulse each other and then make links give an attractive "force" that grows with the "distance" between web sites.

This gives you a system of equations that you can solve for an equilibrium point giving an "information distance" between web sites. This will tend to group websites on similar subjects together because they tend to link to each other.... but then again, who cares how closely related sites are to each other; it should still be possible to get a cool picture with this data.
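A minimal sketch of that kind of spring model, for anyone who wants to play with it. The specific force laws (repulsion falling off as 1/d^2 between every pair, a pull proportional to d along each link), the step size and the 2-D layout are my assumptions for illustration, not the formula I used back then:

```python
import math

def spring_layout(pages, links, steps=500, step_size=0.05, max_move=0.1):
    """Relax pages toward an equilibrium layout in 2-D.

    pages: list of page names. links: iterable of (page, page) pairs.
    Every pair repels with strength 1/d^2; linked pairs are also pulled
    together with strength d, so the pull grows with distance.
    """
    linked = {frozenset(pair) for pair in links}
    # Deterministic start: pages evenly spaced on a unit circle.
    pos = {p: [math.cos(2 * math.pi * i / len(pages)),
               math.sin(2 * math.pi * i / len(pages))]
           for i, p in enumerate(pages)}
    for _ in range(steps):
        force = {p: [0.0, 0.0] for p in pages}
        for i, a in enumerate(pages):
            for b in pages[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = 1.0 / (dist * dist)      # repulsion
                if frozenset((a, b)) in linked:
                    push -= dist                # spring attraction
                fx, fy = push * dx / dist, push * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        for p in pages:
            fx, fy = force[p]
            magnitude = math.hypot(fx, fy)
            scale = step_size
            if magnitude * step_size > max_move:
                scale = max_move / magnitude    # cap runaway moves
            pos[p][0] += scale * fx
            pos[p][1] += scale * fy
    return pos

layout = spring_layout(["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")])
for page, (x, y) in layout.items():
    print(page, round(x, 2), round(y, 2))
```

The "information distance" between two pages is then just the ordinary distance between their final positions. Forcing everything into 2-D is exactly the simplification the dimensionality question is about.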

But the stopping point I came up against was how to represent these information distances as a space. I couldn't figure out how to calculate the dimensionality of the space. Was it 2-D, 3-D, or 400-D?

Here's an example of why this is a problem: Take four points that are all 1 meter from each other. In 3-D these points form the corners of a tetrahedron, but you cannot draw these points in 2 dimensions. If I have N equidistant points, I need a space with at least N-1 dimensions.
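One way to compute that number from the distances alone is sketched below. It assumes the distances really can be realized in some Euclidean space (which link-derived "information distances" may not satisfy), and it takes squared distances as integers or Fractions so the arithmetic stays exact. The idea: build the matrix of dot products relative to the first point and count its independent rows.

```python
from fractions import Fraction

def embedding_dimension(sq_dist):
    """Smallest Euclidean dimension that can hold the given points.

    sq_dist: symmetric matrix of squared distances (ints or Fractions).
    Valid only if the distances are realizable in Euclidean space.
    """
    n = len(sq_dist)
    # Dot products of (point_i - point_0) with (point_j - point_0).
    gram = [[Fraction(sq_dist[0][i] + sq_dist[0][j] - sq_dist[i][j], 2)
             for j in range(1, n)] for i in range(1, n)]
    # Rank by exact Gaussian elimination.
    rank = 0
    size = len(gram)
    for col in range(size):
        pivot = next((r for r in range(rank, size) if gram[r][col] != 0), None)
        if pivot is None:
            continue
        gram[rank], gram[pivot] = gram[pivot], gram[rank]
        for r in range(rank + 1, size):
            factor = gram[r][col] / gram[rank][col]
            gram[r] = [x - factor * y for x, y in zip(gram[r], gram[rank])]
        rank += 1
    return rank

def equidistant(n):
    """Squared-distance matrix for n points all 1 unit apart."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]

for n in (2, 3, 4, 5):
    print(n, "equidistant points need", embedding_dimension(equidistant(n)), "dimensions")
```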

So how many web pages can be at the same "information distance" from each other? How many dimensions are needed to map the web this way?

Maybe this question only interests me, but I find it fascinating.

--
There are 10 types of people in this world, those who can count in binary and those who can't.

Old stuff by Anonymous Coward · 2002-03-07 12:06 · Score: 0

I'd hate to rain on your parade, Taco, but this is hardly news in the search engine industry. There are in fact search engines which do this today and have done so for quite a while.

Now... by Anonymous Coward · 2002-03-07 12:09 · Score: 0

How about new ways to query the data?
Especially since Google tends to break its advanced operators.
http://www.google.com/help/operators.html

That is to say that they sometimes lose their special meaning

*BZZT* please try again... the real origin: by Technodummy · 2002-03-07 12:19 · Score: 5, Interesting

"Before all the ruckus of living in a "global village" where we are all connected via the internet, there was the idea of "six degrees of separation," or the "small world theory." The theory posits the idea that everyone in the world is separated from everyone else by only an average of six people. That is to say, the only thing which separates you from the President, the Pope, a farmer in China, and Kevin Bacon is six people.

It's a strange and beautiful concept. It is fascinating to think that we are all in some way interrelated by only six people or that we have some connection to people even in the remotest part of the world.

The "small world" theory was first proposed by the eminent psychologist, Stanley Milgram. In 1967 he conducted a study where he gave 150 random people from Omaha, Nebraska and Wichita, Kansas a folder which contained a name and some personal data of a target person across the country. They were instructed to pass the document folder on to one friend that they felt would most likely know the target person.

To his surprise, the number of intermediary steps ranged from 2 to 10, with 5 being the most common number (where 6 came from is anyone's guess). What the study proved was how closely we are connected to seemingly disparate parts of the world. It also provided an explanation for why gossip, jokes, forwards, and even diseases could rapidly spread through a population.

Of course, the six people that connect you and the President aren't just any six people. The study showed that some people are more connected than others and act as "short cuts," or hubs which connect you to other people.

Take for example, your connection with a doctor in Africa. Chances are your six childhood friends who you've grown up with aren't going to connect you to someone across the country, much less across the ocean. But let's say you meet someone in college who travels often, or is involved in the military or the Peace Corps. That one person who has traveled and has had contact with a myriad of other people will be your "short cut" to that doctor in Africa.

Likewise, say that you want to figure out your connection to a favorite Hollywood socialite. If you have a friend who is well connected in the Industry, that person will act as a bridge between your sphere of existence and the Hollywood circuit.

The Proof

Mathematicians have created models proving the validity of the "small world" theory.

First, there is the Regular Network model where people are linked to only their closest neighbors. Imagine growing up in a cave and the only people you have contact with for the rest of your life are in that cave with you.

Then there is the Random Network model where people are randomly connected to other people regardless of distance, space, etc..

In the real world, human interconnectedness is a synthesis of these two models. We are intimately connected to the people in our immediate vicinity (Regular Network), but we are also connected to people from distant random places (Random Network) through such means as travel, college, and work. It is by our intermingling with different people that our connections increase.
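The effect of mixing the two models can be sketched in a few lines, in the spirit of the Watts-Strogatz small-world model. This variant adds random shortcuts to a ring of neighbors instead of rewiring existing links (an assumption made here so the network always stays connected); the sizes and the seed are arbitrary:

```python
import random
from collections import deque

def ring_lattice(n, k=1):
    """Regular network: each node knows its k nearest neighbors per side."""
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            graph[i].add(j)
            graph[j].add(i)
    return graph

def add_shortcuts(graph, count, seed=0):
    """Return a copy of the network with `count` random long-range links."""
    rng = random.Random(seed)
    nodes = list(graph)
    new_graph = {node: set(neighbors) for node, neighbors in graph.items()}
    for _ in range(count):
        a, b = rng.sample(nodes, 2)
        new_graph[a].add(b)
        new_graph[b].add(a)
    return new_graph

def average_path_length(graph):
    """Mean number of hops between reachable pairs, by breadth-first search."""
    total = pairs = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

regular = ring_lattice(200, k=2)
small_world = add_shortcuts(regular, 20)
print("regular network:", round(average_path_length(regular), 2))
print("with shortcuts: ", round(average_path_length(small_world), 2))
```

Adding links can never lengthen a shortest path, so the second figure is at most the first; in practice a handful of shortcuts is enough to cut it sharply.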

You may meet someone in class that is from a different country, or whose father works in Hollywood, or whose mother owns a magazine. By this mingling and constant interaction your potential contact with the rest of the world increases exponentially.

The Internet

The Small World theory is interesting in light of recent advances in communication technology--namely, the internet.

You can now instantly make contact with someone across the world through a chat room, email, or through ICQ. In all of human history, it has never been easier to get in touch with someone across the globe.

The great irony, of course, is that although we are making contact with such a vast number of people, the quality of the contact is becoming terribly depersonalized. Our email, chat, and ICQ friends may number in the hundreds, but for the most part we'll only know them as a line of text skittering across the screen and a computer beep.

That's not to say that there is never a crossover from the virtual world of the internet to the "real" world. But a majority of the time, the closest you'll get to actually meeting your fellow e-buddies in the flesh is the pictures they email you (notice how everyone oddly looks like Pam Lee or Tom Cruise), or a series of smilies (meet my friend Sandra :), Jenny :P, Bill :{, and Chrissy 23).

Never in the history of mankind has there been so much technology to keep us connected, with so little true connection. Everything from cellular phones and pagers to voice mail and email was designed so that we would never be alone again. Human contact would only be a few convenient buttons away. But what seems to be happening is that the convenient buttons are superseding real people. Despite the appearance of all this technology, we're still pretty much where we started, with the exception of a motley crew of digital displays, flashing lights, and cutesy computer alerts to keep us company.

Don't get me wrong. The Internet Revolution is great and is making our lives easier. But as with ice cream, money, and sex -- too much of a good thing can be bad (money and sex are sometimes exceptions). What good are all the conveniences and promises of instant material gratification if you don't really live? The virtual world is good, but we shouldn't forsake the real world for it. The macabre image in the Matrix where we are all plugged into computers unbeknownst to us is a parable of what could be our future: a future where people never leave their homes and where we're all so dependent on computers that we wouldn't be able to walk outside without a pang of separation anxiety.

As we enter the new millennium, there is no doubt that we will be living increasingly wired existences. Perhaps Milgram's study will be annotated, and perhaps we will find that we're only separated by three degrees of email. But what good is that if the only "handshakes" going on are between our computers??

Russ

it's been said before by Technodummy · 2002-03-07 13:10 · Score: 2

but it's worth repeating.

Google.com is popular because of its high moral ground, which it has had since the beginning.

I personally switched to Google because:

* it gave me more accurate results
* it has a fast loading page
* it had an honest results policy
* it's not a parasite site, running on the coat tails of others (eg. metacrawler)

The reasons I continue to use Google are:
* as above
* it has inoffensive (to me) advertising
* it has a toolbar that saves me time on searching
* it's as good as a spellchecker
* it can display pdf files in html
* it can search pdf files
* google cache

you can feel either bad or good... by Technodummy · 2002-03-07 13:18 · Score: 2

they have both... while they get a lot of extra links, some of these "escape" links are really quite sarcastic in their reference to Disney, and it's not necessarily a compliment to their content... I have seen a lot of humor sites that link to Disney if you can't handle jokes etc... really, I'm not sure that Disney deserves feeling at all, it's merely a quirk of society...

I am currently... by cr0sh · 2002-03-07 13:41 · Score: 2

...about 1/2 to 2/3 of the way through a book I am reading called "Emergence: The Connected Lives of Ants, Brains, Cities, and Software" by Steven Johnson.

At the point where I am reading, Johnson is discussing how emergence is closely tied to feedback, and without feedback, emergence doesn't occur. Thus, cities and businesses tend to be emergent (is that a word?) entities, while the web typically isn't. Because links on the web tend to be "one way", and information isn't communicated back, he argues that emergence can't take place.

Someone else has made a post here discussing how on Japanese web sites, it is expected that before you link to a site, you ask the operator of the site permission. The poster then says that for American sites, it is more of a "sprinkle willy-nilly" (my words) type reference, without regard for the operators of the sites. However, at one time, netiquette was indeed to ask the operators to "swap links" - I remember doing this quite often. But I think what happened is that when businesses and the "ignorant masses" came online, less link-swapping occurred because many times you would email the admin of the machine, and never get a response. The feedback link was broken.

Johnson uses this argument to further his statement that because of this, the Web won't be emergent. But will the Japanese web spawn emergence?

Johnson then goes on to talk about weblogs (though he doesn't use that term), referencing /. specifically - and noting how the whole rating and karma system gives rise to feedback, and may allow such discussion groups to become, over time, emergent. I haven't read anything yet about p2p in the book (the way the book reads, it seems like it was written or originally published longer ago than it seems), but I tend to wonder if emergence will be found there...

These kinds of search engine technologies might help make the web turn around, and allow it to become emergent. I don't know if such a thing would bode well for humanity, but it would be very interesting to see such a thing in practice (I highly recommend the book I referenced above if you are into this kind of thing - it makes an excellent sequel of sorts to the book "Out of Control")...

--
Reason is the Path to God - Anon

Teoma by Squalish · 2002-03-07 13:59 · Score: 1

Google does not do this, but Teoma has done it for some time, with results on the same level of accuracy as Google.

--
People in Soviet Russia, however, appear to be afflicted with amusing juxtapositions of the aforementioned situation

Reinventing Google's bicycle by sipan · 2002-03-07 15:10 · Score: 1

Sounds like it is not too different from Google's algorithm of using cross links as votes.

Nature missed the point by David+Cohn · 2002-03-07 21:01 · Score: 1

Gary Flake's work is excellent stuff, but I think Nature missed his point. The bibliometrics community has been doing link analysis for *50* years to identify academic communities by co-citation analysis. For some pointers, try Googling "cocitation analysis"
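For anyone who hasn't seen it, the core of co-citation analysis can be sketched in a few lines of Python. This is only an illustrative toy, not any particular bibliometrics package: the function name and the sample data are made up. Two documents are counted as co-cited once for every third document that cites both of them.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count how often each pair of documents is cited together.

    `citations` maps a citing document to the documents it cites.
    """
    counts = Counter()
    for cited in citations.values():
        # Each unordered pair in one reference list is co-cited once.
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical reference lists for three papers.
    sample = {
        "paper1": ["A", "B", "C"],
        "paper2": ["A", "B"],
        "paper3": ["B", "C"],
    }
    for pair, count in cocitation_counts(sample).most_common():
        print(pair, count)
```

Pairs with high counts are candidates for belonging to the same community. On the web, swap "paper" for "page" and "reference list" for "outgoing links".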

New Search Algorithm?! by Anonymous Coward · 2002-03-07 21:11 · Score: 0

This is a laugh. In Information Systems lingo this is (very) old news. It's a variation on * clustering algorithms *, of which there are plenty.

This is the Interspace :) by bilalak · 2002-03-07 21:33 · Score: 1

The most complete pilot program on this topic is the Interspace, I think. This project would lead the whole internet into a new phase where abstract content would classify the concepts (not the pages) and provide another model for distributing media across the Interspace (the future of the Internet). IEEE Computer Society magazine suggests that this Interspace would be online from 2005. Then all that we know about online info would change :) Please have a look at The Interspace. Currently I am working on a similar model for Arabic culture. .Bil

Citeseer by Where's+my+towel · 2002-03-08 00:07 · Score: 1

This research is from the same people who created CiteSeer

but what about ads? by PiGuy · 2002-03-08 01:19 · Score: 0

Web pages contain lots of links to ads, some even more than they have useful links. If user-end solutions can barely filter these links/images out correctly, how is a server-side search engine supposed to do the same w/o slowing down the search immensely and/or misjudging many links? On the other hand, it would be a good way for the search engines to get advertising in...

Not a problem at all by TuringTest · 2002-03-08 01:46 · Score: 1

Since so many of the Adult sites seem to have their "Please leave now..." links pointed at disney.com or nickelodeon.com or something.... will they end up in the adult communities?

Don't worry... the circle won't be complete until disney.com links to many of the Adult sites! 8)

--
Singularity: a belief in the "God" idea with the "demiurge" relation inverted.

Re: Not a problem at all by bmorton · 2002-03-08 02:55 · Score: 1

Don't worry... the circle won't be complete until disney.com links to many of the Adult sites! 8)
"If you are over 18, please click here immediately."

Adversarial Information Retrieval by volts · 2002-03-08 01:55 · Score: 1

I worked in the search engine industry for some years. The term 'search algorithm' is a facile gloss over what is a complex business, usually meaning how a set of indexed objects are selected and ranked for presentation to the user. Link analysis, clustering and 'communities of interest' have been heavily investigated for years by both commercial and academic groups. There are various flavours ranging from a global popularity measurement, like Larry Page's "PageRank" (part of Google's ranking algorithm) to damped term-frequency schemes in which pages linked to and linking from a page contribute search terms (keywords) on a sliding scale based on link type and 'distance' from the page. It is very hard to tell from the press release the particular twist these guys are applying.
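As a rough illustration of the "global popularity measurement" flavour, here is the simplified textbook form of PageRank as a Python sketch. It is certainly not what Google runs in production. The damping value of 0.85, the iteration count and the tiny link graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-page web: three pages "vote" for page c.
    graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

Each link acts as a vote whose weight depends on the rank of the page casting it, which is why the scheme has to be iterated until the scores settle.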

The problem with most of these schemes is that they don't scale worth a damn or fail in the real world. What works on a small number of sites fails completely when you want to enable users to search the 'entire web'. Web search engines run very tight code on very many machines and have to cope with the fact that there is a lot of crap out there. Assumptions about link popularity follow the old rule that an optimist is likely to be disappointed but a pessimist can always be pleasantly surprised.

Search engines have two ways of discovering pages: spidering and direct submission by site owners. There is quite an industry spamming them both ways. Spam here meaning trying to trick the engine into returning your site as relevant to a query when it isn't. Everyone wants to be top of the result list, even for searchers who are looking for something else - the goal is to drive traffic by any means.

All the engines are secretive about their spidering and ranking approaches because they are in an ongoing arms race with the spamming industry. Andrei Broder at AltaVista describes himself as being in the business of "adversarial information retrieval". 95% of submissions to AltaVista are rejected as spam. At one time Infoseek had a site try to submit 2 million pages for indexing and was forced to limit all sites to one page.

Spammers try to increase their rankings by cross linking huge numbers of URLs. This is one reason why there are so many porn pages that consist of nothing but banner ads and links to other pages of the same. So link clustering analysis is used to eliminate many of these from being indexed.

It's pretty amazing that Google et al. produce results as good as they do!

Sorry about the pissing contest by John+Harrison · 2002-03-08 03:56 · Score: 1

You asked how I knew. I answered. I am sorry about being annoyed. Thanks for your response.

--
Lasers Controlled Games!

Damn! by ahogue · 2002-03-08 04:16 · Score: 1

They stole my idea for the Google Programming Contest!

Related Sites from Links - Try Alexa by markwelch · 2002-03-08 04:19 · Score: 1

I checked some of my sites at Alexa, which tries to identify "related" sites. Sometimes, it shows mostly my other sites (since most of my sites link to each other). But often, it actually seemed to identify sites that are in fact "similar" or "related" in ways that seem correct. Note that Alexa uses actual user-activity data from its toolbar users, but it also appears to integrate that data along with keywords and links within sites. Try it with your own sites, or a site representing a particular community of interest, and see what results you get: http://info.alexa.com/

--
-- http://www.MarkWelch.com/ Pleasanton California

researchers also have conflicts by Untimely+Ripp'd · 2002-03-08 05:25 · Score: 1

often the case when there is no conflict of interest for a document to link to another document (as in the case of researchers linking to other works in their field)

How drolly non-cynical. In fact, "citation inflation" is rife in the academic community: people cite their own papers, and they cite their colleagues' papers. Why not? It's free to cite your friends, it boosts their careers, and it's likely to result in a reciprocal citation.

Closely related is the practice of adding authors to your papers. I've seen CVs full of listings with six or more authors. It seems like a reasonable thing to do -- giving credit to everyone who helped you develop your ideas -- but what it really does is water down the value of the authorship, as suddenly a person who has only ever really participated in two or three meaningful projects has a CV with four pages of publications. (Or 8, or a dozen!)

--

And let the angel whom thou still hast serv'd tell thee ...

Bothered by the yahoo link in position 1? by BlueUnderwear · 2002-03-13 03:11 · Score: 2

Just search for mouse, please leave now, or better: mouse, please get the hell out of here

--
Say no to software patents.

Slashdot Mirror

230 comments