Well, this is not entirely on the topic of indexing dynamic content, but bear with me. The increasing difficulty of getting relevant search results has long been a pet peeve of mine. There are several factors that make good results hard to find:
1. sites that abuse meta tags and include pages of keywords just to get hit more - this makes results of keyword searches less relevant.
2. the explosive growth of the web - making 'quality' sites more difficult to find amidst the deluge of junk
3. dead links, outdated information, changed dynamic content indexed by search engines - the web changes too quickly for most engines to keep up
So what can we do to get high quality, relevant results without weeding through pages of URLs? It's not easy, but I've been playing with an interesting approach. First off, different search engines use different indexing/ranking methods: keywords, meta tags, link count, traffic stats, user recommendations, human categorizing. By combining the results of several engines that use different indexing methods, you can cross-reference them and see which URLs appear on all the engines. This gives you at least *some* degree of assurance that the URL matches your query. The combined results are ordered by the number of engines that reported each link.
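To make the cross-referencing step concrete, here is a rough Python sketch. It is only an illustration of the idea, not the actual tool: the function names `normalize` and `cross_reference`, the engine labels, and the URL normalization rules are all assumptions made up for the example, and fetching results from real engines is left out entirely.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalize(url):
    """Crude normalization (an assumption) so trivially different
    spellings of the same URL compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def cross_reference(results_by_engine):
    """Count how many engines reported each URL and rank by that count.

    results_by_engine maps a (hypothetical) engine name to the list of
    URLs it returned for the query.
    """
    counts = Counter()
    for urls in results_by_engine.values():
        # Use a set so one engine listing a URL twice still counts once.
        for url in {normalize(u) for u in urls}:
            counts[url] += 1
    # Most-reported first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


results = {
    "engine_a": ["http://Example.com/", "http://foo.org/x"],
    "engine_b": ["http://example.com", "http://bar.net/"],
    "engine_c": ["http://example.com/", "http://foo.org/x/"],
}
ranked = cross_reference(results)
for url, engine_count in ranked:
    print(engine_count, url)
```

A URL that every engine agrees on floats to the top, and anything reported by only one engine sinks to the bottom, where it can be cut off or kept as a fallback.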
Now that we have maybe 20 or 30 semi-relevant URLs, the next step (which I have not coded yet) is to retrieve these pages and parse them using natural language processing techniques. This should give a good idea of what kind of content the page actually holds, i.e., is it an order form, a page full of pictures, a magazine article, a threaded discussion, etc.? From that, and from stored or learned user preferences, a better list of results can be shown to the user.
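Since that step isn't coded yet, the following is only a stand-in to show the shape of it. It does not do natural language processing at all; it swaps in a much cruder trick, counting HTML tags and words, to guess the page type. The class `PageStats`, the function `guess_page_kind`, the category labels, and every threshold are invented for the example and would need tuning against real pages.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Collects a few counts from a page: forms, inputs, images, words."""

    def __init__(self):
        super().__init__()
        self.tags = {"form": 0, "input": 0, "img": 0}
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.tags[tag] += 1

    def handle_data(self, data):
        # Counts all text, including script/style content (a known flaw).
        self.words += len(data.split())


def guess_page_kind(html):
    """Very rough guess at the kind of page; thresholds are arbitrary."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    if stats.tags["form"] and stats.tags["input"] >= 3:
        return "form"
    if stats.tags["img"] >= 5 and stats.words < 20 * stats.tags["img"]:
        return "pictures"
    if stats.words >= 200:
        return "article"
    return "unknown"


print(guess_page_kind("<form><input><input><input></form>"))
```

Telling a magazine article from a threaded discussion would need something smarter than this, which is where the actual language parsing would have to come in.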
OK, so the easy part of this is done, and it just automates some manual searching :) The drawbacks? Well, it still doesn't solve the problem of dead links very well, and it's slow (approx 30 seconds): it has to hit 6 engines, then collate and analyze the results before you see anything. Adding the language parsing will make it even slower. Caching results in a database could speed up common searches, with entries expired periodically and refreshed in the background.
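The caching idea might look something like the sketch below. Big caveats: a plain in-memory dict stands in for the database, the background refresh is left out (stale entries are simply re-fetched on the next request), and the class name `SearchCache`, its parameters, and the one-hour default lifetime are all assumptions for illustration.

```python
import time


class SearchCache:
    """Caches results per query and re-runs the search once they expire."""

    def __init__(self, search, ttl_seconds=3600, clock=time.time):
        self._search = search      # the slow multi-engine search function
        self._ttl = ttl_seconds    # how long a cached result stays fresh
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # normalized query -> (stored_at, results)

    def get(self, query):
        # Treat queries differing only in case or spacing as the same.
        key = " ".join(query.lower().split())
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # fresh hit: skip the slow search
        results = self._search(query)
        self._entries[key] = (now, results)
        return results


calls = []

def fake_search(query):
    """Stand-in for the real (roughly 30 second) search."""
    calls.append(query)
    return [f"result-{len(calls)}"]

fake_now = [0.0]
cache = SearchCache(fake_search, ttl_seconds=10, clock=lambda: fake_now[0])
first = cache.get("Linux  kernel")
second = cache.get("linux kernel")   # served from the cache
fake_now[0] = 11.0                   # pretend 11 seconds have passed
third = cache.get("linux kernel")    # expired, so the search runs again
```

Only the first request for a common query pays the full cost; repeats within the lifetime come back immediately.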
Is this where searching is headed? Maybe. I don't pretend to know, and I only started messing with it out of frustration. In any case, it seems to work pretty well already, and could probably be expanded into a pretty decent agent/search tool (open source, of course!). If anyone is interested in helping to develop such a beast, let me know.
--segfault at netwinder dot org
a big "hello dj wandle" from the former corel computer engineering team :)