Indexing Dynamic Sites For Search Engines?
Moeses asks: "I am working on a Web site that uses the Altavista search engine software. The latest version of the site has moved most of the data from static pages to dynamic pages. This causes some issues, but I've developed workarounds for most of them, such as generating pages with URLs that contain all the query string information (so the whole database gets indexed), and code to handle situations where a user searches for something that can't be displayed because of some state information specific to that user's session. There are still enough issues that I can't index all the states of the files that I need. Building a custom search engine for the database isn't within the budget of this project. What are you others doing to index and search your dynamic sites?"
The author has revisited this subject. Here is the updated version. IMHO very good article.
If you have a more sophisticated search engine that can deal with item tagging (for metadata like keywords, creation dates, authorship, description, title, etc.), all the better. Create your text files with the appropriate tags and metadata pulled from your database and get that indexed too, and when displaying search results you can parse it back out of the text file or straight from the database if you want. Verity's engine is very nice for this.
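A rough sketch of the tagged-file idea, in case it helps. It assumes the engine picks up plain HTML <meta> tags; the field names, the record layout and both function names are made up for illustration, and a real engine (Verity or otherwise) will have its own tag conventions.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical metadata fields; a real engine defines its own tag names.
META_FIELDS = ("keywords", "author", "date", "description")


def render_tagged_document(record: dict) -> str:
    """Render one database row as a small HTML file carrying <meta> tags."""
    meta = "\n".join(
        f'<meta name="{name}" content="{escape(str(record[name]), quote=True)}">'
        for name in META_FIELDS
        if name in record
    )
    return (
        "<html><head>\n"
        f"<title>{escape(record['title'])}</title>\n"
        f"{meta}\n"
        "</head><body>\n"
        f"{escape(record['body'])}\n"
        "</body></html>\n"
    )


class MetaReader(HTMLParser):
    """Collect <meta name=... content=...> pairs back out of a rendered file."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs:
                self.meta[attrs["name"]] = attrs.get("content") or ""


def read_metadata(document: str) -> dict:
    """Parse the metadata back out of a file when displaying search results."""
    reader = MetaReader()
    reader.feed(document)
    return reader.meta
```

You would call render_tagged_document once per row, write the result where the indexer can find it, and use read_metadata (or just the database) when building the results page.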
i have this problem as well. so i did the halfway solution. i wrote a simple script that iterates through the dynamic pages outputting them as static html files. these files are submitted to the search engines.
the problem is of course the pages end up getting old. no problem, add a little "this is an archived version of this page, please click here for the newest version" message. rerun the script when necessary.
i did this and was able to submit all my dynamic pages to altavista. what i also did was add an additional little "prev | next" link at the bottom, so a spider could start at one page and follow links to the end. i went further and created a hallway page to submit to altavista.
also, the pages are flat so they tend to load faster than dynamic ones.
check out the page i submitted to AV, an old archived page (contains the prev | next links @ bottom), or the live homepage
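for anyone who wants to try the same thing, here is a minimal sketch of such a script. it is not the script described above: the file naming (page_<id>.html), the live URL pattern and the fetch callback are all assumptions, and in practice fetch would probably request the live page over HTTP.

```python
from pathlib import Path
from typing import Callable, List, Sequence


def snapshot_pages(
    page_ids: Sequence[int],
    fetch: Callable[[int], str],
    out_dir: Path,
    live_url: str = "/news.php?id={id}",  # hypothetical live URL pattern
) -> List[Path]:
    """Write each dynamic page to out_dir as a flat HTML file.

    `fetch` returns the rendered HTML body for one page id.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pos, page_id in enumerate(page_ids):
        # Notice pointing visitors from the stale copy to the live page.
        notice = (
            "<p>This is an archived version of this page. "
            f'<a href="{live_url.format(id=page_id)}">'
            "Click here for the newest version</a>.</p>"
        )
        # prev | next links so a spider can walk the whole archive.
        links = []
        if pos > 0:
            links.append(f'<a href="page_{page_ids[pos - 1]}.html">prev</a>')
        if pos < len(page_ids) - 1:
            links.append(f'<a href="page_{page_ids[pos + 1]}.html">next</a>')
        nav = "<p>" + " | ".join(links) + "</p>"

        path = out_dir / f"page_{page_id}.html"
        path.write_text(
            f"<html><body>\n{notice}\n{fetch(page_id)}\n{nav}\n</body></html>\n",
            encoding="utf-8",
        )
        written.append(path)
    return written
```

rerun it from cron (or by hand) whenever the archived copies get too old.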
NEWS: cloning, genome, privacy, surveillance, and more!
There is an old article on PHPBuilder.com that describes a method for creating dynamic, indexable pages. The article is written for PHP, but you should be able to use the same technique with other languages. Even if it doesn't work for all your pages, it is still a useful technique.
use mod_rewrite to make your dynamic pages look like static html.
An example:
you have a script called news.php and a news index id (e.g. news.php?id=42).
You could map that to
news_id42.html with
RewriteEngine on
RewriteRule .*_id(.*)\.html$ news.php?id=$1
in your
.htaccess.
Voila! Your dynamic content looks exactly like a static html page.
Another one is to fool search engines into thinking that the script is a directory:
foobar.php/param1/param2/
Works perfectly fine ... ;) but this technique doesn't seem to work with all search engines
(don't remember which ....)
regards,
Michael
Samba Information HQ
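To check what the two URL tricks above actually do before touching a live .htaccess, a small stand-alone sketch like this can help. It only mimics the mapping: the regex is the one from the RewriteRule, while the function names and the assumption that the script name ends in .php are mine.

```python
import re
from typing import List, Optional, Tuple

# Same pattern as in the RewriteRule from the comment above.
STATIC_NAME = re.compile(r".*_id(.*)\.html$")


def rewrite(path: str) -> Optional[str]:
    """Map a static-looking name to the dynamic URL, or None if no match."""
    m = STATIC_NAME.search(path)
    if m is None:
        return None
    return f"news.php?id={m.group(1)}"


def split_script_path(path: str) -> Tuple[Optional[str], List[str]]:
    """Split 'foobar.php/param1/param2/' into the script and its parameters.

    Assumes the script is the first path segment ending in '.php'.
    """
    parts = [p for p in path.split("/") if p]
    for i, part in enumerate(parts):
        if part.endswith(".php"):
            return "/".join(parts[: i + 1]), parts[i + 1:]
    return None, []


if __name__ == "__main__":
    print(rewrite("news_id42.html"))
    print(split_script_path("foobar.php/param1/param2/"))
```

Note that the capture group (.*) accepts anything, not just digits, so news.php still has to validate the id it receives; a stricter pattern such as news_id([0-9]+) would be one option.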
Depending on the size of the site, indexing in background constantly is a good solution. If the site is big, make the server code generating dynamic pages support If-Modified-Since and use a search-indexer-spider like Alkaline.
dB@dblock.org
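A minimal sketch of the If-Modified-Since handling suggested above, assuming the page generator can tell when the underlying data last changed. The function name and the (status, headers) return shape are hypothetical; wire it into whatever your server code uses.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def conditional_response(
    last_modified: datetime, if_modified_since: Optional[str]
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a dynamic page.

    `last_modified` must be a timezone-aware datetime saying when the
    page's data last changed; `if_modified_since` is the raw request
    header value, or None if the client did not send one.
    """
    # HTTP dates are in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: ignore it and send the page
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # unchanged: the spider can skip re-indexing
    return 200, headers
```

With this in place an indexing spider that sends If-Modified-Since only downloads pages whose data has changed since its last visit, which is what makes constant background indexing affordable on a big site.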