Slashdot Mirror


Is the Internet Becoming Unsearchable?

wergild asks: "With more and more sites going to a database driven design, and most search engines not indexing anything that contains a query string in it, we're missing a lot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this? How can you use dynamic, database driven content and still get it indexed into the major serach engines?" Is keyword searching obsolete? Do you think it's time to index sites by the type of content they carry rather than the content itself? Will larger indexing databases (or a series of smaller, decentralized ones) help?

313 comments

  1. This is why generic domains are worth something. by Anonymous Coward · · Score: 0

    But I'm not saying this necessarily justifies some of those million-dollar sales.

  2. Directories by Anonymous Coward · · Score: 1

    The one constant throughout the history of internet technology is that directory entries (open directory, yahoo) with structured, categorized results are and always will be superior to free text search for anything that isn't completely obscure.

    1. Re:Directories by mcrandello · · Score: 1

      That's the problem. Obscure is all I ever deal with. You are right about hierarchies, however. miningco is a pretty nice one as well. Oh, and what became of newhoo also....Just thought I'd point them out because I had almost completely forgotten about dmoz, and I keep stumbling across miningco pages from other searches, and they seem worth mentioning...


      mcrandello@my-deja.com
      rschaar{at}pegasus.cc.ucf.edu if it's important.

    2. Re:Directories by shub · · Score: 1

      Directories are nice, but I think IBM's Clever project demonstrates that intelligent searching can actually do quite a whole lot. Read the Scientific American article on the subject for more information.

      Of course, there's also Google, which although perhaps not quite as intelligent as Clever, seems to be light-years ahead of everything else I've ever seen.

      --
      Brad Knowles
      http://daily.daemonnews.org/ -- if you're not
    3. Re:Directories by maney · · Score: 1

      Maybe the solution would be, partially, to move to a categorical system that would automatically assume that all pages are personal pages unless you explicitly specify something else in your metatags or robots.txt file. Then you divide your databases along the lines of categories (say: personal, business, organization, government, education. You know, like the way the top level domains were intended to work).

    4. Re:Directories by Anonymous Coward · · Score: 0

      I use http://oneMission.com for searching and adding pages. It's like the open directory project, but without AOL above it. Marc.

    5. Re:Directories by Balazs · · Score: 1

      categorized results are and always will be superior to free text search

      Not necessarily. There are searches that aren't possible in a directory (I teach net searching at the University of Vienna).

      Try to find box office earnings of a film by browsing Yahoo. You'll need to read many pages; with FAST Search you really only need to enter the keywords. (At least that is my experience in the course.)

      --
      Computers. You can't live with them, you can't live without them.
    6. Re:Directories by greenrd · · Score: 1
      What became of newhoo also

      That turned into ODP. Well, we're only a tiny outfit, only 1.3 million sites, only 20,000 editors, only being used by AOL Search, Netscape Netcenter, Hotbot, Lycos, Altavista... All thanks to our Open Content license! See for yourself.

  3. Directories by Anonymous Coward · · Score: 0

    The one constant throughout the history of internet technology is that directory entries (open directory, yahoo) with structured, categorized results are and always will be superior to free text search for anything that isn't completely obscure.

  4. Catchup by Phule77 · · Score: 2

    I think we've actually hit another period, technologically, where we're advancing too fast for active standards on "how things should be done" to make things like searching pages/web databases/etc. an accessible, easy thing. It's probably going to take a while...it seems like every month they come out with a new way of doing things, a new "language that's going to change the world!", a new proprietary language/program for corps to use. Until that dwindles, for whatever reason, the web is going to continue to be behind in terms of searchability.

    --
    Listen to me Peter, I want this bench. You go sit on that bench over there, and if you're good I'll tell you the rest of
    1. Re:Catchup by NotQuiteSonic · · Score: 1

      It's true. What we need is a new standard. The thing that bugs me most about search engines is getting multiple hits on a particular site. If you could distribute the load of the search to the site itself (have a standard search db that could talk to the search engines) and simply index the general idea of the site through meta tags, then using a search engine (which would be more like a distributed database browser) you could browse the individual and up-to-date search dbs of each site.

    2. Re:Catchup by Chocky2 · · Score: 1
      I suspect we'll see the emphasis shifting towards more specialised, manually or semi-manually, compiled indices of websites, complemented by robots searching these manually created indices for new/relevant/expired items of information/people/links.

      And alas, as far as the crap that accounts for 75% of the web goes, the cost of accommodating the vast quantities of crap is less than the cost of removing it or improving ways of avoiding it; and until that is no longer the case we're going to have to put up with ever increasing quantities of sewage.

  5. First comment? by Anonymous Coward · · Score: 0

    anyways this is very true. I have a site that all of it is database driven and it uses mysql and php and the pages.

  6. Searching searches? by Baloo+Ursidae · · Score: 1

    One idea would be to have a centralized authority have individual machines scan for sites a la distributed.net to expand existing databases. Would this be possible?

    --
    Help us build a better map!
    1. Re:Searching searches? by MrIgnorant · · Score: 1

      The distributed idea is interesting, but I think the problem lies not in the actual power and bandwidth of the search, but more in what exactly we are searching for. Much like the article says, I've noticed myself that many of the search engines today don't find exactly what i'm looking for and I myself am still stuck sifting through their results.

      I think the search engine community needs a paradigm shift in their way of approaching searches now, with the curve dynamic information has thrown at them. I don't know how well standards would work in this situation. It's up to the search engines to come up with a new way of sorting the huge amounts of data they collect in an orderly fashion so they can serve us searchers with exactly what we ask for. Ok, so "exactly" is probably stretching it a bit, but I'll settle for pretty damn close.

    2. Re:Searching searches? by Baloo+Ursidae · · Score: 1

      This is a very good point. However, I'm thinking something along the lines of a distributed Google, where it's pretty hard to doctor it (short of a script that possibly could get locked out IP wise due to hammering). This statement was made under the rather reckless (but possibly true) assumption that Google finds what you're looking for by using other people's successful results.

      --
      Help us build a better map!
  7. Extend the Robots.txt protocol... by Anonymous Coward · · Score: 3

    ...to "force" search engines to search certain pages. Currently, you can only tell a searchbot to "piss off". There is no way to tell a searchbot "hey!!!! come look at this...."
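To make the complaint above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the paths, and the bot name `examplebot` are all made up for illustration. It shows that the rules can only say where a bot may not go; nothing in them says "come look at this".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a database-driven site. The only thing these
# rules express is exclusion: "stay out of here".
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /search
"""


def allowed(path, agent="examplebot"):
    """Return True if the rules above permit `agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)
```

Calling `allowed("/cgi-bin/article.cgi?id=42")` reports the page as excluded, while `allowed("/articles/42.html")` merely reports it as permitted; there is no way to mark it as recommended.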


    1. Re:Extend the Robots.txt protocol... by technos · · Score: 2

      This is probably the simplest solution. Just use an 'insteadof' tag in robots.txt to redirect the `bot to a meaningful page. As is, sites are using hidden pages gobbed full of metatags, relevant text, etc. Point the bot there! You could perhaps just use a space delimited section of the database, and include some standard for how the searchbot indexes the new 'insteadof'.
      'Karma, Karma, Karma Chameleon' -- Boy George

      --
      .sig: Now legally binding!
    2. Re:Extend the Robots.txt protocol... by Hard_Code · · Score: 1

      'Karma, Karma, Karma Chameleon' -- Boy George
      "Now I think the Karma Cops are after you." --Aerosmith
      ...
      "This is what you get...This is what you get...This is what you get, when you /mess/ with us..." Karma Police, Radiohead

      Jazilla.org - the Java Mozilla

      --

      It's 10 PM. Do you know if you're un-American?
    3. Re:Extend the Robots.txt protocol... by phrogman2 · · Score: 1

      If we added an extension to the Robots.txt file that said look here for the content, then published search-relevant static content derived from the dynamic content to those pages so that any visiting spider can use this (otherwise inaccessible) static content to assess the site this might work.

      This would work perfectly if the web was still Academic in nature. However, now that Business has got its grubby little hands on the web, it is no longer feasible. The first thing that would happen is Porn sites would derive static content that was of interest to regular searchers but which did not in any way reflect their actual content, in the same manner that META tags are abused today. Similar misrepresentation of every sort is endemic on the web. I deal with it daily at work.

      The only reliable methodology is to actually index what is visible on the web. If this means we have to figure out how to do dynamic queries when spidering then so be it. Implementing this is another matter :)

      One mistake I notice a lot here on /. is people thinking that when they do a search query, the search engine actually looks at websites. This is an elementary, but important mistake. When you do a query using a search engine, all you are searching is their index of that portion of the web that they have spidered previously. The more recent the spidering of a site, the more recent its results. Actually searching the web as a result of a query would be generally way too time consuming to satisfy most users. Do you want to make a query and get the results back hours later?

      Just my thoughts...

      (Note: my views are not necessarily those of my employer) :)
      Phrogman
      Cybrarian@maplesquare.com
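The point about queries hitting an index rather than the live web can be sketched in a few lines. This is a toy inverted index, not how any particular engine works; the URLs and page texts are invented, and `CRAWLED` stands in for whatever the spider saw on its last visit.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose stored text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every query word, consulting only the index."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Snapshot taken when the (hypothetical) spider last visited these pages.
CRAWLED = {
    "http://example.com/a": "box office earnings for recent films",
    "http://example.com/b": "linux device driver downloads",
}
INDEX = build_index(CRAWLED)
```

`search(INDEX, "box office")` answers from the snapshot alone. If a page changes after the crawl, the result stays the same until the spider comes back and the index is rebuilt.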

    4. Re:Extend the Robots.txt protocol... by acarlisle · · Score: 1

      . . . that way, all pornography sites have to do is put the entire dictionary on their site, then every search you do will include their page. Great idea - currently, not enough of my search results are off topic.

  8. Distributed Databases? by Anonymous Coward · · Score: 2

    What if we just have a standard search interface that can be built in to any DB driven website....say it returns XML or WDDX or something. So now when the search engines hit a DB driven site, it goes ahead and creates an index through this interface. I guess like a DNS zone transfer.....hmmmm...

    1. Re:Distributed Databases? by HerrNewton · · Score: 2

      Actually you can do something similar already. Just sniff for the robot's user-agent and display different content for the robot, like a very structural, correctly coded HTML index of your site.
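A rough sketch of the user-agent sniffing described above, assuming the server hands the request's User-Agent header to a function like this. The marker list and the function names are illustrative guesses, not a real or complete list of crawler signatures.

```python
# Substrings assumed to identify crawlers; real lists are longer and change.
ROBOT_MARKERS = ("googlebot", "slurp", "scooter", "crawler", "spider")


def is_robot(user_agent):
    """Guess whether a User-Agent header belongs to a search robot."""
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)


def choose_page(user_agent, site_index_html, normal_html):
    """Serve the plain structural index to robots, the normal page to people."""
    return site_index_html if is_robot(user_agent) else normal_html
```

For example, `choose_page("Googlebot/2.1", index_html, page_html)` would return the structural index, and a browser string such as `"Mozilla/4.7 [en] (X11; Linux)"` would get the normal page.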

      --

      ----
      Am I the only one who thinks Microsoft is a misnomer? Perhaps Macrosoft would be a better fit?
  9. The effort put in by sites makes up for it. by ItsIllak · · Score: 1

    I think the effort many of us put in to make sure that the relevant site info is indexed by the engines makes up for it. Many of my sites include special pages that only the search engines get to increase their "relevance" in the engines' databases. What does pose a problem is where people are totally abusing the indexing methods to get their site promoted in searches that they shouldn't. I don't see that anything can be done about that (and efforts by some engines, including ignoring meta tags etc., are quite annoying).

    1. Re:The effort put in by sites makes up for it. by Anonymous Coward · · Score: 0

      The engine I'm working on doesn't read meta tags. But neither do humans. It also runs a grammar analyzer on the page to determine if the content is real. Bye bye keyword lists.

  10. There are ways... by RFINN · · Score: 1

    My company created lots of dynamic sites with
    dynamic content - without the use of different
    extensions or URLs that contain query strings.
    Apache is awesome (in case you haven't heard)!
    Almost all of our HTML files actually contain
    embedded TCL code, so the servers are configured
    to parse every *.html file - allowing us to use
    the *.html extension for files that have dynamic
    content. We also use things like mod_rewrite
    to send data to a single file that tells the
    file what data to use and how to behave. We
    could have an entire range of sites served out
    by a single file... even making it look like
    they have their own directories, when in
    reality they don't exist.
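The single-file trick described above can be imitated outside Apache. The sketch below is not mod_rewrite itself, just a small Python stand-in for the idea: a table of patterns maps clean-looking paths onto one internal script plus the data it needs. The patterns and the script names `/render.tcl` and `/news.tcl` are hypothetical.

```python
import re

# Hypothetical rewrite table: public path pattern -> internal handler.
# Named groups become the parameters passed to that handler.
REWRITE_RULES = [
    (re.compile(r"^/sites/(?P<site>[a-z0-9-]+)/(?P<page>[a-z0-9-]+)\.html$"),
     "/render.tcl"),
    (re.compile(r"^/news/(?P<id>\d+)\.html$"), "/news.tcl"),
]


def rewrite(path):
    """Return (internal_script, params) for a public path, or None if no rule fits."""
    for pattern, script in REWRITE_RULES:
        match = pattern.match(path)
        if match:
            return script, match.groupdict()
    return None
```

A crawler only ever sees `/sites/acme/contact.html`; the directory `acme` never has to exist on disk because the first rule turns it into a parameter.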

    --
    -- Richard Finn http://www.random-seed.com/
    1. Re:There are ways... by SpdyVkng · · Score: 1

      My thoughts exactly.

      Moderate the parent poster up, way up.

      --
      The Speedy Viking
    2. Re:There are ways... by mcdonc · · Score: 1

      Designing a heavily database-driven site which uses URLs without query strings is very possible.


      One of the unofficial mottoes of Zope is that it gives you "URLs you can read to your mother." This sets it apart from other app servers like Vignette Story Server et al. and many home-grown Perl solutions.


      In Zope, there is really no such thing as a static page. Straight HTML pages are stored as Python objects in an object database and are rendered on-the-fly just like any other object.


      You don't need to tack all that stuff on to the end of your URL. Really.

  11. Not at all searchable by lapsan · · Score: 2

    We've been running across problems related to this in my office (a web design/hosting/advert firm) and, while I'd like to see non-database driven searching of the Internet continue, I have to say that perhaps most people would rather have the database. So many web design clients, expecting that once they have a web site they won't have to advertise in print ever again, are driving the whole thing toward the database method... creating the problem they so love to bitch about.

    Perhaps doing away with keywords entirely, getting search engines to look at the content instead of the "false content" of meta tags... now that would be nice.

    1. Re:Not at all searchable by ChristTrekker · · Score: 1

      So then you put invisible content in the page instead. Same result.

      There will always be a way to "fool" indexing robots if you're creative enough.

    2. Re:Not at all searchable by leper79 · · Score: 1

      correct me if i'm wrong, but isn't google.com actually searching the pages themselves and mostly ignoring the meta tags there.....? i have noticed that they are one of the first sites to not bring up shitloads of porn ads when you search for "free stuff" or something like that....

      --
      403: Forbidden - you do not have permission to access .sig on this server
  12. First post! by Anonymous Coward · · Score: 0

    web suckz!!!!!!

  13. Search engines are useless. by jagz · · Score: 1

    Once one site is found on a certain topic, it will often be linked to many others; you can find a lot of information this way. I only use search engines in extreme cases.

    1. Re:Search engines are useless. by Anonymous Coward · · Score: 0

      Yeah, but how did you find that first site?

    2. Re:Search engines are useless. by SpaceCadet · · Score: 1
      I disagree. The sites are linked to each other, yes, but this creates an "island" of like-minded sites, and, more often than not, like-minded individuals, as well. True research requires finding more than one point of view on the topic.

      What's needed is a search engine that can find ALL pages relating to the topic, and intelligently search through those sites and the sites they link to - their "islands" - to find as much information as possible.

      Of course, while we're at it, it'd be nice if the search engine printed it up in a nice summary format, with references and links, but let's start with the basics.

      --
      -- The meek shall inherit the Earth. In very small plots, about 6 feet by 3.
    3. Re:Search engines are useless. by flux · · Score: 1
      Google does exactly this. I don't actually use any other search engines nowadays anymore - with most queries there's no chance in finding it with for example Altavista, it returns so much crap. Google has a linux-search too.

      Also, according to this, google uses linux, and on many machines too :).

  14. Customers :) by DanaL · · Score: 2

    We had a client once who wanted keywords inserted dynamically into the metatags on his webpage based on query results because he read once that search engines index pages based on the tags. Nothing we could say would convince him what was wrong with that picture.

    Is it even possible to index dynamic pages? They don't really exist until the page is generated. Perhaps the best thing to do for sites that want to be indexed is to make sure they have a plain, vanilla index.html page that contains relevant keywords?

    Dana

    1. Re:Customers :) by justin_saunders · · Score: 1
      Sorry, posted this initially as a new topic :p

      Dynamic pages don't exist until you click on them in your browser either. Search engines *will* follow links to dynamically generated pages.

      The point is there has to be a link there in the first place. Crawlers will not be able to index a dynamic page if it is only accessible through a "form" post.

      The way you can get around this is to have a hidden (to users) page on your site with hardcoded (or database generated) links into the dynamic content that you'd like visible from search engines.

      For example, if you have a whole heap of news articles on your site, with one per page, you can make a dynamic page called "newslinks" which, when requested by a crawler, queries the database and writes links to every news article in the site.
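A minimal sketch of such a "newslinks" page, using an in-memory SQLite database so it runs anywhere. The `articles` table, its columns, and the `/article.cgi?id=` URL shape are assumptions made up for the example, not anything from a real site.

```python
import sqlite3
from html import escape


def newslinks_page(conn):
    """Build an HTML page with one plain link per article in the database."""
    rows = conn.execute("SELECT id, title FROM articles ORDER BY id").fetchall()
    items = "\n".join(
        f'<li><a href="/article.cgi?id={article_id}">{escape(title)}</a></li>'
        for article_id, title in rows
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


def demo_db():
    """Create a throwaway database with two hypothetical articles."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO articles (title) VALUES (?)",
        [("Search engines & databases",), ("Robots.txt ideas",)],
    )
    return conn
```

`newslinks_page(demo_db())` returns a page of ordinary anchor tags, which is all a crawler needs in order to discover articles that would otherwise sit behind a form.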

      cheers, j.

      --

      "My cat's breath smells like cat food." - The Tao of Ralph Wiggum.
  15. diligent searching. by mcrandello · · Score: 1

    First hit google. Then metacrawler. Then try it as a phrase, then add "-" terms to filter useless results. After that ask jeeves, then the imdb, ubl, mp3.com, amazo^H^H^H^H^Hbarnes&noble online, the manufacturer's sites, then give up and ask someone for it on alt.binaries.whatever.

    Short answer, yes. Long answer->I'll find it if given an afternoon or two.


    mcrandello@my-deja.com
    rschaar{at}pegasus.cc.ucf.edu if it's important.

    1. Re:diligent searching. by The+Fonze · · Score: 1

      ok how about this. one site sends out a bot. this bot hangs out on the remote machine, collecting info, or filling up its bag with data, then returns to report its findings. this way, the local remote site can then have rules on what is public info, info for a particular client, or not searchable at all, private. I've been thinking about this lately as a kind of shopping bot. there are lots of potentials, and lots of security issues...anyone for an apache mod?

    2. Re:diligent searching. by Anonymous Coward · · Score: 0

      Google? Please, that site is the worst of the bunch even if it does run on precious linux.

  16. Rethink the way we index? by Anonymous Coward · · Score: 1

    As is, search engines just index raw HTML, with no regard for the actual content of the pages. Perhaps as XML and related stuff begins to proliferate, the indexers of the future will begin to use the extra markup to deduce things about the data that are relevant to the searchers. Certainly it needs to be rethought, because as it is, it's crappier than even searching for text in Emacs. Think of the internet as the world's largest text file and you're trying to find things using a simple search in a myopic text editor that can only see 1/100 of the whole document anyway.

    1. Re:Rethink the way we index? by Buttercup · · Score: 1

      This is the sort of thing RDF was designed to overcome. If search engine providers would agree upon an RDF model for describing content, Web site designers could begin building it into sites. That would allow accessing databases automatically, and the standardized formatting would allow search engines to return queries in novel new presentations, like the AppleSauce plug-in of yore.

      MJP

      --
      Don't try that "protecting the children" shit you people use to keep the tits and bad words off my TV. --Seanbaby
    2. Re:Rethink the way we index? by bolthole · · Score: 1

      This is why "miningco" was started.(now called "about.com" ?)

      the concept being that you get actual people acting as a cache maintainer for useful data, under specific categories. Sort of like a distributed yahoo, the way they used to be before they sold out.

      IF there is an existing category for something you are looking for, then you have a good chance of finding pre-"mined" information there.
      Only trouble is, you are relying on the expertise of the "guide" on that particular subject.

    3. Re:Rethink the way we index? by seaportcasino · · Score: 1

      I agree. The best way to avoid porn spam is to have a human eye look it over before it is published. This ties in with another slashdot article regarding the growing of human eyes in a petri dish. If we could put these eyes to work, say connect them to an old EyeBM Mainframe, we could some day have factories of these eyes filtering out the porn spam.

      Of course, an eye alone could not determine whether a site was a porn site or not. We would have to attach these eyes to something else, say maybe a penis. The EyeBM mainframe could then just run a simple logic test, "If penis is erect, filter out site".

      Of course, an eye and a penis alone could not determine whether a site was a porn site or not. We would have to attach...

  17. Sort of... by Anonymous Coward · · Score: 0

    It isn't too bad if you're looking for obscure
    things; for example if you get a weird error
    message from a Linux utility, or song lyrics.

    But try to search for a device driver and you
    get those "ad bait" sites like driver-forum.com.

    This is a serious problem... there is a big
    opportunity for a search engine that will be more
    selective about keywords and will reject sites of
    dubious value like driver-forum.com

    Mark

  18. My sites get indexed by Anonymous Coward · · Score: 0

    All of my sites are dynamically generated, using PHP3, MySQL, etc., and they do get indexed in AltaVista and alltheweb. I'm not sure about other search engines, but those two find my sites just fine.

  19. Not a problem by Tom7 · · Score: 1

    #1: it's easy to make apache run cgi scripts with any extension you want, so php3 and shtml being ignored shouldn't hurt any site that really wants to be indexed

    #2: technologies like XML may give a standard interface to databases, so that search engines can index databases directly.

    IMO, a much bigger threat to the "searchability" of the internet is the rapidly growing amount of information -- and with it, the amount of misinformation.

    1. Re:Not a problem by Xuli · · Score: 1

      XML, as wonderful and dynamic and potentially-world-saving as it is, is far from standardized. In addition to this, the only serious developments I've seen of any semblance of an XML search engine are being developed and marketed to be used by businesses in their intranets because, well, that's where the $$$ is.... So, in short, my opinion is XML is the only possible savior of exponentially expanding undocumented, unsearchable content, it just needs more attention from the user/consumer community.

      --
      "I'm disrespectful to dirt! Can you see I am serious?"
    2. Re:Not a problem by evilpenguin · · Score: 3

      I hate to do an "amen to that, brother" post, but I'm going to do so.

      Any reasonable search term is likely to present results like "Search returned 417,373 hits. Hits 1-10 displayed." You have to then winnow by adding include and exclude words until you get it down to a manageable 7,422 hits, then you browse them.

      The truth is, I turn to wide searches quite rarely. I tend to find and "bookmark" authoritative sites I find on a given topic and return to those over and over again. It is only when a site grows noticeably stale or I have to research a new topic that I turn, reluctantly, to search engines. As for indexing database sites, I like the idea of extending the robot hack. Slightly less appealing would be to have a new HTML tag to include "bot content" in any page, including dynamic pages. An XML solution is a good idea, but I wonder how long before every extant site gets XML-aware? That plus XML is almost too flexible, making it likely that a hundred competing methods for indexing dynamic pages will appear and no one will know which one to cling to.

  20. Parallel static pages for search engines by Ravenfeather · · Score: 1
    How can you use dynamic, database driven content and still get it indexed into the major serach [sic] engines?

    One obvious possibility is to generate - using the database - a set of static pages as "targets" for the search engines. This could be done weekly or monthly, for example. Each target page would contain a prominent link to the dynamic database-driven front-end of the website, so that searchers could find the site and then quickly get directly to the main front end. Not particularly elegant, but it seems like a reasonable work-around for the time being. The real solution, in the long run, will involve more sophisticated searching and indexing paradigms.

    What do people think about this approach?
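For what it's worth, the approach is only a few lines of scripting. The sketch below assumes the database rows have already been fetched as `(id, title, body)` tuples; the file naming scheme and the "Go to the live site" link text are invented for the example. A weekly or monthly cron job could call it.

```python
from html import escape
from pathlib import Path


def write_static_targets(records, out_dir, front_end_url):
    """Write one static HTML file per record and return the paths written.

    Each page carries a prominent link back to the dynamic front end.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for record_id, title, body in records:
        html = (
            f"<html><head><title>{escape(title)}</title></head><body>\n"
            f'<p><a href="{escape(front_end_url)}">Go to the live site</a></p>\n'
            f"<h1>{escape(title)}</h1>\n<p>{escape(body)}</p>\n"
            "</body></html>\n"
        )
        path = out / f"item-{record_id}.html"
        path.write_text(html, encoding="utf-8")
        written.append(path)
    return written
```

Because the output is plain files with plain names, it sidesteps both the query-string problem and the file-extension problem raised in the original question.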

    1. Re:Parallel static pages for search engines by lesjones · · Score: 1

      isps.com did just that so they would be indexed by the search engines. The Links engine (see http://www.56k.com/links for an example) stores links in a database, but generates static html pages.

      Everyone has been talking about web-wide search engines. Having a mix of dynamic and static content also presents problems for a site-wide search engine, unless you have the scratch or programming savvy to write a local search engine to query both.

  21. dammit I forgot gopher... by Anonymous Coward · · Score: 0

    and webluis. Make that 2-3 days...

  22. Parsing .html with PHP3 not impossible by Balazs · · Score: 2

    You can tweak Apache to parse documents ending in .html with PHP3. You could use .html for generated content and .htm for static pages.

    --
    Computers. You can't live with them, you can't live without them.
  23. Database driven web pages are 'spam' by GaspodeTheWonderDog · · Score: 3

    Yeah, give me a minute to back that statement up. :)

    Honestly though. With something that is inherently dynamic like the internet, it is already near impossible to catalogue and make it searchable. Just to illustrate this take any given news site. Today they might have articles about Clinton, tomorrow it might be news about a big fire. Search engines can't just direct you to those sites based on queries because who knows what data they have.

    Even if a search engine was able to validate the content on every site before it gave you the url it could still change by the time you actually got to see it.

    So quite literally there isn't even a clue of a way to catalogue a database generated web site. Now granted I know there are plenty of sites like Slashdot that eventually the 'content' settles down and becomes static. Still, how are you going to get some stupid program to verify and validate that for *every* dynamically generated web page. I don't think you can.

    The web was created to be open and dynamic and it will stay that way. I've heard people say that maybe there should be *more* interoperability between things like search engines and spiders. This in my mind would do more damage.

    Besides is it so bad that spiders don't get these pages? It probably isn't even reasonable because it would add that much more complexity to the search engine to catalogue what it finds. How do you rank content?

    Anyway... just my 2 cents or so...

    --
    This space for sale
    1. Re:Database driven web pages are 'spam' by CodeShark · · Score: 1
      Great logic in your post. Unfortunately, I don't think it supports your subject assertion, because it does not take into account that some data driven sites are not just "newsy" -- they try to present content in a clear, cohesive manner, using a database to do so. In these cases, the pages are somewhat static, but the site is more like a catalog which keeps growing.

      The best example I can think of is genealogy sites -- as data gets added, it gets put into a hierarchy, but doesn't change much thereafter. Using this example, I may want to find (via a search engine) references to the Wallace B. XYZ family genealogy. Perhaps I contribute a new "branch" to the tree -- is this spam? No, it's information which other genealogical researchers may need to be able to find.

      As always, I welcome further discussion...

      --
      ...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
    2. Re:Database driven web pages are 'spam' by Matts · · Score: 2

      Only really bad db driven web sites use changing content on static URLs - almost every site that runs off a db (take those news sites you mention: cnn, news.com, cnbc, etc) redirects you to the content page, whose URL contains either a unique identifier or an embedded date. I see no reason why that can't be indexed - and in fact it is indexed.

      And yes, even slashdot uses this scheme :)

      --

      Matt. Want XML + Apache + Stylesheets? Get AxKit.
    3. Re:Database driven web pages are 'spam' by atom · · Score: 1

      Of course, the /index.html page of a db driven site would change constantly.

      For instance, www.slashdot.org/ might have an article about Natalie Portman in the root page when the search bot comes by, but it'll be gone in a day.

  24. Internal Site Searches More Difficult as well by peterdaly · · Score: 2

    Not only is multiple-site searching becoming more difficult, but single site searches as well.

    Now most content is stored in a SQL database. While it is fairly easy to search an SQL database, returning the information in usable form is not. This is especially true once you have many types of tables containing many different types of information.

    Currently, the search engine on the site I work on has its own built in forms for information from each type of table, but this method takes a lot of maintenance.

    Another possible way is to point to the page (php3, asp, .pl, .cgi, etc.) which generated the information. But this only works if arguments are not required.

    It is about time someone developed some technology to do "smart searches" of sql data and return useful information without having to write a template for each and every type of data that might be queried.

    I might be off my rocker a little bit on this, but I cannot believe I am the only one experiencing these problems.

    -Pete

    1. Re:Internal Site Searches More Difficult as well by flanker · · Score: 1
      Not sure if this is what you mean, but I know that with AOLserver you can register URLs to Tcl library routines so that dynamic content from a query looks static. For instance, you register the script my_write to handle all requests under the fake directory /query:

      ns_register_proc GET /query my_write

      Then inside the body of the Tcl procedure my_write you parse the rest of the URL and generate the appropriate dynamic content. Let's say you wanted to generate some content based on 2 values. Instead of a GET URL that looked like:

      http://mysite.com/query?val1=x&val2=y

      You would have one that looks like:

      http://mysite.com/query/x/y/dyn.html

      Note that the directory hierarchy and the file name in the URL are completely fictitious (they have no relation to the filesystem of the webserver).
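
      The same idea can be sketched outside AOLserver. This is a minimal Python illustration, not AOLserver code: the helper name parse_fake_path and the parameter names val1 and val2 are assumptions borrowed from the example URLs above.

```python
from urllib.parse import urlsplit, unquote

def parse_fake_path(url, prefix="/query"):
    """Map /query/x/y/dyn.html back to {'val1': 'x', 'val2': 'y'}.

    Hypothetical helper: the prefix and the two-value layout simply
    mirror the example URLs in this comment.
    """
    path = urlsplit(url).path
    if not path.startswith(prefix + "/"):
        return None
    # Drop the prefix, then split the remainder into path segments.
    segments = [unquote(s) for s in path[len(prefix) + 1:].split("/") if s]
    # Expect two values followed by the fictitious file name.
    if len(segments) != 3:
        return None
    val1, val2, _filename = segments
    return {"val1": val1, "val2": val2}

print(parse_fake_path("http://mysite.com/query/x/y/dyn.html"))
```

      The handler never touches the filesystem; the path segments are just another way of spelling the query string.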

      --
      Left shift 1 for e-mail...
  25. Use dynamically generated static pages by seyed · · Score: 1

    Best thing to do is to create static versions of dynamic content that you want to index (like articles etc.) and use scripting to divert non-robots to a dynamic version.

    You can also make those static pages keyword and meta tag heavy without affecting the user experience.
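
    A rough sketch of the static-snapshot half of this, assuming the articles have already been fetched from the database into a dict. The function name write_static_pages and the article-N.html naming scheme are made up for illustration.

```python
import html
import os
import tempfile

def write_static_pages(articles, out_dir):
    """Write one plain HTML file per article so a crawler sees a static URL."""
    paths = []
    for article_id, (title, body) in sorted(articles.items()):
        page = (
            "<html><head><title>{t}</title></head>"
            "<body><h1>{t}</h1><p>{b}</p></body></html>"
        ).format(t=html.escape(title), b=html.escape(body))
        path = os.path.join(out_dir, "article-{}.html".format(article_id))
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(page)
        paths.append(path)
    return paths

# Stand-in for rows fetched from the site's database.
articles = {1: ("Is the Internet Becoming Unsearchable?", "Query strings & crawlers")}
out_dir = tempfile.mkdtemp()
paths = write_static_pages(articles, out_dir)
print(paths)
```

    Rerunning this whenever the database changes keeps the static copies in step with the dynamic ones.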

    --
    "Everybody's Got Something to Hide Except for me and my monkey" - The Beatles "If you're not part of the solution, you'
  26. New method? by ChrisGB · · Score: 1

    I was beginning to think this as well - Yahoo, Infoseek, Hotbot and the like just don't seem to find the good stuff anymore. If there's content held in a database and a page is generated on demand by an active server page or CGI script for example then the page doesn't come into existence until the user requests the information.

    Perhaps it's time for search engines to search by topic and direct to a site related to the enquiry. The individual sites could then have their own search utilities to trawl through their databases? Not sure if this is feasible or not though.

    In terms of good search engines though - Google and AllTheWeb.com seem to find good content whenever I use them. The problem I guess is that you don't know what you're missing until you find it by some other means, and neither do the search engines.

  27. Uses Keywords Luke! by ninoles · · Score: 1
    META tags and PICS-like protocols should be used. However, WYSIWYG editors don't help much, since META tags aren't a visual part of the HTML document.

    I also find that self-registered index sites (like WebRings) can be useful. Maybe a search engine for WebRings (e.g. look up 'Elbereth' on the Tolkien WebRing) would be useful (I have to check whether there is one already).

    Personally, I use specialized index sites (like NewHoo, Linux Life or Freshmeat) when I'm looking for something. Those sites will only become more valuable in the future, IMHO.

    --
    Fabien Ninoles -- Debian GNU/Linux Developer
  28. Searching ineffectiveness by bbqBrain · · Score: 1

    One of my biggest gripes with the current scheme is that so many people abuse it. I once searched for "XOR gate" and pulled up porn sites. If sites could be indexed based on content type, perhaps this wouldn't be such a problem. Currently, these jokers dump the entire dictionary in a meta tag and waste everyone's time by throwing off keyword searches.

    The sheer volume of websites out there makes effective searches difficult. I imagine a search engine could be tuned for better results, but will people be willing to wait while it crunches through data longer than a shoddy counterpart?

    --

    One of the reasons that I became a lawyer was to avoid ever having to hire one. -SPYvSPY
    1. Re:Searching ineffectiveness by Anonymous Coward · · Score: 0
      I do agree with this for the most part. However, there is a growing trend for search engines to attempt to get around this. Sites which use the keyword in the body quite a lot are considered more relevant and given a higher ranking. As well, sites with a large number of keywords are beginning to be considered less relevant than sites with fewer keywords. I guess the assumption is that a page with fewer keywords is more specific than one that covers a wide range of topics or is just spamming.

      Now this creates some problems though as we move away from just plain vanilla HTML. These days the chances are pretty good that you won't just be using HTML in your site. You will probably be using some DB system for dynamic content, cgi, includes, scripts, etc. Not to mention the various multimedia content that is out there too. Flash is a great technology for creating a multimedia enhanced site but good luck indexing that.

      I think that dynamic content can be gotten around in a number of ways just by configuring Apache (you are using Apache right?) in new ways to allow for standard-looking urls which are actually dynamic pages.

      What is a greater problem is that more and more people are adopting new(er) technologies for the web which can't then be indexed by a traditional search engine. Flash is a prime example. Using Flash I could create a multimedia page with an interesting interface (without having to resort to ugly HTML kludges and whatnot) but it won't be indexed. What needs to happen is greater support for newer tags (i.e., <object>) which allow for specifying content to be displayed if the actual content can't be. This allows for great backwards compatibility, and it may solve the indexing problem too.

  29. Search Engine to Search Engine protocol by Anonymous Coward · · Score: 1

    Since most dynamic sites provide their own internal search engines, it seems that a standard Search Engine to Search Engine protocol could help ease this problem.

  30. I seem to be able to find what I need... by Gogl · · Score: 2

    It seems to me that if anything, the internet is MORE searchable than it used to be. I remember some statistic about how a couple of years ago the few search engines that were around only got some small percentage of the web covered anyway. These days it seems the search engines do a better job, and there are a zillion more search engines and also tools that let you search multiple search engines at once. That and the fact that there is just plain a lot more stuff on the net. Back a few years ago, if you searched for Cervantes, the author of Don Quixote, you might find a page or two on some college webpage somewhere, if you were lucky. These days there are enough pages out there that you're bound to find at least one of them that's halfway decent. Anyway, to summarize, keyword searching still seems to work for me. I think that the only way it will get considerably better is when true artificial intelligence is possible. That way, when you ask the computer to find something, it is actually smart and goes out and finds it like a real person. However, it seems to me that true artificial intelligence is a way off....

  31. Indexing dynamic-content sites. by Crafack · · Score: 1

    Today, there are two methods used when a site is added to a search engine database. The first relies on information submitted by the site; the second relies on information (e.g. keyword fields) found by crawlers. As more sites switch to dynamic content, they offer no easy way for a crawler to find information about their content. This could be solved by developing some method for storage and retrieval of that data. For an example, look at how the "robots.txt" mechanism works. /Joakim Crafack
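
    For reference, this is roughly how a crawler consumes the existing robots.txt mechanism, using Python's standard urllib.robotparser. It shows only the existing allow/disallow file, not the content-description mechanism proposed above; the bot name and example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A site publishes this file at a well-known location; the crawler reads it first.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

allowed = parser.can_fetch("ExampleBot", "http://example.com/articles/1.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/cgi-bin/search")
print(allowed, blocked)
```

    A content-description file could follow the same pattern: one agreed location per site, fetched by the crawler before anything else.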

    --
    ... Elegance is left to the implementors.
  32. There are just too many things ... by Anonymous Coward · · Score: 0

    .. just too many things that you can save in databases and publish on the internet in dynamic pages... The internet IS unsearchable because of the amount of data, not because some websites cannot be searched by search engines (including these would increase both signal and noise and not help). What is IMHO needed more is some way to differentiate real information from fluff and spam... but that is still far away (hoping for some advanced AI).

  33. We're not there.... yet by jw3 · · Score: 3
    For my own purposes there is no trouble finding information on the Net - Google, Altavista and a few specialized databases are good enough for me, regardless of whether I'm looking for a pidgin-English dictionary or a protocol for AMV reverse transcriptase. At worst you find a link to an index page with "interesting links" or something. Basic IQ and knowledge of how the search engines work is enough.

    Still, I see a potential threat in information becoming unmanageable and, most of all, in ways of finding information being abused (like using unrelated keywords just to get some visitors). Stanislaw Lem, the Polish SF writer, described this situation in many of his books, starting in the 60s, when no one was even starting to think about such problems. Sooner or later we'll have a large branch of computer science dealing only with searching for information on the Internet; search services are already available, but they are either incomplete or not evaluated. The latter is the key, and Google is the first service I'm aware of which tries to automate evaluation (by counting links pointing to a specific page).
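
    To make "counting links pointing to a specific page" concrete, here is a naive sketch. It is only the plain in-link count described above, not a claim about how Google actually ranks; the page names are invented.

```python
from collections import Counter

def count_inbound_links(link_graph):
    """Return how many distinct pages link to each target page."""
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

# Each key is a page, each value the list of pages it links to.
link_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["c.html"],
}
ranking = count_inbound_links(link_graph).most_common()
print(ranking)  # → [('c.html', 2), ('b.html', 1)]
```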

    There was a lot of talk about "Internet agents" a couple of years ago (I remember an article in Scientific American...) - could some good soul explain to me what the situation is now?

    Regards,

    January

  34. It's been unsearchable by Mark+Edwards · · Score: 2

    I used to make a decent living as an Information Broker - basically, a trained database searcher for hire. Along came the net, and suddenly everyone with a modem could search for themselves. So I wrapped my shingle up, and stored it away.

    These days, there is so much junk and bad indexing, that I may as well put the shingle back out. Almost any search will find mostly commercial sites, unrelated to the search, or completely useless garbage.

    You almost have to be in a bizarre frame of mind to create a good search term these days.

    Mark Edwards
    Proof of Sanity Forged Upon Request

    1. Re:It's been unsearchable by torgo3000 · · Score: 2

      I've had good luck with my search terms. I always use +"this format" for my searches and make sure the phrase is unique enough to bypass keyword-abusers. Google automatically adds the AND operation. I've found altavista and google to give me the best results, but generally the +"technique" has served me well on any search site.

  35. Black holes by slashdot-me · · Score: 3

    I've done some work on a spider and these are the types of pages I spider:
    html htm asp php shtml php3
    I guess I'll add phtml :)

    Other extensions and urls with query strings are ignored. This is mainly for self defense. There are many, many infinite loops and blackholes on the web and they're hard to avoid. For instance, my spider once got stuck on a server that would return the contents of /index.html for any non-existent path. Also, all links on the homepage were relative (not a bad thing) and one was invalid. The call sequence is below.
    GET /index.html
    found foo/broken.html
    GET /foo/broken.html
    webserver couldn't find path, so returns /index.html
    GET /foo/foo/broken.html
    etc.
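
    The sequence above can be reproduced with Python's urllib.parse.urljoin, which resolves a relative link the way a browser or spider would. The host example.com is a placeholder.

```python
from urllib.parse import urljoin

url = "http://example.com/index.html"
visited = []
for _ in range(3):
    visited.append(url)
    # Every response is really the home page, so the same relative
    # link shows up again and resolves one directory deeper each time.
    url = urljoin(url, "foo/broken.html")

for u in visited:
    print(u)
```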

    What was the programmer thinking?

    This is just one example of the blackholes that lurk on the web. It was completely unexpected and pretty difficult to detect. What if someone wanted to write a search engine trap? I don't believe there is a simple solution to this problem.

    Ryan

    1. Re:Black holes by phutureboy · · Score: 1

      Shouldn't you index all files with MIME type text/html or text/plain, regardless of file extension?
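
      Something like this sketch, which looks only at the Content-Type response header and never at the URL. The function name is_indexable and the two-type whitelist are assumptions for illustration.

```python
INDEXABLE_TYPES = {"text/html", "text/plain"}

def is_indexable(content_type_header):
    """Decide from the Content-Type header, ignoring the URL's extension."""
    if not content_type_header:
        return False
    # Strip parameters such as "; charset=ISO-8859-1" before comparing.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in INDEXABLE_TYPES

print(is_indexable("text/html; charset=ISO-8859-1"))
print(is_indexable("image/gif"))
```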

      Just curious.

    2. Re:Black holes by atom · · Score: 1

      One thing you could have done to avoid this is to compute a checksum of the HTML that you downloaded. You can keep a table/database of checksums of every page that you have downloaded so far on a given site. You should junk any page with a duplicate checksum.

      Another thing you should do is record each url that you download and make sure that you don't download the same url multiple times.

      One way that sites screw this approach up is to append a unique session id to each url. You might need to keep track of the session id, or else you might get into an infinite loop of downloading the same page, but with a different session id. The checksum thing might get around this problem.
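
      A minimal sketch of both checks together. The class name SeenFilter is made up, and SHA-256 is just one choice of checksum (any stable digest would do).

```python
import hashlib

class SeenFilter:
    """Remember fetched URLs and page checksums for one site."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_digests = set()

    def should_fetch(self, url):
        """Return False for a URL that has already been requested."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True

    def is_new_content(self, body):
        """Return False when an identical page body was seen before."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_digests:
            return False
        self.seen_digests.add(digest)
        return True

f = SeenFilter()
page = b"<html>same page</html>"
# Two URLs that differ only in session id pass the URL check...
print(f.should_fetch("http://example.com/p?sid=1"), f.should_fetch("http://example.com/p?sid=2"))
# ...but the second copy of the body is caught by the checksum.
print(f.is_new_content(page), f.is_new_content(page))
```

      Note that the checksum only helps if the page body itself is byte-for-byte identical; a page that embeds the session id in its links would still slip through.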

      I'm also writing a spider, but the emphasis is on indexing dynamic pages. (product pages at ecommerce sites).

    3. Re:Black holes by sxxw · · Score: 1
      Search engine "blackholes" are actually fairly common, either those deliberately created by someone who wants to trap spam harvesting bots, or accidentally, through dynamically generated content or the like.

      However, using the URL is not necessarily the way to avoid this. There's no written rule on how the path section of a URL translates to a query, and it's possible to create dynamic content that never uses ? operators. Similarly, there's no requirement on servers to have any correlation between an extension on a URL (such as .html) and the MIME type that they return (which is what you're really interested in).

      To deal with black holes, your best bet is to use some form of depth count on the site that you're indexing - once you've gone down past a certain depth give up. The use of MD5 hashes of content can also help prevent simple recursive trees from being indexed.
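
      A sketch of the depth count, assuming "depth" means the number of links followed from the starting page. The crawl function, the get_links callback and the default limit of 3 are all illustrative choices.

```python
from collections import deque

def crawl(start, get_links, max_depth=3):
    """Breadth-first crawl that stops following links past max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # give up: do not follow links any deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# A fake "black hole": every page links to a page one level deeper.
def endless_links(url):
    return [url + "foo/"]

pages = crawl("http://example.com/", endless_links, max_depth=3)
print(len(pages))  # → 4
```

      The crawl terminates even though the site is infinitely deep, at the cost of missing anything legitimate below the cutoff.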

      Cheers

      Simon.

    4. Re:Black holes by Anonymous Coward · · Score: 0

      In fact this is similar to a trap used to catch e-mail harvesters and contaminate spammers' e-mail databases. I know there is at least one tool that dynamically generates webpages full of spoof e-mail addresses. If the robot follows any link, it is served another completely bogus page. And so on, and so on... In order to fall into the trap you have to ignore the robots.txt file. Bad actions = bad consequences.

    5. Re:Black holes by tagish · · Score: 1

      I would say the onus was on the spider writer to deal with broken HTML, loops, traps &c. I wrote this in response to another posting as a little demo of a spider trap. Although a human visitor will rapidly get bored it's not trivial to see how a robot could reasonably avoid being sucked in. The only general solution I can see is to get the spider to request human mediation if it finds that it's indexing a huge number of pages all from the same domain.

      --
      Andy Armstrong
    6. Re:Black holes by Wokan · · Score: 1

      Have you considered adding .txt to that list? It's certainly static enough for indexing.
      Digital Wokan, Tribal mage of the electronics age

    7. Re:Black holes by paulschreiber · · Score: 1

      You're missing a lot of valid extensions, like stm (for microsoft servers doing server-parsed), cfm (cold fusion). You could also include perl (cgi, pl), python (py), and whatever other crap is out there.

    8. Re:Black holes by andyschm · · Score: 1

      Creating a symbolic directory link into its own directory can also create a black hole, and not due to any error. In fact such a feature can be useful for cookie-based session information storage, since a cookie can be configured to be sent only when the browser is in a certain subdirectory.

      Also, with the Apache webserver one can 'ForceType' a directory to actually be a cgi/php program, and when the program loads it can analyze the directory structure as the source of its dynamic arguments. Thus a dynamically generated page can be based only on directory information, so excluding a url based on normal ?blah=blah cgi variables is not sufficient.

      --
      A W S ----------- QABO : BALA
  36. Domain Names are the Kludge to the Problem by Ron+Bennett · · Score: 1

    Many companies, especially startups, are turning to using catchy domain names as the way to promote their site and products. Even many non-profits and research groups now register domain names that reflect what they do since many people just type in domain names into their browser - and ironically having a domain name actually helps in being indexed by some search engines; one may debate if this is good or bad, but it's a reality.

    Until there's a standard, the search engines will continue to miss more and more of the sites out there. XML may be the answer to indexing and exchanging data. However, on the bright side, the difficulty of finding data makes censorship much more difficult for the censors - and that's a good thing.

    1. Re:Domain Names are the Kludge to the Problem by Anonymous Coward · · Score: 0

      Things like www.bobsfantasyhouseofbarbeque.com... I work with a hosting provider, and some of the domain names these people are coming up with... The only thing funnier is when they try to abbreviate: www.qqqfudosgirinc.com. I mean, it does help with getting indexed, but then you get to the page and it looks like it was made in Publisher, because it *was*... ok, I'll stop bitching now.

    2. Re:Domain Names are the Kludge to the Problem by Logolept · · Score: 1
      The problem is much greater. We have too much information, but not enough data.

      The information exists in some form or another (php, asp, xml, html), but making it findable is extremely difficult, as this Ask Slashdot describes.

      I'd think a better solution would be to organize information from the get-go. Unfortunately, for something like this to work, there would need to be universal standards. XML might be able to bring this, yet agreement on one type of DTD is necessary.

      Take press releases for example. Currently, it would be fairly easy to aggregate those, as they are relatively standard to begin with. Add in product descriptions, datasheets, etc., and it becomes much muddier.

      Wouldn't it be great if information was categorized by those who know best (the author) and then aggregated later?

      --
      _________________________________ he who laughs last is at 300 baud
  37. While they are nice we should already have more by slashdot-terminal · · Score: 2

    Just stop thinking that terabytes are the limit. Get more hardware and more computers. Create petabyte databases. In fact, have millions of petabyte locations worldwide and create a series of multipetabyte databases that one can use.
    Categories are nice, but some (most) sites are personal sites, and these sites change quite often in subject matter.
    While the categories are nice, we should have a community planned and maintained categorical system along with a plain text search. Have identifier tags that go along with every web site, and then have a standalone and a web based version of this program which will allow anyone to create a hierarchical listing of anything according to certain tastes and parameters.

    --
    Slashdot social engineering at it's finest
    1. Re:While they are nice we should already have more by greenrd · · Score: 1
      ...we should have a community planned and maintained categorical system

      That sounds a bit like the Open Directory. Anyone can apply to become a volunteer editor, and any editor can join in the discussions on categorisation (or ontology as we tend to call it). While it is, admittedly, owned by AOL and the paid staff have the final call, in practice they let us make the ontology decisions by consensus most of the time. It's a cool project to be part of!

      according to certain tastes and parameters.

      It would be quite easy to make a customisable Open Directory, which learns from your personal preferences, based on the freely available ODP data (yes, free! It's 100% Open Content!) at http://dmoz.org/license.html

  38. Two-level structure by Kaa · · Score: 2

    I think we are already looking at a two-tiered structure: there are sites (that could be found through standard search engines) and then there are databases/archives inside those sites.

    It is getting more and more so that to find an answer to a somewhat obscure question, I need first to find major sites on the topic, and then do a search through their databases or mailing list archives. I believe this reflects a real-life structuring of the Web and will have to be taken into account by next-generation search engines.

    Kaa

    --

    Kaa
    Kaa's Law: In any sufficiently large group of people most are idiots.
  39. Centralized Searching is the Wrong Approach by douglass · · Score: 1

    I don't think the issue has to do with the ability of centralized search engines to index dynamic pages. I think there is a more fundamental flaw in that idiom.
    The list of problems that exist for centralized search engines goes on and on: dynamic pages (of course), missing/broken/changed links, getting to new pages, and so on.
    What I think could be done is to define a search protocol (perhaps through some kind of search://domain/search+terms method) that is standardized. The global search engines then search by determining the most likely sites to have information for you and querying those sites directly for information. This would fix the problem of broken/missing/changed links being reported, new pages would automatically be available (assuming sites updated their search engines quickly), and if the local search engines are integrated with dynamic page generators (which should be possible) then those pages could be searched too.
    I realize that a lot of work would need to be put into this in order for it to work. A protocol would need to be developed, as well as servers for the protocol. Search engines would have to learn to efficiently decide which sites to query to complete their searches, etc.
    Perhaps a combination of both approaches could yield something even better. All I know is that what is out there right now, well, fails miserably.

    1. Re:Centralized Searching is the Wrong Approach by flux · · Score: 1
      Perhaps this would work:
      - Allow web-servers to receive and reply via udp too
      - Define a protocol for queries (could be something as simple as querying for /cgi-bin/search?foo, but it should be the same everywhere). Perhaps with an extension that client could request a reply always or only in cases when something of interest was actually found.
      - The server could forward the query to other servers it sees fit - thus companies with many web-servers could make their main server to ask every server in the company -> one query would sweep the whole house.
      - Make a list of web-servers that are capable of replying to these requests. I guess the list could be either 'strobed' or done more conventionally with spiders and companies telling about them to the server.

      Now, when someone wants to search something, he does a query to a server that sends the query to all the servers, which could send the reply directly to the client. I guess this would involve a java-client so everyone could easily use it - of course, native clients would be nice too.

      I see the problem with the query-server choking its tube with sending udp-requests to many sites, which is why they should use the already mentioned subquerying on hosts all around the world (which all would send the replies directly to the client). The traffic would be basically only sending with the search server. Unless, of course, the traditional search engine was also used - this would remove burden of udping gazillion little sites.

      Of course, nothing would prevent anyone from sending a query to a server. This approach also makes life difficult for firewalled people. Plus you could get MANY answers, as no server knows how 'good' answers you've already received. But hey, it'd be nice to see in action, worry about these problems later :).

    2. Re:Centralized Searching is the Wrong Approach by Grail · · Score: 1

      What you're talking about is search brokering. Basically, you have one server which acts as a "broker" (or a Meta-Search engine, if you will). It sends queries out to search engines and collates the results.

      The difference between brokering and meta-searching is that each of the search engines in a brokered system categorise and rank their results in a consistent manner.

      Thus the brokering engine can return you a list of results that is meaningful.

      The protocol can be simple HTTP. Instead of indexing a remote site, you just call a standard URL such as http://site.name/cgi-bin/broker-server?...

      The arguments/parameters for this search could be based on the fields used in Dublin Core (or just skip to RFC 2413 - Dublin Core Metadata for Resource Discovery). However, Dublin Core is quickly being converted into a really complicated library-style cataloguing system. Perhaps something else exists that suits the purpose.
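
      The collating step might look like the sketch below. It assumes each site's engine has already been queried and returns (score, url) pairs on a common scale, which is exactly the consistency that brokering relies on; the function name merge_results and the sample scores are invented.

```python
from itertools import chain

def merge_results(per_site_results, limit=10):
    """Collate (score, url) pairs from several engines into one ranked list."""
    combined = chain.from_iterable(per_site_results)
    # Scores are comparable across sites only because every engine is
    # assumed to rank in the same, agreed manner.
    ranked = sorted(combined, key=lambda r: r[0], reverse=True)
    return ranked[:limit]

site_a = [(0.9, "http://a.example/1"), (0.4, "http://a.example/2")]
site_b = [(0.7, "http://b.example/1")]
top = merge_results([site_a, site_b])
for score, url in top:
    print(score, url)
```

      With a plain meta-search engine the scores would not be comparable, and this simple sort would be meaningless.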

  40. Static gateways? by sufi · · Score: 3

    One way round search engines missing query URLs is to write static pages for the purpose of submitting them to search engines. There are many clever ways of having truly dynamic sites without the need for long urls; you just have to put some effort into it.

    Search engines not picking up on php3 is a bit worrying though, all my sites are written purely in php3, although I never seem to have any problems with getting listed.

    Gateway pages are a good way of making sure you get listed with the keywords you want, although they aren't very dynamic and, unless you get really clever, don't tend to reflect the contents of a regularly updated site... however, it seems to me that you can only really hope for *a* listing these days, not an index of all of your site.

    Even Google has a 3 month disclaimer on its submit page; that's a mighty long time if you are looking for support on a brand new motherboard.

    LASE seems to be the way to go... subject-specific full text indexes which spider regularly and can index specialised data, keeping it up to date.

    However you would still need a search engine to find a LASE that will get you what you want, but at least it's a bit more structured!

    There are many ways round the search engine problems, and keeping on top of it is a full time job, Submit-it doesn't come close, that hasn't changed in the past 3 years, Search engines however have!

    IMO a combination of all of the above will get you where you want. Keywords and Meta Tags still count, and you have to be persistent.

  41. fault bad browsers and no index of quality by _Doug_ · · Score: 1

    In my opinion, part of the fault lies with the browsers, which poorly handle caching dynamic content, regardless of whether it is on a remote webserver or a local drive. I, for example, am forced to add a useless query string to the end of local file URLs so that all browsers will work. Browsers are notorious for ignoring no-cache pragmas and expiration dates.

    The most common way though people find out about worthy dynamic content sites I think is word of mouth. We could use more forums and link referrals to share websites we have found useful. This has the very distinct advantage over search engines of providing a better filter of QUALITY of information. After reading someone's recommendation of slashdot or an article elsewhere, I won't have to hurdle 19 irrelevant hits to get there.

  42. The Open Directory Project by side_ways · · Score: 2

    The Open Directory Project, managed by dmoz.org, is an open source effort to create an organized index of the internet through volunteer work. Currently there are 20,000+ volunteers working on the project. This is a way cool idea that we should all support.

    1. Re:The Open Directory Project by ragnar · · Score: 1

      Let me say first off that I agree with you, but I am a bit miffed that they turned me down to volunteer for them to add resources to the Solaris section. I figured running a prominent Solaris news site for over a year qualified me, but they declined. Go figure... Indexing by intelligent people is still probably the best solution.

      --
      -- Solaris Central - http://w
  43. XML? by Matt2000 · · Score: 2

    I read a while back that meta data for sites would eventually move to an XML based standard which would accurately describe the content of the site?

    Whatever happened to that? I don't mind all that much being taken to the front page of a site if I know that site has the information somewhere in there, I just hate having to hit seven sites to find that one.

    Hotnutz.com

    --

    1. Re:XML? by sxxw · · Score: 2
      You're thinking of RDF, the W3C's language for embedding metadata into XML (and by extension XHTML) content. This is great for page-specific information (such as Dublin Core metadata), and can also be used to provide metadata about collections (such as a set of web pages, or an entire site).

      However (there's always a however) there's the metadata catch. If you divorce metadata from content, then it becomes easy for site admins to lie in their metadata in order to attract visitors. Remember the keyword spamming that used to occur? Now, imagine if that's extended to being able to lie completely about the content of an entire site. Unless you're in an environment where you can trust the providers of your metadata, by and large you're in trouble.

      Cheers,

      Simon.

    2. Re:XML? by Anonymous Coward · · Score: 1

      You may be thinking of The Dublin Core.

  44. Challenges for searching the web..... by chirayu · · Score: 2

    I have been thinking about the working of a search engine lately and this post just comes at the right time.

    Some of the challenges which will be faced in searching the web in the future will be:

    1. Displaying matching URLs as well as links which match the type of content. This is important. If I search for "throat infection" on a search engine, then apart from the pages which mention "throat infection", the engine should give me links to drkoop.com, webmd.com (AFAIK, these sites do not allow search engines to copy their content) and so on.
    Search engines will have to maintain huge databases linking words to categories. And with the proliferation of the internet, the number of sites carrying content and disallowing search engines is going to increase. Search engines need an intelligent way to get around this.

    2. Search engines will need to "help" users with their searches. For example, if I just search for "throat", the search engine should have a helper section where it can ask me more: whether I am searching for "throat infection" or "study of the throat" and so on.

    3. Search assisted by humans. This is also one of the concepts picking up these days. Basically you submit a question, some person searches the web for you, and you get your answer in a few hours/days. Check out www.xpertsite.com.

    4. Tools for better maintenance of bookmarks. I for one usually bookmark all relevant stuff and then spend a full weekend arranging the bookmarks so that I can find the relevant stuff quickly. The current bookmarking scheme is very primitive, causing a lot of users to "reinvent the wheel" (searching for URLs which are already bookmarked).

    Phew!

    I'll jot down more thoughts later. Gotta work now.

    CP

    1. Re:Challenges for searching the web..... by TheShrike · · Score: 1
      I heartily second the suggestion for better maintenance of bookmarks. Netscape's single-pane bookmark editor just doesn't cut it. A two-pane editor (folders in one, URLs in the other) would be a vast improvement (are you listening, Mozilla?). Then, add the ability to search your bookmarks without opening the edit window, with the results displayed in the browser as links.

      As far as indexing the web goes, until the robots/spiders get much more sophisticated, I don't see much hope. Take a look at just about any web page showing up in the top 10 of an Altavista Search. You'll see that the author has spammed the index. The situation which will continue to cycle is that as search engines become more sophisticated, authors will figure out how to exploit the rules, and spam the index. As much as I like Google, I tend to stay with Altavista, because I can easily exclude sex/porno terms from any search, using the minus sign operator. I have to do this with almost all web searches I do.

      I say index only on META tags, and ignore any over 80 characters. This will help eliminate bogus hits from pages which include half the dictionary at the bottom of a page.
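
      A minimal sketch of that rule in Python, under one reading of it: only `<meta name="keywords">` tags are indexed, and any such tag whose content runs past 80 characters is skipped whole. The names `MetaKeywordParser` and `index_terms` are made up for this illustration, and real engines may weigh META tags quite differently.

```python
from html.parser import HTMLParser

MAX_META_LENGTH = 80  # the cutoff suggested in the comment above


class MetaKeywordParser(HTMLParser):
    """Collects keywords from <meta name="keywords"> tags and nothing else."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "keywords":
            return
        content = attr.get("content") or ""
        if len(content) > MAX_META_LENGTH:
            return  # assumption: an oversized tag is index spam, skip it whole
        self.keywords.extend(
            word.strip().lower() for word in content.split(",") if word.strip()
        )


def index_terms(html):
    """Return the list of index terms for one page (body text is ignored)."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords
```

      Because the body text is never looked at, a dictionary dumped at the bottom of the page contributes nothing to the index.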

      Another thing the search engines could do is figure out how to ignore "trolling" pages. i.e. those which are nothing but index spam, a catchy title, and a refresh tag to ship your browser off to fetch their actual main page (usually porn). Eliminating these "waste" pages from the index will make a search of it faster, and present the user with fewer irrelevant pages.

      --
      If R is the set of all sets which don't contain themselves, does R contain itself?
  45. XML by pos · · Score: 2

    I was just about to ASK SLASHDOT about XML. XML will solve the search problem (or at least help make it better). Working drafts of XML have been drawn up by the W3 Consortium, and XLINK, XSL, etc. are coming. There are almost no XML applications available yet, though!!!!! Most of what is available is in Java. This is a field where Linux could be leading the pack, but is instead an example where I think we are lagging behind. (I hope someone can point me to a group that is bringing XML deep into the linux os)

    I want to know if Linux is on top of this. Microsoft has an XML notepad available and I hear that it's going to be all over Win2000 (in the registry even). XML will be the foundation of the new internet, and we don't want Microsoft to have a technology edge there, do we? Perl has XML modules, as I am sure other languages do too (Python). Let's get some apps written!

    What about Gnome and KDE? This could help make their projects easier. Especially KDE, with all of the object similarities between Corba and XML and Object RDBs. All config files could theoretically be stored in XML. We need to push this one, people!

    -pos


    --
    The truth is more important than the facts.
    -Frank Lloyd Wright
    1. Re:XML by Digital+G · · Score: 2

      take a look at http://xml.apache.org


      --

      End Transmission....
    2. Re:XML by Simon+Brooke · · Score: 3
      This is a field where Linux could be leading the pack, but is instead an example where I think we are lagging behind. (I hope someone can point me to a group that is bringing XML deep into the linux os)

      Not so, fortunately. A certain very large telco (which I'm not yet allowed to name) is now running its Intranet directory on an XML/XSL application which I've written. The application was developed on Linux and is currently running on Linux, although the customer intends to move it to Solaris.

      My XML intro course is online; it's a little out of date at the moment but will be updated over the next few months.

      XML and particularly RDF do have a lot to offer for search engines - see my other note further up this thread.

      --
      I'm old enough to remember when discussions on Slashdot were well informed.
    3. Re:XML by pos · · Score: 1

      As DigitalG mentioned, there appears to be an xml.apache.org, and it is good to see that you successfully implemented an XML app on Linux. These are good starts. Perhaps you can point me to a resource where I can find applications (in the XML terminology) or schemata for things like configuration files of Intel systems, Linux apps, etc...

      It seems the only languages that have been written in XML are things like MathML, CML and others. We can use the XML meta language to describe a Linux system. All we need is to find the common elements and represent them in an XML language. It will self evolve from there.

      Maybe I'm missing something but, I think this is really important. Is it harder than this? (I'm sure) Is it not worth it?

      -pos

      --
      The truth is more important than the facts.
      -Frank Lloyd Wright
    4. Re:XML by Anonymous Coward · · Score: 0

      Try checking out www.everything-xml.com
      There's plenty on XML apps there. Some, with some work, could certainly help the searching/indexing problem.

  46. You mean like Sherlock? by skew · · Score: 2

    The problem with dynamic content is that you pretty much have to query the target web servers at the time the user enters the search request.

    One solution that attempts to address this is Apple's Sherlock. It uses XML to pass queries to web sites and return results. There are certainly some limitations: you have to choose which web sites you want to search (although this isn't always a bad thing), these web sites have to support Sherlock queries, and it only works on the MacOS. Currently lots of big name and Apple-specific sites support it.

    The dev info at Apple is pretty clear though. It wouldn't be difficult for others to create clones for Sherlock that either work over a web interface or on other OSes too. (dunno if Apple could...or would... make any claim against this)

    Scott

    --

    You can't study the darkness by flooding it with light. --Edward Abbey

    1. Re:You mean like Sherlock? by Disconnect · · Score: 1

      Sherlock exists for *nix as well - FreshMeat has a page on it...
      /*He who controls Purple controls the Universe. *

      --
      www.gotontheinter.net
      Updated vaguely once a whenever, maybe once a whenever-and-a-half.
    2. Re:You mean like Sherlock? by skew · · Score: 1

      Cool. I should have assumed as much and checked freshmeat first... :-)

      --

      You can't study the darkness by flooding it with light. --Edward Abbey

    3. Re:You mean like Sherlock? by Anonymous Coward · · Score: 0

      Mozilla uses sherlock files in its Search the Internet window. You can just download a .src file and a thumbnail, place it in the proper directory and it shows up in mozilla. (1 day to the Mozilla alpha!)

  47. Is searching DBs really necessary? by LordSaxman · · Score: 1

    No technology is going to read your mind - you're limited by language, and that can be interpreted and misused in multiple ways. This includes search applications (e.g. keywords in porn sites). Word misuse will never stop (ask Plato or Burke) so we're just going to have to deal with it.

    Eventually, the *end user* has to do the information filtering, so you might as well take what you can get FAST so you can move on if you don't see what you need. Indexing every database or dynamic page on the web would slow down engines to a crawl. Do you honestly want Altavista bringing up books from Amazon, companies from the Thomas Register, and patents from the USPTO? There's no need for this. If you want specialized information, go to a specialized source.

  48. Spider traps ... by charlie · · Score: 4
    Many years ago (1994? 1993?) I wrote a web spider. (Crap back end, though, so I dropped it. The bones are on my website.)

    Some time later, it occurred to me to try and monitor the efficiency of web indexing tools using a spider trap.

    The methodology is like this:

    • Write a Perl module (or equivalent) that generates realistic-looking text using Markov chaining based off a database. Text generated should be deterministic when seeded with a URL.
    • Write a CGI program that uses PATH_INFO to encode additional metainformation. Have it eat the output from the text generator and insert URLs that point back to itself, with additional pathname components appended.
    • If the spider follows a link it will be presented with another page generated by the CGI script, containing text generated by it in response to a hit that differs in a repeatable manner from the text in the original page.
    • Child pages should contain links that point inside the web site; you could do this by making the CGI program the root of your "document tree". Better yet, run multiple virtual servers and include URLs bouncing between the domains -- all of which are mapped onto the same script.
    • Stick this thing up on the web and wait for the crawlers to come. They will see a tree of realistic-looking HTML with internal links, digest, and index it.
    • You can now analyse your logs and monitor the robot's behaviour (e.g. by changing the type, frequency, and destination of links your text includes). You can also search the search engines for references back into your document tree and work up some metrics to measure just how accurately it's been indexed (e.g. by re-generating the text of a page and feeding it to the search engine and seeing what comes back -- which words are indexed and which are ignored).
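
    The first three steps above can be sketched roughly as follows. This is an illustration in Python rather than the Perl the post mentions; `CORPUS`, `build_chain` and `generate_page` are hypothetical names, and the tiny word list stands in for the "database" of source text. The key property is that the page is a pure function of the request path, so regenerating it later for comparison against a search engine's index gives the same text.

```python
import hashlib
import random

# Hypothetical stand-in for the "database" of source text.
CORPUS = (
    "the spider follows a link and the server returns a page "
    "and the page contains a link that the spider follows again"
).split()


def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain


def generate_page(path_info, chain=None, length=30, links=3):
    """Return the same pseudo-random HTML every time for a given path."""
    chain = chain or build_chain(CORPUS)
    # Seed from a hash of the path so the output is repeatable across runs.
    digest = hashlib.sha256(path_info.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    word = rng.choice(sorted(chain))
    words = [word]
    for _ in range(length - 1):
        successors = chain.get(word)
        word = rng.choice(successors) if successors else rng.choice(sorted(chain))
        words.append(word)
    # Links point back into the trap, one path component deeper.
    base = path_info.rstrip("/")
    anchors = [
        '<a href="{}/{}">{}</a>'.format(base, rng.randrange(10**6), rng.choice(words))
        for _ in range(links)
    ]
    return "<html><body><p>{}</p>{}</body></html>".format(
        " ".join(words), " ".join(anchors)
    )
```

    A CGI wrapper would pass its `PATH_INFO` value straight to `generate_page`; that wrapper and the virtual-server bouncing are left out here.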

    Anyone done this? I'm particularly interested in knowing how spiders handle large websites -- have been ever since I was doing a contract job on Hampshire County Council's Hantsweb site a few years ago and caught AltaVista's spider scanning through a 250,000 document web that at the time had only a 64K connection to the outside world. (Do the math! :)
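
    For anyone who wants the math done for them, a back-of-the-envelope version in Python. Only the document count and the "64K" figure come from the post; reading "64K" as 64 kbit/s and taking 10,000 bytes as the average document size are assumptions made purely for illustration.

```python
DOCUMENTS = 250_000            # from the post
LINK_BITS_PER_SECOND = 64_000  # assumption: "64K" means 64 kbit/s
AVG_DOCUMENT_BYTES = 10_000    # assumption: average page size, not from the post


def crawl_seconds(documents, avg_bytes, bits_per_second):
    """Time to pull every document once over a fully saturated link."""
    return documents * avg_bytes * 8 / bits_per_second


seconds = crawl_seconds(DOCUMENTS, AVG_DOCUMENT_BYTES, LINK_BITS_PER_SECOND)
days = seconds / 86_400
print("{:.1f} days of a saturated link".format(days))
```

    Under those assumptions a single full crawl ties up the whole link for several days, with nothing left over for real visitors.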

    1. Re:Spider traps ... by Anonymous Coward · · Score: 0

      I know that companies do this sort of thing. Of course, if you have thousands of hits on your webpage, a scalable spider entrapment algorithm could be quite interesting...

    2. Re:Spider traps ... by tagish · · Score: 1

      Like this for example ;-)

      --
      Andy Armstrong
    3. Re:Spider traps ... by Nygard · · Score: 1

      Considering the disparity between your own bandwidth and that of any major search engine's spider, I think this would be an excellent way to conduct a denial of service attack on yourself.

      --
      "Genius may have its limitations, but stupidity is not thus handicapped." --Elbert Hubbard (1856-1915)
  49. It's not obsolete... by Seth+Scali · · Score: 1

    I think that, for the most part, the databases are doing their job rather well.

    Where do you find the most dynamic content? News sites. Slashdot, Freshmeat, Linuxtoday, Yahoo! News, etc. These are the sites that need dynamic content.

    Ironically, these are the exact sites that search engines are pretty much not interested in indexing, anyway. Even assuming that a database can update all its sites once per day, that means that the information is a day old-- centuries, in Slashdot time! People don't go to AltaVista to search for the story over at ABCNews.com. They go to AltaVista to find information about international child custody laws (to name a random hot issue of late).

    Most of your general information stuff is pretty much static. This is what the search engines look for anyway-- this is the stuff that doesn't change often, so it's good stuff to record. Why would anybody bother to make a page about Cup 'O Noodles that's generated through a Perl script? It's too tough, and can be a huge pain in the ass to change it.

    Why index the pages that are constantly changing, when the stuff you're looking for (by definition) doesn't change much? Sure, there's overlap (small sites that generate the exact same content every time). But it's such a small segment that hardly anybody would miss it (yes, it may be important, but not important enough to totally revamp the indexing procedure).

  50. Indexing dynamic content (was Re:Customers :) by Simon+Brooke · · Score: 1
    Is it even possible to index dynamic pages? They don't really exist until the page is generated.

    Yes, for a very large category of dynamic pages, it is. For example, in an online shop, the actual number of a particular product in stock at the moment may vary from minute to minute, the price of that product in the user's preferred currency may change from week to week, but the product itself doesn't change much over months or years. It makes perfect sense to index the product page, because although some of the contained data may be transient, a great deal more is not.

    Or take another example: the weather forecast for a particular area. The forecast itself may change regularly, but the page always contains a current forecast and that fact is worth indexing. The best technology available for this sort of thing is probably RDF and the Dublin Core metadata specification. Of course, the search engines still have to be persuaded to take heed of this...

    --
    I'm old enough to remember when discussions on Slashdot were well informed.
  51. Two ways things might go by jd · · Score: 2
    There is nothing to stop the web server from sending different pages, depending on whether it's a regular user or a web crawler.

    Therefore, it would be entirely feasible to have a system in which regular users saw regular pages and web crawlers saw a "static" index page, all at the same URL.

    This would allow web crawlers to index according to genuinely useful keywords, rather than by how the crawler's writer decided to determine them.
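
    A rough sketch of the first approach, in Python. The marker substrings and the function names `is_crawler` and `choose_page` are assumptions for illustration only; crawlers identify themselves in many different ways, and a real server would need a maintained list.

```python
# Hypothetical substrings; real crawlers identify themselves in many ways.
CRAWLER_MARKERS = ("bot", "crawler", "spider")


def is_crawler(user_agent):
    """Guess from the User-Agent header whether the client is a web crawler."""
    agent = (user_agent or "").lower()
    return any(marker in agent for marker in CRAWLER_MARKERS)


def choose_page(user_agent, regular_page, index_page):
    """Serve the keyword-rich index page to crawlers, the normal page to people."""
    return index_page if is_crawler(user_agent) else regular_page
```

    The same URL then yields either body depending on who asks, which is exactly what makes the scheme easy to abuse as well as to use.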

    An alternative approach would be to distribute the keyword database. Since all the web servers have the pages in databases of one sort or another, it should be possible to do a "live" distributed query across all of them, to see what URLs are turned up.

    This would be a lot more computer-intensive, and would seriously bog down a lot of networks & web servers, but you'd never run into the "dead link" syndrome, either, where a search engine turns up references to pages which have long since ceased to be.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  52. The searchers and the searchees are ever-changing by Zigg · · Score: 3

    I think the real problem with searching really isn't that the Internet is growing too large. The central problem with it being too hard to find information is due to the unfortunately ever-changing nature of HTML. (Yes, I know there are much better solutions out there -- I work with some of them on a daily basis. However, we seem to presently be stuck with HTML and its variants.)

    It's a self-feeding monster, whose typical cycle goes as follows: SearchEngineInc (a division of ConHugeCo) creates a new technology that really impresses people with its ability to find what they want more quickly. (Right now SearchEngineInc is probably Google, at least in my view.)

    Once the new technology takes root, content authors (well, maybe not the authors so much as their PHBs) note that SearchEngineInc doesn't bring their business (which sells soybean derivatives) to the top of the search list (when people type ``food'' into the search engine). Said PHBs make the techies work around this ``problem'', and all of a sudden SearchEngineInc's technology isn't so great anymore because the HTML landscape it maps has changed.

    A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.

    What's the solution to this? I'm not quite sure. Obviously there are better technologies out there that are at least immune to PHBs' sense of ``aesthetics'' but I would wager few of them are immune from hackery. I'd say that search engine authors are doomed for all time to stay just one step ahead of the web wranglers. At least it assures them that their market segment won't go away any time soon. :-)

  53. Unsearchable? Possibly... by Shadowcat · · Score: 1

    I have to say, yes. I believe that with the way the internet is growing, it's difficult to keep up with new pages and new technology. I know there have been several times I have done searches only to turn up nothing when I KNOW it's there or to turn up too much which pertains to nothing I'm looking for. Most of the more mainstream search engines have become obsolete, I'm afraid. Many of them use methods that just simply aren't practical like searching for certain words in the text of a page. When you search for things like that your searches will not be accurate and often you'll get information you don't really want or need.

    So, I believe the internet is outgrowing the current search engine technology.
    -- Shadowcat

    --

    kageneko@kageneko.net

    "I can roleplay. I can frag. I can PK while you lag."
  54. Computer indexing too primitive (for now...) by Outlet+of+Me · · Score: 1

    I think it is just the plain and simple truth that the searching algorithms all of the search engines use currently are not suitable for the task. I will perform searches on what I think are pretty obscure terms and return >10,000 hits on some of these search engines. Of course, none of them mean anything to me.

    I'm not saying that this problem won't be figured out at some point. It's going to take a little more technology than we have right now, but no doubt it's on its way even as we speak. (Any AI experts out there? :)

    Until then, indexing by hand seems to be the only 100% solution. Humans are fallible, but much less than the machines are at this present stage. Plus, directories geared towards specific topics would help narrow down your search before you even start searching.

    1. Re:Computer indexing too primitive (for now...) by Anonymous Coward · · Score: 0

      Yes, that is why Open Directory is a good resource. The open source, human-edited search engine: dmoz.org superant@usa.net

  55. Hide the query string by grinder · · Score: 1

    There is no excuse for having a purely database-driven website that does not appear to be straight HTML pages. If you have ?s everywhere then you're just lazy.

    Firstly, even though you might pull everything out of a database, a large per cent of all such content is not really all that dynamic, which means you're probably better off precompiling the base down into static HTML, and recompile the page only when its content changes.

    Secondly, if you have a script with a messy query string you can turn it into something that doesn't look like a script at all, e.g., /cgi-bin/script.cgi?foo=bar&this=that could be presented as /snap/foo/bar/this/that.

    With Apache, you would just define and pass it off to a handler, that would pick up the parameters in the PATH_INFO environment variable. If people tried URL surgery, you could just return a 404 if the args made no sense.
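
    What such a handler might do with PATH_INFO, sketched in Python. The function name `parse_path_info` and the whitelist of parameter names are assumptions for illustration; the example path is the one from the post, with the leading /snap already consumed by the server.

```python
def parse_path_info(path_info, allowed=("foo", "this")):
    """Turn '/foo/bar/this/that' into {'foo': 'bar', 'this': 'that'}.

    Returns None when the path makes no sense, so the caller can send a 404.
    """
    parts = [part for part in path_info.split("/") if part]
    if not parts or len(parts) % 2 != 0:
        return None  # keys and values must come in pairs
    params = {}
    for key, value in zip(parts[0::2], parts[1::2]):
        if key not in allowed or key in params:
            return None  # unknown or repeated key: someone tried URL surgery
        params[key] = value
    return params
```

    The resulting dictionary can be handed to the same code that used to read the query string, so the script itself barely changes.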

    Search engines are your best (and probably only) hope of getting people in to visit your site. It's up to you to make sure your URIs are search-engine friendly. If they can't be bothered to index what looks like a CGI script, well that is your problem. There are more than enough pages elsewhere for them to crawl over and index without bothering with yours.

    1. Re:Hide the query string by Anonymous Coward · · Score: 0
      If they can't be bothered to index what looks like a CGI script, well that is your problem.

      I'm astounded to see this attitude--hopefully it's not shared by most /.ers. "Even though you, the web author, have followed established standards, search engines have arbitrarily chosen to ignore some of those standards. Yet this is your problem to work around."

      Compare with:
      "Even though HTML is standardized, this page uses non-standard tags. If your browser doesn't support them, that's your problem." (This attitude is all too prevalent on the net in general, but I hope that /.ers have a better understanding of why we have standards in the first place.)

  56. False positive hits. by Lord+Kano · · Score: 3

    It disturbs me that so many pron sites have hidden in their html code (and sometimes not even hidden) huge lists of adult film stars just to get hits from search engines.

    If you do a search for Cortknee or Lotta Top you'll get a bazillion hits and 90%+ of them are "Click here to see young virgins having sex for the first time on their 18th birthday!"

    As we all know, but nobody likes to admit, pron is the fuel that makes the net go 'round.

    Many other sites have taken hints from the pron people. I'm sure that it was a deal of some sort, but every time I search for anything on metacrawler I get a link to search a certain bookstore for books on the same topic.

    Commercialism and shady practices are what are making the net so hard to search.

    LK

    --
    "Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
    1. Re:False positive hits. by cleopatra · · Score: 1
      I completely agree with you... it is annoying to get a false positive when you are searching... especially if you are like me and always in a hurry. *laughs*

      I think the important thing to realise though, is that the web is not unlike many other forms of media (and really... people are doing this with their sites, because they want attention --whether it is just to read their content or buy their product):

      • When is the last time you picked up a magazine (almost any one will do) and you didn't have to finish reading your 10 page article by flipping through 5 pages of advertising first? (And it is certainly not always relevant advertising!)
      • When is the last time you read the newspaper, and mangled in with the business section are ads for upcoming theatre events or (and my city's newspaper is famous for this) put the stock quotes or movie listings in the middle of the classifieds...?
      • When is the last time you were watching TV and a commercial came on for a new show or movie... showing you the 2 minutes of amazing special effects or steamy sex scenes... and you go to the movie only to find out those were the only 2 minutes?

      I'm not picking on you specifically... because I agree with you... I just wanted to point out how this sort of thing was inevitable.

  57. mod_perl should be helpful to you by Anonymous Coward · · Score: 0

    With mod_perl, you could create a system that analyses the requested URL and makes a database query. You could hide a database behind something like www.webzine.org/articles/section/109944.html. On your server, no actual file called 109944.html would exist, but a request for that file would tell your server to fetch record 109944 from the database.
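
    The idea is not tied to mod_perl. A minimal sketch in Python with an SQLite database, assuming a hypothetical `articles` table with `id`, `section` and `body` columns; the URL pattern and the function name `lookup_article` are likewise made up for this example.

```python
import re
import sqlite3

# Assumed URL shape: /articles/<section>/<numeric id>.html
ARTICLE_URL = re.compile(r"^/articles/(?P<section>[a-z0-9_-]+)/(?P<id>\d+)\.html$")


def lookup_article(db, path):
    """Return the article body for a static-looking path, or None for a 404."""
    match = ARTICLE_URL.match(path)
    if match is None:
        return None
    row = db.execute(
        "SELECT body FROM articles WHERE section = ? AND id = ?",
        (match.group("section"), int(match.group("id"))),
    ).fetchone()
    return row[0] if row else None
```

    To a crawler the path is indistinguishable from a file on disk, since there is no query string and the extension is plain .html.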

  58. Searching... by Pollux · · Score: 3

    Okay, I just got done with my research paper for college last week, and although I can pull a paper out of some orifice of my body, researching is always a pain.

    Our library has a wonderful online database where you can type in keywords and search for them, but the keywords only look as far as the Title, Author, or abstract of the book. If you want to look up some narrow topic, you can't expect that there are books written exactly on that topic, but there are always bound to be a few books out there that have a few pages dedicated to that subject (which isn't listed in the abstract). So, what do you do? You have to get your hands dirty.

    My topic: Holy Wisdom (I won't bore you with details, but just stick with the subject). Looking in the online database, I find that there are zero books on the subject. Darn. Let's do some lookin...

    After I read in a few Religion Dictionaries, I find that Holy Wisdom is also called "Sophia." I go back to the catalog, type in "Sophia," and I get one book. I skim this one book, and find that Sophia has sometimes been associated with the Holy Trinity. So, I go back to the catalog, enter "Holy Trinity," and BOOM, I get back 400 results (anyone seeing a similarity here...). Let's limit them...we'll search within the results for "History of," and I get back about 11 results. I read the abstracts, find a few books of interest, and start skimmin...

    ...Well, whadda know, there's a page in one book that talks about Sophia, and half a chapter in another book that talks about Sophia as well. There's a few more sources for the paper!

    Now, for those of you who just don't understand what I'm trying to say here, just read from here on, cause here's my point: Computers aren't smart enough yet to "guess" at what we want, and personally, I don't think they ever will. Internet keyword searches are just like asking someone to help you who has no idea what your topic is...they can only search for what you ask them to search for.

    Internet keyword searches are a hassle, and many times the first few returns won't be anything CLOSE to what you want (search for "Computer Science," you get back porn, search for "Linux," you get back porn, search for "White House,"...). But if you learn how to dig, like the people who lived fifty years ago WITHOUT Boolean Searches, you'll find what you're looking for. Sometimes, it's just like searching for a topic...you might not find anything directly, but you can't sum up an entire book in just a paragraph either!

    Try some links, look around, and it'll be there!

  59. PHP / Dynamic Pages Are Indexed by waldoj · · Score: 2

    Many of my sites are database-driven sites that run on PHP and MySQL. No problem with indexing, and no problem with the file extensions.

    If you can get beyond the backend concept of a dynamic page, most pages really appear to be quite static, from an indexing perspective. An HTTP-based indexing system (as opposed to filesystem-level) can't tell that pages are dynamic, and doesn't care.

    I've never had a problem with search engines failing to index pages just because they had a convoluted URL. If some engines do that, it's a bloody shame.

  60. yep. by justin_saunders · · Score: 1
    Dynamic pages don't exist until you click on them in your browser either. Search engines *will* follow links to dynamically generated pages.

    The point is there has to be a link there in the first place. They will not be able to index a dynamic page if it is only accessible through a "form" post.

    The way you can get around this is to have a hidden (to users) page on your site with hardcoded (or database generated) links into the dynamic content that you'd like visible from search engines.

    For example, if you have a whole heap of news articles on your site, with one per page, you can make a dynamic page called "newslinks" which, when requested by a crawler, queries the database and writes links to every news article on the site.
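
    A sketch of such a "newslinks" page in Python, assuming a hypothetical `news` table with `id` and `title` columns and article URLs of the form /news/<id>.html; none of those names come from the post.

```python
import sqlite3
from html import escape


def render_newslinks(db):
    """Build one HTML page that links to every news article in the database."""
    rows = db.execute("SELECT id, title FROM news ORDER BY id").fetchall()
    items = [
        '<li><a href="/news/{}.html">{}</a></li>'.format(article_id, escape(title))
        for article_id, title in rows
    ]
    return "<html><body><ul>\n{}\n</ul></body></html>".format("\n".join(items))
```

    Since the page is generated on request, it never needs maintaining by hand: a new article shows up as a new link the next time a crawler fetches it.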

    cheers, j.

    --

    "My cat's breath smells like cat food." - The Tao of Ralph Wiggum.
  61. Meta-Engines by Paradox+!-) · · Score: 1

    IMHO we're already seeing the advent of meta search engines that do their own search and then do a simultaneous search using other engines. (Yahoo does this, I think, as does lycos/hotbot) That's a great kludge for these engines to extend their reach, but not a real solution.

    I think we'll see more topic-specific search engines (I use trade rag sites exclusively for really good info on tech news, for example) linked together through the big search engines. The main engine (Google, or whatever) will check the search term to see if that term has been pre-linked by the engine managers to generate a search on a more topic-specific engine (for example a search on "market size" may cause the engine to do a lookup on the northpoint search engine) or engines, and then combine the results of its own search with that of the topic-specific engine for relevant results.

    It's the whole idea of vertical portals taken to the next level. The vertical portals provide topic-specific searching capabilities over the 'Net to the behemoth engines and portals for a fee, or something.

    Remember, the user will not get smarter, but will rather look for the faster and easier solution.

    IMHO.

  62. Semantics Antics by TrueJim · · Score: 3

    I'm going to say a naughty word: artificial intelligence. I'm hoping we soon ( 5 years) get good enough at this "indexing" stuff to create semantic models of Web content rather than purely syntactic models. (Google is a small step in the right direction.) If so, then perhaps dynamic pages can be indexed according to their location (role?) in an "ontology" rather than via the frequency of essentially meaningless character strings. That may sound farfetched, but it seems to me that the Web finally provides a real _financial_ incentive with near-term payoff for that kind of research. Hitherto, the quest has been purely academic. And where there's the lure of a real payoff, stuff often happens quickly (usually -- batteries and flat-screen technologies being notable exceptions).

    --
    I hope that after I die the one word people use to describe me is "resurrected."
    1. Re:Semantics Antics by Raevnos · · Score: 2

      Unfortunately, even Google is fast becoming useless - practically every search I do on it results in thousands of mirrors of some page describing some RPM that I care nothing about.

      Search engines act like Lem's Demon of the Second Order right now - returning lots of information, but very little of it of any use or relevance. I've thought a bit about ways to improve it - say, a perl script that queries half a dozen search engines, and uses the pages that appear in a majority of the results, applying simple rules to them, based on markup (Like giving keywords that appear in a heading element higher priority than those that show up in a paragraph) and the number of other pages in the search area that link to them... Not AI, just a bunch of heuristics. Adding hierarchy-based rules (Ie, page B is only found via a link on page A, and A and B have similar URLs, so B might be a sub-page of A, and shouldn't be considered if A passes everything, because you'll get to it from A anyways) is an interesting possibility if I ever get around to writing something like this. I think the same rules could apply to static and dynamic pages, though. No need to treat them differently, aside from result caching.
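
      The majority-vote part of that idea, sketched in Python rather than Perl. The input is assumed to be a dictionary mapping an engine name to the list of URLs it returned; fetching those results, the markup rules and the hierarchy rules are all left out, and `majority_results` is a made-up name.

```python
from collections import Counter


def majority_results(results_by_engine):
    """Keep URLs returned by more than half of the engines, best-supported first."""
    votes = Counter()
    for urls in results_by_engine.values():
        votes.update(set(urls))  # one vote per engine, however often it lists a URL
    needed = len(results_by_engine) / 2
    winners = [url for url, count in votes.items() if count > needed]
    # Sort by vote count (descending), then by URL for a stable order.
    return sorted(winners, key=lambda url: (-votes[url], url))
```

      With half a dozen engines, a page has to show up in at least four result lists to survive, which should already drop a lot of single-engine index spam.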

    2. Re:Semantics Antics by Anonymous Coward · · Score: 0

      Alls I can say is visit http://www.npsis.net for domain answers.

  63. The problem is centralization by TulioSerpio · · Score: 1

    Every Web/FTP server should have a standard, live query engine. Every week or so, some sites would query them and update their databases, but only at the site level. An end user who wants a specific page would then query the site itself, in a second-phase search. [Good Spanish, bad English]

    --

    I'm from Argentina: Tango, Asado, Mate, Gaucho, Maradona, YPF

  64. This looks like a job for.....XML! by Rombuu · · Score: 1

    Look up in the sky.. it's a bird, it's a plane, it's web sites dumping their information from hard to index databases to easy to read XML!

    Wasn't this sort of thing what XML and RDF were originally designed for?

    --

    DrLunch.com The site that tells you what's for lunch!
  65. less page, more site by matman · · Score: 1

    I think that this is going to force the search engines to focus on sites rather than pages, as a site can be described by keywords even if its subsequent pages are database driven. I like searching by site usually anyway - provided that the site has a nice search engine :)

  66. Don't change the web--change the way we search by Zomart9th · · Score: 1

    The web is growing and changing at a pace that a band-aid fix like static indices just won't solve. Database-driven web sites are simply more manageable, scale better, and more easily allow the separation of content creation from site design than static ones consisting of n-thousand HTML documents.

    Technologies like XML and WDDX provide access to databases through standard protocols and are not difficult to implement. A few simple, scalable solutions include:

    • Allaire's ColdFusion (for small websites)
    • Allaire's Spectra for *huge* sites
    • Apache in combo with some DB fun for those of you sage enough to use *nix

    DB-based web content has the potential to make the web more searchable than ever before through hierarchy and content classification, but only if we do not try to rein it in. Instead, we should adapt the way we search to the emerging scalable, powerful web architecture that is the future of the web.

    --
    Bryan Klingner, MCSE, MCP+I
  67. Data Driven Sites: an RFC standard is needed by CodeShark · · Score: 1
    I don't think that the 'Net is becoming unsearchable, I think there is no standardization in how to write and/or search a data driven site. When I code a database driven site, for example, I include code which automatically writes the new meta tag information for the content page, and then I submit the new pages to as many engines as I can. But I don't know of any way to automatically get the big sites to delete my old pages and replace them with the new -- the timing of the site submissions appearing in the search engine directories has been highly unpredictable to say the least.

    What I try to prevent is the problem I am going to mention next, which is that it seems like with many data driven sites, the content pages "expire" (i.e., they are aged out of the database -- thus disappearing from the site) without any notification to the search engines that the page is expired.

    As an example, I use a product which performs queries against 10-12 search engines at the same time. For any given search, 10% or more of the pages will be invalid. What little research I have done into the invalid sites often shows that the page has been dead for more than a year -- even when 8 or more of the search engines advertise that they have (at least in theory) spidered the page within the last 60 days.

    What we have here is a problem in search of a standards based solution (an official RFC) designed to bring order out of the chaos.

    My own thought (which I acknowledge comes from someone who has been doing data driven sites for less than a year) is that there ought to be a standard way of telling an external spider to use a "site local" index file, similar to how the robots.txt file excludes some or all of a site from spidering (assuming the spider's coders obey the standards -- not all do.)

    It then becomes the data-driven site coders' responsibility to add the code which updates the robot's index file "automagically" based on the content changes to the site.

    It also seems to me like browsers could access this file to see if a bookmark is still active, and with the proper format, maybe even update the local bookmark file. Something like this:


    1. http://mygreatsite.com/old.html := http://mygreat.com/new.html.
    I'm interested in what more experienced coders have to say about this idea, BTW.
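
    For illustration only, here is a rough Python sketch of how a spider or browser might read such a "site local" index file. The one-mapping-per-line layout follows the example above; the "#" comment lines and the function names are assumptions, not part of any existing standard.

```python
def parse_site_index(text):
    """Parse lines of the form 'old_url := new_url' into a dict of moves."""
    moves = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # '#' comments are an assumed extension
            continue
        old, sep, new = line.partition(":=")
        if not sep:
            continue  # not a mapping line; ignore it
        moves[old.strip()] = new.strip()
    return moves


def resolve(url, moves, max_hops=10):
    """Follow chained moves, stopping on a cycle or after max_hops."""
    seen = set()
    while url in moves and url not in seen and len(seen) < max_hops:
        seen.add(url)
        url = moves[url]
    return url


index_text = """
# hypothetical site-local index file
http://mygreatsite.com/old.html := http://mygreat.com/new.html
"""
moves = parse_site_index(index_text)
bookmark = resolve("http://mygreatsite.com/old.html", moves)
```

    A page that has expired with no replacement would need its own notation (for example an empty right-hand side), which this sketch does not cover.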

    --
    ...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
  68. One use for the whole e-Speak shebang? by rise · · Score: 1

    A web site is basically a network service. It seems like there should be a place for a distributed protocol that actually allows an intelligent* search. If you defined a doc/HOWTO type you could search for sites providing those services with criteria that select the particular issue you're looking for. Try that with a search engine and irrelevant juxtapositions will fill your results with noise.


    *Intelligent in the sense that the search method used shares a vocabulary with the providers.

    1. Re:One use for the whole e-Speak shebang? by technos · · Score: 2

      A site-specific search service built around the newly-opened E-speak? Damn good idea.. Not only would it provide an easy interface for searchbots, but in the future it could provide information for user-agents and other client-side searches. I'd imagine it wouldn't inflict as much server overhead as the current system.

      I'll be flipping through the E-speak tutorial for the rest of the afternoon!

      --
      .sig: Now legally binding!
  69. About dynamically generated pages... by diediebinks · · Score: 1

    Almost all search engines will reject dynamically generated pages if they have extended characters in the URL (except for Lycos and Inktomi). This is primarily because they are worried about getting into what they call "robot traps", where there may be no end to the number of links that a script or program generates. If the URL contains a "?", "%" or other similar characters, they will probably not index your site. A workaround is to build "Pointer Pages" using regular static HTML with links to the target page. If you attempt to use the refresh tag within the "pointer" pages, be aware that Infoseek will try to index your targeted page, not the page that you submit. There are ways around this problem...


    (From The Unfair Advantage Book on Winning The Search Engine Wars)

  70. Google by Kvort · · Score: 1

    Google rocks. I can do a search and find all the articles I have ever posted on slashdot. (Archived, of course) The problem of slowness of distribution to search engines is a difficulty, but compared to historical ways of gaining information, what we have is incredible.

    We should have some sort of a standard way of indexing these pages, and if they make it compatible with all the new technologies coming out, I will be very impressed. The best search engines will use the standard indexing in addition to current technologies, I would suppose, but it would still make life much easier to have this.

    Also, it would help if there were a central place to notify when you have posted or changed content, something like the way domain names are registered in central places. It's in the user's best interest to notify the central location that content has been added or changed, and then the central point propagates that information to anyone who wants it, for a small fee, of course. :)

    Why do I post these things on public forums, anyway?

    >>>>>>>>> Kvort, Lord High Peanut of Krondor

    --
    -Don't mind me, I'm personality-deficient and mentally-impaired.
  71. Distributed open database standard for the web by Anonymous Coward · · Score: 0

    Basically we need a distributed open database
    standard for the web. Searching a database is
    much faster than doing a blind text search and
    should definitely take up less bandwidth and
    resources than a text search. If each ISP and
    independent node on the internet hooked up
    their databases and (imported) html pages,
    we'd be able to search anything, anywhere.

    Of course, implementing it will be tough. The
    current approach of web searching is based on
    laziness. Actively participating in the creation
    of a web index is not necessary. The only reason
    for ISPs to participate is because they're afraid
    that spiders eat up too much bandwidth.

    In the mean time, we'll just have to live with
    what we have. As Larry Wall is fond of
    saying, "Laziness is a virtue". I hope that
    enough of us are lazy enough to use plain ol'
    text, HTML, SGML and XML.




  72. Problems with search engines by BoneFlower · · Score: 1

    Search engines have serious problems. One is that boolean strings and other forms of highly specific searching never seem to work. I search for anything, and I get maybe 20 out of 3000000 sites that have what I want. And many of these sites are on the fifth or sixth page. What needs to happen is search engines coming up with a better way of ranking sites. It's really annoying when the 100% relevant site has nothing remotely related to your search, and the 25% site is exactly what you are looking for. Search engines also have to do more to prevent spamming. Content-based searching rather than keyword searching should be implemented, since it can help, but keyword searching, if improved, is still good when searching for specific information. Search engines could focus on specific areas; a SlashSearch.com, for example, would be a tech search engine. The search-everything engines could add a category option to their advanced search mode. Database driven sites should use meta tags describing the content type. While no solution can be perfect in a rapidly changing environment like the Web, these ideas can be implemented and would help.

  73. Spidering dynamic content by dvt · · Score: 1
    Spidering dynamic content is not itself a problem. Because HTTP does not provide a "list all files in folder" method, you have to use the same basic approach regardless of the source of the content: start in a root page, extract the HREFs, index those pages, get their HREFs, etc.

    If an HREF contains a query string, sending that query string will return the content in the same way that sending an ordinary www.sample.com/page.html link will return the content.

    Another message mentioned the problem of loops. A table of visited URLs does not always work because of the problem of relative links that get continuously appended to on sites that return index.html for broken links. Two alternatives are:

    (1) limit the spidering depth so that you only go, say, 4 links deep into the site, or

    (2) make a hash value on content returned, and use the hash value to see if you are getting the same content with a different URL. Stop spidering any time the hash value is the same as a previous hash value.
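
    A rough Python sketch that combines both alternatives in one crawl loop. The fetch function is a stand-in for real HTTP retrieval and link extraction, and the tiny fake site (which answers every unknown path with its front page) is invented to show the loop problem.

```python
import hashlib
from collections import deque


def crawl(fetch, root, max_depth=4):
    """Breadth-first crawl; fetch(url) must return (content, list_of_links)."""
    seen_hashes = set()
    seen_urls = {root}
    indexed = []
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        content, links = fetch(url)
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # same content under a different URL: do not follow further
        seen_hashes.add(digest)
        indexed.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: index the page but ignore its links
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append((link, depth + 1))
    return indexed


def fake_fetch(url):
    """Hypothetical site that serves its front page for any broken link."""
    pages = {"/": ("front page", ["/a.html", "/x/"]), "/a.html": ("page a", ["/"])}
    if url in pages:
        return pages[url]
    return ("front page", [url + "x/"])  # relative link keeps growing the URL


pages_found = crawl(fake_fetch, "/")
```

    The hash check only catches byte-identical content; a page that embeds the request time or a counter would slip past it, which is where the depth limit still earns its keep.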

  74. bots by Anonymous Coward · · Score: 0
    We need personal bots that search the internet for us all the time, so I can tell my bot to find web sites containing Heller references and store the URLs on my 20 gig (or whatever) HD.

    I have a lot of interests, but I don't think a URL entry for every web site that holds my interest would fill up 20 gigs.

    Also it should have the option to only search entry points into domains, such as http://slashdot.org but not http://slashdot.org/whatever/more/test.html

    On a similar note, I think the 'WEB' has gotten to a point where web sites need a tag that gives an overall content rating; for example, porn sites might carry one tag and a personal site another, and so on. Then I can click the no-porn option on my search engine and not have 500 returns, 450 involving animals....

    Now I use porn as an example, but I don't think it should be removed from the net; I just think I should only be subject to it if I want to be.

  75. We have the technology to make it better... by underbider · · Score: 1

    So, yes, in terms of technology, it is easier to classify web pages into categories, then index them within each category. Check out http://www.cora.jprc.com/ It is a search engine for Computer Science research papers. It is in a format that is just like yahoo, but everything is done automatically!

    So, the technology is here. It is just a matter of time before this kind of thing is necessary.

  76. DB-Driven Pages with Full, Normal URLs by dschuetz · · Score: 1
    This is actually pretty easy to do. We have three sites (all internal) that all run with a MySQL DB storing all web content, and use PHP/Apache as the browser/DB interface. By using an Apache "AliasMatch" directive, we rewrite everything to point to a primary PHP script:

    AliasMatch ^/(.*) /home/web/index.html/$1

    This front-end does a lot of (admittedly, crude) parsing of the rest of the URI line to determine what "document" in the DB to look up, and which subordinate page, or if it's supposed to be instead generating an image, or whatever. The main script also looks up styles for each document, builds navigation bars, etc.
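
    To give a feel for that kind of parsing, here is a hedged sketch in Python rather than PHP. The URL layout (/document/page and an /images/ prefix) is purely hypothetical and is not the scheme our sites actually use.

```python
def route(path_info):
    """Map the path after the front-end script to a lookup request.

    Assumed layout (hypothetical): /images/<name> for generated images,
    otherwise /<document>[/<page>] for documents stored in the DB.
    """
    parts = [p for p in path_info.split("/") if p]
    if not parts:
        return {"kind": "document", "document": "index", "page": None}
    if parts[0] == "images":
        return {"kind": "image", "name": "/".join(parts[1:])}
    page = parts[1] if len(parts) > 1 else None
    return {"kind": "document", "document": parts[0], "page": page}


request = route("/audit-report/findings")
```

    The real script would then use the returned fields as keys for the database lookup; that part is omitted here.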

    Works pretty well. Not nearly as flexible as if I'd actually thought it through before writing it, but it fits our needs admirably.

    Why'd we go to such a complicated approach? Because we have bunches of InfoSec engineers, who really don't want to worry about HTML, writing web pages and reports. We've got a GUI front-end with a nice wysiwyg HTML editor that hits the documents in the DB directly, and all changes happen on the "live" HTTP server immediately. It's completely scannable because we use a web-get sort of tool to create a static "snapshot" of the final report before we send it to customers.

    At any rate, I think it's cool... :-)


    david.

  77. why the web is broken. by whocares · · Score: 1

    A friend of mine asked me once to explain my opinion of why the web is broken. After some thought, I came to some conclusions that are relevant here. I'll see if I can restate them effectively. All IMHO, of course.

    A couple of assumptions:

    1) The web is a non-hierarchical, non-linear system. The entire nature of it is actually closely related to how most people think, through a series of links. Ever found yourself explaining to someone how you got from one seemingly unrelated topic to another? The web is the same thing.

    2) Mapping linear, hierarchical systems is what humans are good at. Indices, tables, flowcharts, etc. are all designed to present a certain kind of data in a randomly accessible way. When information is non-linear, we try to force it into this kind of structure, for better or for worse. This is what search engines currently try to do - provide a keyword index to every document on the web.

    We cannot treat the web like something it is not. It is not a book or a collection of books. It is not even linear. It's a lot closer to the repository of information that is the human mind than most things that humans create.

    This presents an information-finding nightmare. Much as it's sometimes difficult to find the piece of information you know you have stored in your head, it's becoming increasingly more difficult, even with the power of algorithmic parsing and pruning, to extract single pieces of information from the system. Search engines are, as the original post stated, becoming obsolete.

    So what is the solution? In my opinion, the most intuitive 'index' type interface to the web has always been Yahoo, which for any given topic will provide a number of starting points. Not every document is indexed, not everything is represented - but if you drill down through links, you are more than likely to find what you're looking for. It takes the natural process of searching the web, which if it were a few hundred nodes could easily be done by hand, and gives it a logical starting point, much as someone can remind you of something you were searching to remember, and suddenly it all becomes clear. Indexing the entire web is as useless as trying to do an entire braindump of your mind. Indexing a set of starting points for using the web the way it was intended - as a series of links - is the only way that will probably ultimately work.

  78. Alternatives to [?|&|phtml|etc.] database calls? by SwellJoe · · Score: 1
    This adversely affects web caching technologies as well. Any dynamic content is uncacheable and unsearchable due to the inability to know if the content is specific to a query or simply reflects a love of the concept of "dynamic content" on the part of the page designer.

    It's clear that many aspects of a webpage that could be pregenerated every time the information is updated are not being done that way. Slashdot is a prime example. Presumably, thousands of people visit Slashdot anonymously every day. Even though they see the same content, the page is regenerated with unsearchable/uncacheable content. Shouldn't it be a simple matter to have a script choose between a dynamic page for logged in users and a constantly up to date pregenerated page for anonymous users? That would save CPU cycles for the servers, allow indexing by search engines, and speed up access for users behind a caching proxy. Sounds like only good things to me.
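
    A small Python sketch of that choice, with the caveat that it says nothing about how Slashdot is really built. The render function, the story list, and the user names are placeholders.

```python
def make_front_page(render):
    """Return (publish, serve) functions that share one pregenerated page."""
    state = {"stories": [], "anonymous_html": render([], None)}

    def publish(stories):
        # Regenerate the anonymous page once, when the content changes.
        state["stories"] = list(stories)
        state["anonymous_html"] = render(state["stories"], None)

    def serve(user=None):
        if user is None:
            return state["anonymous_html"]  # static, cacheable, indexable
        return render(state["stories"], user)  # personalised, built per request

    return publish, serve


calls = []  # records every render, to show how rarely the anonymous page is rebuilt


def render(stories, user):
    calls.append(user)
    who = user or "anonymous"
    return "<html>" + who + ": " + ", ".join(stories) + "</html>"


publish, serve = make_front_page(render)
publish(["Is the Internet Becoming Unsearchable?"])
```

    In a real deployment the pregenerated page would more likely be written to disk as a plain .html file so the web server and any caching proxy can handle it without touching the script at all.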

    Obviously this won't solve all of the problems, but many websites' front pages are the same for every user. Wouldn't it make sense to pregenerate them as static content? This could be taken much further by news sites that provide the same story content to every user, but use a database frontend for simplicity anyway. This doesn't preclude use of a backend database for information storage and organization, but it does impose quite a lot of complexity in the implementation of a system to index all of the pages as they become available and make them into static, numbered pages.

    I tend to fall into the category of folks who believe that site designers should be a little more aware of the outside world and make their content accessible via every possible means. I don't think it makes sense to prevent search engines from finding one's content. If you've put it up, you want people to find it. Why turn down that extra banner display simply because someone doesn't check your headlines and instead searches Google or AltaVista?

    I'm sure there are other issues involved and I'm glad this was brought up...I've been trying to figure out solutions to these problems myself while implementing a company web page with our web designer. Being a cache server company, we've got to make sure our own pages are completely cacheable whenever humanly possible. Not to mention that when someone does a search on any engine we want our URL to come up if we've got something to say on the subject of the search. It just makes sense to be as openly accessible as possible.

    So, is this a problem that should be addressed mainly by the search engines, or should web designers be thinking ahead to such concerns when they are building a site with dynamic content?

    Joe, Swell Technology

  79. Catalogues are a Good Thing(tm) by saska · · Score: 1
    Long live Yahoo (and alikes).

    Markus
    --

  80. Specialized Engines - Not More Engines! by xtal · · Score: 2

    The answer to all this isn't going to come from making existing engines better, nor is it going to come from bigger, badder, faster database engines powered by your friendly clustering technologies!

    The answer is simple: More specialized search engines. You're looking for technical stuff? Then you should be able to search a technical database. Like, if I'm looking for source code to model fluid flows - that's pretty specific already. There's no reason that I should have to wade through all the references to "bodily fluids" that I'll get on altavista for instance!

    Search engine people, take note of this. Classify your URLs into categories - like Yahoo - but come up with some way to do it automatically. Or even better yet, let the users do it, a la NewHoo.

    End of internet predicted. Film at 11. We've heard it before, and we'll hear it again. Just need someone with a little VC money to throw it towards an idea that supports more specialization in search engine tech.

    Kudos..

    --
    ..don't panic
  81. From a web page owner by gargle · · Score: 2

    I use a free site statistics service to keep track of hits to my web site, where I keep some software that I've written. Looking at the referrer statistics to my site, the vast majority of hits are generated from explicit, categorized links to my site (e.g. bookmark pages and surprisingly Lycos which has a categorized database), and rarely ever from general search engines like Altavista. The questioner may be right - from the perspective of a web site owner, general search engines aren't very effective at bringing visitors to my site.

  82. You can.... by Anonymous Coward · · Score: 0

    http://www.navigateone.com OK, so it's only financial information, but it does update itself, work out queries on it's own etc..etc... So it's not impossible. p.s. It's nothing to do with me.

  83. People just don't know how to search. by segmond · · Score: 1

    People just don't know how to search.
    Since I have been using the internet, I have stopped making daily trips to the library. Searching is an art. The web is pretty searchable, but it takes quite some effort: knowing the right search engines to use for what, and knowing the right keywords and combinations.

    --
    ------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
    1. Re:People just don't know how to search. by Junks+Jerzey · · Score: 1

      That's true to some extent, but not entirely. If I want to search for a paper on generational garbage collection, then it's straightforward. But searching for many things in a less geeky realm is a disaster. Some reasons include:

      1. Commercial sites are filling their sites with irrelevant keywords in attempts to get hits for advertisers. How many porn sites have hidden text like "Natalie Portman nude!" or references to JFK Jr. or Princess Diana?

      2. Many commercial sites are filled with empty marketing phrases that don't help narrow a search.

      3. There are countless failed businesses from 1997 that still have live sites. I run into this all the time.

  84. XML by debrain · · Score: 2

    IIRC, XML was designed to help alleviate this sort of thing. Unfortunately, XML has not been exploited enough to have any significant ramification on the way the internet is sorted.

  85. Why not use ScriptAlias by johnnyb · · Score: 3

    Why doesn't anyone use the ScriptAlias directive? It does the same thing as query strings, but makes it look nicer, like the rest of the web. You can "say" you're looking at a directory or a .html file, but in reality you are viewing a single script. For an example go to http://store.wolfram.com/. There are no directories on the server side, it's all served off of one script. Yet, to the user, it appears as a hierarchical directory structure, complete with .html files. The only query string is your session id, which is appended to the URL in case your browser doesn't support cookies (however, these are not there if a robot views the site). Anyway, a simple directive like ScriptAlias can save everyone a lot of trouble. If anyone has questions about its usage, send me an email.

    Jon

    1. Re:Why not use ScriptAlias by Ranger+Rick · · Score: 1

      > There are no directories on the server side,
      > it's all served off of one script. Yet, to the
      > user, it appears as a hierarchical directory
      > structure, complete with .html files.

      That may be great for search engine indexing, but how maintainable is the code that serves that? Unless you have everything organized in a database, it's got to be a real pain in the butt to maintain one big monolithic CGI.

      --

      WWJD? JWRTFM!!!

  86. Unsearchable? Always has been! by swordgeek · · Score: 1

    OK, maybe that's a bit of an overstatement, but not much.

    Does anyone else remember searching before the web came into its own? I remember constructing carefully planned Archie searches only to often find either no results, or pages and pages of 'em. After a while (and perusing a lot of those pages of results), you learned which sites had most of the stuff you needed. Windows shareware was at Simtel-20, OS/2 at cdrom.com, Unix at sunsite, etc. Non-software usually necessitated going back to Archie and throwing searches at it until something stuck.

    Fast forward, and we're still doing that with the web. The only difference is that the amount of archived, non-software information (and hence its importance) has gone up dramatically. In light of that, I'd say that the search engines are more useful than one might expect.

    Unfortunately, that's not really good enough for practical purposes. Forget all of the techniques we're trying to tweak right now. Someone has to come up with a fundamentally different way of searching through indices; one which behaves the way altavista claims (but fails) to do. In other words, enter a question and have the engine _interpret_ the question before searching.

    But I don't see it happening for a few years. Oh well.

    --

    "People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
    1. Re:Unsearchable? Always has been! by gonzocanuck · · Score: 1

      I agree. I trained as a library technician during 1995-1997. I took classes, like above, in how to properly form a search query in a ton of different databases and on the net. I was a good little user and remembered my Boolean and broadening and shortening my search terms. Back then I would find all sorts of obscure pages with Webcrawler. These days I usually have to try three or four different engines before I can find what I need. Nope, nothing's changed, it's always been that way. The only thing I notice is which search engines have too many outdated links; I never use Webcrawler or Excite anymore. Not even Yahoo, that's the total pits.

      --

  87. Re:Black holes - Simple solution by Strauss · · Score: 1

    One solution, anyway. Simply tell the spider not to index anything more than X levels deep into a site. Where, of course, X is a relatively small number - say, 5. Alternately, for this sample, look closely at the URL. If you're looking at /foo/foo/foo/bar.html, then there "must" be something wrong with the path, so stop looking there and move back out.
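
    Both checks fit in a few lines of Python. The thresholds (5 levels, more than 2 identical segments in a row) are just the example numbers from above, and counting "levels" as path segments is an assumption.

```python
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=5, max_repeats=2):
    """Flag URLs that are too deep or that repeat the same path segment in a row."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    run = 1
    for previous, current in zip(segments, segments[1:]):
        run = run + 1 if current == previous else 1
        if run > max_repeats:
            return True  # e.g. /foo/foo/foo/... : stop and move back out
    return False


suspicious = looks_like_trap("http://example.com/foo/foo/foo/bar.html")
```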

    --

    Trifle not with Dragons, for you are crunchy - and go well with catsup.

  88. Dynamic Pages not indexed by DaveHowe · · Score: 2

    Hmm. How about this for an idea:
    when a webbot sees a dynamic page, it changes the query to ?Webbot - and expects to get back a specially formatted page starting <H1>Webbot index</H1> and followed by a set of comma separated keywords, a break, an URL, a paragraph, then the next set? The webbots would be happy, as they don't have to waste bandwidth and cpu time spidering over the site; the server should be happy, as it doesn't have to support the webbot's spidering, and the site owners should be happy, as they can specify what keywords each result will be indexed under. obviously, just reformatting the index to the product database could generate this page for an ecommerce site, and more static sites could just use a static statement of what their site carries.....
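
    As a rough illustration, a site could build that page with something like the Python below. Only the heading and the keywords / break / URL / paragraph ordering come from the description above; the exact tags, the one-set-per-line layout, and the sample product data are guesses.

```python
from html import escape


def webbot_index(entries):
    """Build the proposed index page from (keywords, url) pairs."""
    lines = ["<H1>Webbot index</H1>"]
    for keywords, url in entries:
        # One set per line: comma separated keywords, a break, the URL, a paragraph.
        lines.append(escape(", ".join(keywords)) + "<BR>" + escape(url) + "<P>")
    return "\n".join(lines)


# Hypothetical rows, as they might come out of a product database.
catalog = [
    (["widgets", "gadgets"], "http://shop.example/item?id=1"),
    (["modems", "14.4k"], "http://shop.example/item?id=2"),
]
page = webbot_index(catalog)
```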
    --

    --
    -=DaveHowe=-
  89. Several Issues make searching suck. by smk · · Score: 1
    As a journalism student I often come to the point where the "garbage" in search engines makes it impossible to search the web for information. It is not only the database-driven websites but also the short-sighted design of websites that causes so much pain:

    First, robots.txt is well known but never used when people are in a hurry, which means never used. Only when you get too many hits from a search engine do you put one in the directory. Even then, for lack of time, you normally exclude all pages except the front page.

    Second, designers and webmasters are not pushing the technical possibilities to the max, like using Apache paths and mod_rewrite to transform queries into a virtual path, which would make a document look real, even to a search engine. Since this is not an out-of-the-box feature, there is little hope for this.

    But the worst one is a problem with the engines themselves: the time between two visits of an engine. It is up to a month now, which means: better use the local search on slashdot, freshmeat, nerfpoint and others.

    Indexes like yahoo, infoseek, web.de (German) and others become the only hope to find a starting point and then use local searching.

    This is why altavista, hotbot, lycos and others got the additional "directory" feature. Compare them to google and you know how they looked two years ago. And in time, google will ... well, maybe not. I hope.

    --
    * Smile. People will wonder what you think. *
  90. Search Hyper Text Protocol by friscolr · · Score: 1
    We need (and have needed for a while) a protocol for searching the web. Someone who's read the appropriate RFC, please write one.

    My Suggestions for said protocol:

    • A search server that everyone runs in conjunction to a web server that creates a standardized DB of the content of that site. This can then be config'd to include any dynamically created content.
    • Create a network structure similar to DNS, wherein upper-level servers will query lower-level ones at various intervals, receiving new DBs. Put it on port 75 (since I was born in '75 :-). These upper-level servers are then your search sites.
    • Name it Search Hyper Text Protocol, or SHyT Protocol.

    It's not putting Altavista or yahoo or others out of business b/c you still need those top level servers to query everyone. It's solving the dynamic problem b/c each search site can create its own DB however it wants, which also still gives Excite and Infoseek and the like a market in which to sell their search engines.

    Until this protocol is ready, create static pages from your dynamic content so that the conventional search engines will have something to catalogue.

    -f
    http://www.peruano.org/

  91. Searching by GC · · Score: 2

    For me the best way to search the internet is to go to a site dealing with the context of your query and search that site with its own search engine (which most major sites have).

    It would be nice if a generic search engine worked in the following way:

    1. User searches for say "Cisco VPN Routing"
    2. The search engine identifies sites www.cisco.com and other sites which are related to the search query string.
    3. Instead of trying to index these sites itself, it calls on the search engine at the site matching the context and queries that.
    4. Returns the results of the search at cisco.com to the user.

    It's kind of like a distributedSearch, where the actual search is done by the holder of the data, all that the search engine actually does is try to find a context for the Search Query and find sites with their own search engines that match that context.
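
    The steps above could be sketched in Python roughly as follows. The registry of sites, their topic keywords, and the two stand-in site search functions are all invented for the example; a real version would call each site's own search engine over the network.

```python
def distributed_search(query, registry):
    """Forward the query to every site whose topics overlap the query terms."""
    terms = set(query.lower().split())
    results = []
    for site, (topics, site_search) in registry.items():
        if terms & topics:  # the site's context matches the query
            for hit in site_search(query):  # the data holder does the real search
                results.append((site, hit))
    return results


def cisco_search(query):
    """Hypothetical stand-in for the search engine at www.cisco.com."""
    return ["/docs/vpn-routing.html"] if "vpn" in query.lower() else []


def cooking_search(query):
    """Hypothetical stand-in for an unrelated site's search engine."""
    return ["/recipes/soup.html"]


registry = {
    "www.cisco.com": ({"cisco", "vpn", "routing"}, cisco_search),
    "cooking.example": ({"soup", "bread"}, cooking_search),
}
hits = distributed_search("Cisco VPN Routing", registry)
```

    The hard part, finding the context for a query and deciding which sites match it, is reduced here to a keyword overlap, which is certainly too crude for real use.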

    So in answer to your question: My answer is No, the Internet isn't unsearchable, we just haven't implemented a reasonable standard for searching, which can be as important as routing when it comes to a network of the size of the Internet.

  92. php3 and other unlistable pages by Subwolf · · Score: 1

    I work for a .com company. We had this very issue as our entire website is dynamically generated with a single C program. It grabs various parts of pages created mostly with php3.
    How to index it? Frames. Put a 1 pixel frame at the top of the page. Hardly noticeable, and the search engine sees the frames page, where you can put a whole bunch of comments and meta tags and a tag to get the straight text, depending on the search engine needs.

  93. does fast == good? by maphew · · Score: 2

    I like a snappy search engine response time as much as the next guy, especially when I'm looking for something fairly current or mainstream. But how can you tell a search engine to tread farther off the beaten path?

    For example, a few days ago I was looking for the dip switch settings on an old 14.4k modem. Now I *knew* the info was out there on the web somewhere. I also thought it was highly unlikely to be in any of the major search engines' in-RAM indexes. I would have been quite happy to submit a boolean or reg-ex query to a search engine and then check back an hour later to get the results.

    In my mind, instant gratification search engines are useful and have their place, but I see a whole segment which just doesn't seem to be addressed. Is anybody even thinking about working on this?

    -matt

  94. Re: I _don't_ seem to be able to find what I need by c0mawhite · · Score: 1
    I've found that things are 100% searchable, but so many people are throwing "cruft" in (i.e. lying meta tags and invisible text) that I have to start including more and more exclude entries than I do include entries.

    There was a time when I could jump straight into an Excite power search and be assured that I'd find what I'm looking for within minutes.

    I don't think that PHP or heavy use of CGI has affected things, tbh. But search engines like Yahoo!, which don't trawl for content, are going to become far more useful.

  95. But now.. by Nose · · Score: 1

    If ebay has their way, indexing data is equivalent to cracking into another's system illegally. I guess that means that we should do away with all search engines entirely...


    Nose

    --
    Nose -Common Sense isn't.
    1. Re:But now.. by dillon_rinker · · Score: 2

      If ebay has their way, indexing data is equivalent to cracking into another's system illegally

      I think what you meant to say was "If ebay has their way, accessing a copyrighted database and publishing information from it after being explicitly told not to is equivalent to cracking into another's system illegally."

      I guess that means that we should do away with all search engines entirely...
      I'm afraid you're right. We're pretty close to a time when most web pages will be served up programmatically from what amount to copyrighted databases. Indexing such sites without explicit permission from the content owners would be legally risky.

  96. New Standard? Look at Porn Sites by eshaft · · Score: 2

    Give anyone the ability to talk directly to search engines and you'll see what has been happening with those damn porn sites on a large scale - do a query for anything, and it'll come up with a totally unrelated porn site for you.


    People figured out how to abuse keywords real quick, and this would just make it worse. Which is why I wonder about the continued existence of search engines. I use \. as my search engine - I use it to index my way into the web every day. I think that's the way of the future.






    PS I hate the G3 keyboards. They're tiny! It's like carpal tunnel syndrome x 5!!!
    --
    lf.o
  97. Page Generation as opposed to Dynamic Content by Cardinal · · Score: 1

    I think in some cases, it is easier on both web site maintainer and search engine for the content to be periodically generated rather than dynamically generated upon every request.

    Continuing the Slashdot example, for awhile during one of Slashdot's bandwidth crunch times, Rob was running CacheDot, a static version of Slashdot that was updated periodically.

    Sites that I run contain content such as representations of a product database, and these pages are regenerated whenever somebody adds/deletes/edits information in that database. Generating a complete product catalog may become impractical for larger sites, but then it's just an issue of generating a particular category, or even locking it down to a specific set of files related to the product being changed. (In a sense, Makefiles for web sites.) It's not terribly difficult to accomplish this generation work, and the result is cacheable product information, which I consider a Good Thing.
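
    A minimal sketch of the "Makefiles for web sites" idea, assuming the product data is available as a plain dictionary. The names `render_product` and `regenerate` and the file naming scheme are hypothetical, chosen only for illustration; the point is that a page is rewritten only when its rendered content actually changed:

```python
from html import escape
from pathlib import Path


def render_product(product):
    """Render one product record (a dict with 'name' and 'description') as static HTML."""
    name = escape(product["name"])
    description = escape(product["description"])
    return (
        f"<html><head><title>{name}</title></head>"
        f"<body><h1>{name}</h1><p>{description}</p></body></html>"
    )


def regenerate(products, out_dir):
    """Write a page only when its rendered content differs from the file on disk.

    Returns the names of the files that were (re)written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for product_id, product in sorted(products.items()):
        html = render_product(product)
        target = out_dir / f"product_{product_id}.html"
        if target.exists() and target.read_text(encoding="utf-8") == html:
            continue  # page is already up to date, leave it alone
        target.write_text(html, encoding="utf-8")
        written.append(target.name)
    return written
```

    Calling `regenerate` after every database edit touches only the affected files, so the rest of the catalog stays cacheable and crawlable as ordinary static pages.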

  98. Metadata at source by tbray · · Score: 1

    Everyone who thinks about this problem a lot comes up with the same answer: searching based on content never worked in the library context and won't on the Internet either. Metadata is the right way to go, which is why Yahoo and ODP are more popular than the robot-driven content search engines. The only model that has a hope is the Open Directory, but the right answer is a cultural shift where when people post data, they post metadata at the same time.

  99. Shopping Bots by interiot · · Score: 1
    There exist quite a few specialized shopping bots that search commerce sites or auction sites. Certainly the data they're pulling is dynamic.

    Does anyone know how they do it? Certainly some have special deals with the sites they search (I think PriceWatch mostly does this), but there are so many products on these sites that it seems like they'd have to be spidering...

    Are these bots very specialized, or can their techniques be used for the rest of the 'net?

  100. Searchengines: wrong way by neuroserve · · Score: 1

    Search engines consume an enormous amount of bandwidth while only indexing small parts of the web.

    I think distributed indexing is the way to go. Give everyone with a website a tool which indexes her site. Create an open index format to ensure that sites with dynamic content can create an index in that open format.

    Send compressed indexes to the search engines every time relevant content has changed.
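
    A rough sketch of what such a site-side indexing tool could look like. Everything here is an assumption for illustration: the index is a simple word-to-URL map, and gzip-compressed JSON stands in for whatever open index format would actually be agreed on (it is not SOIF or any existing standard):

```python
import gzip
import json
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the sorted list of URLs whose text contains it.

    `pages` is a dict of URL -> page text, e.g. pulled straight from the site's database.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in sorted(index.items())}


def pack_index(index):
    """Serialise the index as gzip-compressed JSON, ready to send to a search engine."""
    return gzip.compress(json.dumps(index, sort_keys=True).encode("utf-8"))


def unpack_index(blob):
    """Reverse of pack_index, as the receiving search engine might do."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

    Because the site builds the index from its own data, dynamically generated pages are covered without a crawler ever having to follow query-string URLs.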

    The problem: the common index format (though I think that the Harvest project produced such a format: SOIF). The search engine companies will never cooperate on this - the users have to do it.

    But as long as the search engine results are 'good enough' [tm], nothing will change.

    By neuro
    --
    -- it ain't over 'til it's over
  101. Making sense of dynamic content by tjwhaynes · · Score: 1

    With so many sites now offering dynamic, up-to-the-minute information, merely caching the contents of a page at a particular moment in time is at best only catching glimpses of these pages, and at worst leads to misleading search results when querying the search database. It strikes me that with these sorts of site, where the content of the pages is changing so rapidly, something more objective is needed.

    For example, the first clues about the state of flux of a particular page can be obtained by diff'ing the page against the previous copy held in memory. If the page is simply having extra items added in periodically, such as a FAQ, then the diffs will generally show that there is more being added than taken away, and the traditional snapshot method employed by search engines is fine. However, if the page is wildly different in the majority of its content (such as the slashdot main page), there is practically no point in making a copy of the page for indexing purposes. A much better solution is to attempt to build a keyword database automatically for this page by lexical analysis of the text - even a '100-most-common-words' list (with 'the', 'and', 'is' etc filtered out) would be an improvement on the current situation. As repeated visits build up, this keyword list will refine itself and actually provide some reasonable pointers to the material likely to be found at the site.
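
    A minimal sketch of the approach described above, using Python's standard `difflib` and `collections.Counter`. The function names, the short stop-word list and the 0.5 threshold are assumptions made for illustration, not taken from any existing search engine:

```python
import difflib
import re
from collections import Counter

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "it", "for", "on"}


def change_score(old_text, new_text):
    """Dissimilarity of two snapshots: 0.0 means identical, 1.0 means nothing in common."""
    matcher = difflib.SequenceMatcher(None, old_text.split(), new_text.split())
    return 1.0 - matcher.ratio()


def top_keywords(text, limit=100):
    """Return the most common words in the text, with stop words filtered out."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(limit)]


def index_entry(old_text, new_text, threshold=0.5):
    """Keep a snapshot for slowly changing pages, a keyword list for volatile ones."""
    if change_score(old_text, new_text) < threshold:
        return {"mode": "snapshot", "content": new_text}
    return {"mode": "keywords", "content": top_keywords(new_text)}
```

    A FAQ that gains a few entries between visits scores low and is stored verbatim; a front page that is mostly replaced scores high and is reduced to its keyword list. Merging the keyword counts across visits would give the self-refining list mentioned above.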

    With any large database, particularly when you get to the stage where gigabytes of information are considered small fry, the need for efficient data mining and generation of useful indices becomes increasingly important. Database technologies are looking forward to a time when there will be a need for petabyte and exabyte storage and retrieval, and effective distillation of a web page's information, rather than a word-for-word verbatim cache, will be the only answer.

    Cheers,

    Toby Haynes

    --
    Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
  102. Cache Top N Pages as static by Anonymous Coward · · Score: 0

    One thing you can do is cache a portion of your pages and regenerate them every 5 or 10 minutes or whatever. If you cache the most interesting 10,000 pages, that's 10,000 pages that the search engines can add to their databases. That's what we do at rivals.com, but it's based on popularity, not interest.

  103. Artificial Librarian by SEWilco · · Score: 2
    Librarians have been indexing things for a long time, and for decades researchers have been trying to make computers do indexing. Applying AI technologies for indexing is nothing new. The challenge is making a computer understand a topic and text well enough to properly index.

    Browse a few relevant papers and find some keywords to search for more of the part of the field in which you are interested:

    1. Re:Artificial Librarian by jjeek · · Score: 1

      this is key -- for years and years good AI indexing has been around the corner. it hasn't materialised, and i don't think it's likely to in my lifetime.

      there simply is no substitute for a human domain expert doing a detailed job of indexing. look at how hopeless most automagically generated indexes are; key words are frequently omitted from text which discusses the ideas they represent, and often pop up in marginally relevant contexts, so without complex ontological analysis (which is way too hard to do properly right now, although easy for humans) it's no surprise that most indexes are full of pointless references.

      key word searching is pretty well dead, mainly because of the way HTML turned out -- no one can make any semantic sense of an HTML document based on markup anymore.

      i think there's two ways to go if you want reasonable searching, and both require quite extensive human up-front effort. first, standalone metadata. by this i mean a shared data model and means of representing the model (such as, say, the dublin core and rdf). second, adoption of new markup for content, using not only a shared grammar (ie the actual markup) but also a shared vocabulary (that is, a shared semantic for the grammar, and a shared understanding of context -- the context being dependent on grammar, but not limited by grammar).

      the first approach is easier (but not easy) to implement and applicable to all 'documents', including non-textual documents. the second is much harder to do, but will make search algorithms simpler, faster, and more predictably useful.

      there's no pain-free way out of the mess that we have currently though.

      cheers
      j

  104. Crawlers are a KLUDGE by abatoire · · Score: 1
    Webmasters currently have very little control over the indexing; thus indexers have little intelligence. What is needed is a whole new protocol.

    Raw indexing of HTML leads to raw results that are often of no use whatsoever. What the indexers need is a way to query a 'site' for pages that should be indexed, how often to index them, what the general topic areas of the pages are, etc. Also needed are HTML tags that indicate the 'content area' of a page (so that navigational header and footer crap can be ignored) and a means to apply relative weight to areas of the page.

  105. The need for portals by drfalken · · Score: 1

    I'm not sure searching can be automated at all. That's why portals are going to play an increasing role and become increasingly specialized. It's worth remembering how young the Internet is right now and how centralized most useful resources are around very few sites. A recent BCG report indicated that 43% of online dollars are spent in the top 10 stores. That's insane considering the overall brick&mortar market they represent. Information is no different. As specialized websites continue to grow in number, specialized portals are going to develop to fill the need of finding useful information. Sites will have a much more interactive rapport with their portals than is currently the case. Search engines displaying thousands of hits per search will die off as their utility continues to diminish. And, ultimately, those who win will be the ones whose content and opinions are trustworthy (think Slashdot).

  106. Rethink the way we design sites and search them by JArneaud · · Score: 1

    A few suggestions here (for the web designers and the people doing the searching):

    1) The previously mentioned two-level web sites: enough static pages (with meta-tags etc.) to capture the search engine's interest. Backed up with dynamically generated pages for the bulk of the content.

    2) A huge collection of static pages refreshed from database-hosted source material. The static pages are updated whenever a change is made to the source. I'm sure a lot of web sites use this already in some cases; it probably performs better when the number of updates isn't too high anyway.

    3) Using "well known" sites for your searching: I remember attending a web-design conference where one speaker talked about search engines actually increasing the search time when compared to users clicking through links (on a properly designed site). Sites such as the IMDB, the big web bookstores, about.com, slashdot and the major news sites provide so much useful information in one place there often isn't any need to check anywhere else.

    I tend to locate a site or two that excels at providing a PARTICULAR type of content and go straight there instead of a search engine. All of the companies working on these general-purpose "web portals" (ick) should give up. Locate a niche and work on providing the BEST content and comprehensive links that you can on ONE TOPIC (or at least use some common sense).

    4) Smarter search engines? I've switched to using Google almost exclusively; it often displays the site I'm looking for in the top 5. However, I've clicked through 5 pages of results, given up in disgust and found the perfect site at a later date by sheer coincidence. I suspect that the perfect search algorithms are going to elude us for some time yet, and the WWW is getting too big to allow human-aided searching to make much of a dent.

  107. Dynamic overuse by myrddin · · Score: 1


    It has been my opinion for a long time that database driven dynamic web pages are entirely overused. If more people used things like Website Meta Language to preprocess their web site and make them "dynamically generated but statically served" that would take us a long way toward being able to index content.

    There is a tradeoff. All of your content is then not only in a database it is also in the web pages. But in my experience most sites who are dynamically generating their content via PHP, ASP, perl, mod_perl, whatever, don't really have enough content to worry about it.

  108. The problem isn't the search engines... by Dagmar+d'Surreal · · Score: 1

    ...the problem is the people who have completely and totally ignored everything the W3C ever said about why and how tags and documents should be used. Okay, so it's not limited to just that, but it's the most obvious symptom.

    For example... How many sites have you seen simply neglect to use the paragraph (<P>) tag? Instead they choose to make indiscriminate use of the hard line break (<BR>) tag to separate blocks of text. This is silently wrong although the visible output is the same. Remember how WAIS engines could further qualify searches by how "close" multiple words in a search were to each other in a document? Here in HTML we have a way to group words into semantic bundles by paragraph, and people completely ignore it. (No, we're not to the point yet.)

    How many times have you actually seen a web page author use the <DD> and <DT> tags properly when giving definitions? Most authors seem to simply decide that they don't like the way the text looks, and use some oddball invocation of tables and/or transparent GIF images. Of course, this means that a search spider has no idea that it's looking at the definition of something, whereas if the text were marked properly, any query of "definition of widget" or "definition:widget" would immediately return that page! Why do people dislike using <DD> & <DT> for definitions? The most popular answer I get is that they didn't like the way those tags formatted their text. They're entirely missing the point, again, that HTML is for marking different parts of a document with extra meaning. The browser is supposed to be what decides how it is shown to the user. META tags were abused by porn vendors and the other bottom-feeding denizens of the net to the point where they are nearly useless now. Even with CSS1 and CSS2 waiting in the wings to allow authors to properly control document layouts, most people seem to be too lazy to create their documents properly, so long as it's not immediately obvious that they were the ones who did something wrong. (Seems like the attitude of some large corporations--and we're still not at the point yet.)
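
    To make the "definition of widget" point concrete, here is a small sketch of what a spider could do with properly marked-up definitions, using Python's standard `html.parser`. The class and function names are hypothetical, and this is only one way a spider might use the markup, not a description of any real crawler:

```python
from html.parser import HTMLParser


class DefinitionExtractor(HTMLParser):
    """Collect term -> definition pairs from <dt>/<dd> markup."""

    def __init__(self):
        super().__init__()
        self.definitions = {}
        self._current_tag = None  # "dt", "dd" or None
        self._term = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # A new <dt> or <dd> implicitly closes the previous one.
        if tag in ("dt", "dd"):
            self.flush()
            self._current_tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd", "dl"):
            self.flush()

    def handle_data(self, data):
        if self._current_tag:
            self._buffer.append(data)

    def flush(self):
        """Store the text collected so far as a term or as its definition."""
        text = " ".join("".join(self._buffer).split())
        if self._current_tag == "dt" and text:
            self._term = text
        elif self._current_tag == "dd" and text and self._term:
            self.definitions[self._term.lower()] = text
        self._current_tag = None
        self._buffer = []


def extract_definitions(html):
    """Return a dict of lower-cased term -> definition found in the HTML."""
    parser = DefinitionExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.definitions
```

    Text laid out with tables, <BR> tags or spacer images yields nothing here, which is exactly the information the spider loses when the markup only describes appearance.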

    The proper use of HTTP is also completely neglected by most web site administrators. The cache/no-cache pragmas, the last-modified times, the content-type declarations: these things were all meant to give hints to the remote client (which is not supposed to be assumed to be a browser) about what type of document they're looking at and how to deal with it. Instead we find sites whose marketing directors insist that everything be done to inflate their hit counts as much as possible by preventing last-modified times from going out so the browsers won't cache the documents. We have entire sites which, out of insecurity that someone, somewhere, might decide that the entire site sucks and needs to be done over (just the look, not the content, mind you!), are built entirely out of dynamically generated content (like shtml- and php3-only sites), even though the parts that matter never change. (Apache now includes a number of things to get around this problem of template-driven content, by the way; see the 'Full' option for the X-Bit Hack for one such example.) (Almost there now.)
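
    As an illustration of the Last-Modified hint, here is a sketch of how a dynamically generated page could still answer conditional requests. The `respond` function is hypothetical and not tied to any particular server; it only shows the decision between a full 200 response and a 304 Not Modified:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET, honouring an If-Modified-Since header.

    `last_modified` must be a timezone-aware UTC datetime for the underlying content.
    """
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have 1 s resolution
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: fall through and send the full page
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers  # client's copy is still good, send no body
    headers["Content-Type"] = "text/html"
    return 200, headers
```

    A crawler or caching browser that sends If-Modified-Since then only pays for pages whose content really changed, even when every page is built from a template.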

    I'm terribly sorry to have to point it out, but far too many web page authors have completely disregarded the fact that HTML is not meant to be used to format the text. HTML is meant to mark-up the document so that the browser can format the text, and thus, upwards of 90% of the web pages online today are a folly in progress.

    Tons of things to facilitate search engines were specifically included in the protocols, but go straight out the window in practice because of short-sighted people who seem to think that the title WebMaster confers on them automatic competence and understanding of the system itself.

    Do not blame the search engine for the ignorance of the masses (because they are asses).

    Style over substance is the real culprit. (Point!)

    1. Re:The problem isn't the search engines... by HerrNewton · · Score: 2

      You missed the best stuff, though! You forgot to mention:

      • {Hn} tags to denote headers instead of using stupid workarounds like {font size=n}. The former denotes structure and importance.
      • Use of {strong} and {em} instead of {b} and {i}, again denoting structure.

      Take a look at my site, theFYI. Still a work in progress as the backend isn't done (yet). Dig through the source and see how it's built. I would have loved to use CSS for element layout but, hey, the browser support just is not there yet. Stuck with tables for a few more years. BUT take a look at the structure around each article. The header is denoted with an {h1} tag, its appearance changed with CSS-1. The paragraphs are marked with paragraph tags and, well hell, the linked URLs are surrounded with {cite} tags. That's how you code indexable HTML.

      Used a lot of the same tricks on another site, http://www.ptrm.org/ and the site does well in the search engines. Specifically, check out the page on the PTRM's paleontology field tours. It does well in the engines simply because it's got 'dinosaur' in the page title and in a header tag.

      (Yes, I know that curly brackets don't go around HTML tags. I just didn't want to escape the angle brackets every time I used an example of HTML.)

      --

      ----
      Am I the only one who thinks Microsoft is a misnomer? Perhaps Macrosoft would be a better fit?
  109. Invisible Web Resources by rubble · · Score: 1

    The "invisible web" issue being discussed is one that is gaining a great deal of energy as more and more users, especially new and unsophisticated web searchers, learn that many of the general search tools cannot and do not make all that the Internet offers easily, if not entirely, accessible and/or retrievable.

    Searchers, after learning this fact, must become knowledgeable about "specialty tools" in the area(s) that they need information in. This is quite similar to finding the necessary specialty reference book on the library shelf.

    Below find the URLs for a large and growing collection of these tools, which many visitors use as an acquisition tool to help in the selection process.

    Unlike similar "Invisible Web" resources, these pages have a more academic/scholarly feel to them.

    direct search-Main Page:
    http://gwis2.circ.gwu.edu/~gprice/direct.htm

    direct search-State (U.S.) Databases
    http://gwis2.circ.gwu.edu/~gprice/state.htm

    direct search-Searchable Bibliographies
    http://gwis2.circ.gwu.edu/~gprice/bibs.htm

    You can find more info on the Invisible Web here:
    http://www.altavista.com/cgi-bin/query?pg=aq


  111. Making Dynamic pages indexable by webmaven · · Score: 3
    Is it even possible to index dynamic pages? They don't really exist until the page is generated. Perhaps the best thing to do for sites that want to be indexed is to make sure they have a plain, vanilla index.html page that contains relevant keywords?

    It depends on what technology you're using to generate the pages.


    Zope sites for instance, are totally dynamically generated, even those pages that would normally be static. But the entire content of the site that's stored in the ODB is traversable via 'normal' URLs. This means that search engines can easily index your entire site.


    Note, however, that this only works if you've taken care to expose your content via links. If you've deliberately hidden your content behind a search interface (and you can still do this with Zope), then your site will be no more indexable than any other dynamic site.


    --
    --
    The real Webmaven is user ID 27463. I don't rate an imposter, because my ID is such a lame-ass high number.
  112. I heard about it on the radio... by SnakeStu · · Score: 1
    There's this great new way of finding what you want -- just type in Internet Keywords, and there you are! Right to where you wanted to go, every time.

    (Those who know my views on RealNames know I'm only kidding.)

    Having a database visible to a search engine depends greatly on the complexity of the database itself. Something simple (like the MySQL/Perl-driven Imprinted Products Source List) can be given a default list-everything URL that doesn't look like a script. As size and complexity increase, of course, that isn't feasible (or even desirable), but it might be adapted to display a representative SQL View of a complex database, with sufficient content to give the search engine the "meat" it needs.

  113. A partial solution by unquiet · · Score: 1

    Mozilla's Open Directory Project can always use more volunteer editors to index web sites into yahoo-like categories. Editors are expected to have knowledge about the cats for which they are responsible, so there's human judgement involved. I know, it's not as efficient as meta-tags and spiders, et cetera but humans are creating web sites (mostly). Maybe in the long run, humans are required to sort it all out properly. ODP data is open source (I think I'm using the term correctly), and used by many web directories.

    --
    Got a beef? Plug a name into the Bizarre Rumour Generator!
  114. Shouldn't we use the right extension for the file? by smileyy · · Score: 2

    It seems to me that having URLs with extensions of:

    .asp, .php, .phtml, .shtml, .pl, etc.

    is incorrect. What is being served is not an ASP script, nor is it a PHP script, nor is it a Perl program. It is, however, an HTML file (or a GIF, or a PDF, etc.), and should be labelled as such.

    If your server isn't smart enough to figure out how to generate the requested resource, and needs the generating program explicitly mentioned in the URL, then you need a smarter server. And if you aren't smart enough to figure out how to do this correctly, well...=)

    Remember, kids, a URL != a file. All the /. end user cares about is getting an article with the comments formatted appropriately. They don't care[1] if it's stored as a text file, or generated by Perl, or...

    [1] Well, they might care in a geek sense, but not in the way needed to read comments.

    --
    pooptruck
  115. The answer is yes. by Parity · · Score: 2

    The web is certainly becoming significantly more difficult to search, especially for informational content. Just -try- searching for information on a musician or an author... you'll get links to the likes of music.com, amazon.com, whatever-your-topic-is.com, with a little one-or-two paragraph blurb about the person, if you're lucky. Hundreds of links like this to every little virtually-hosted e-tailer out there. Somewhere, buried in all this, will be the informational content hosted on a personal webpage or at some non-profit organization. Anyway, so, that's the problem, or an aspect of it, we already know this.

    Good news! The solution is coming. Maybe the solution is here. google.com has their unique approach to web-indexing. Another method that's probably going to be tried sometime soon is to look at all the natural-language-processing technology that has been researched in the past twenty years, take the most efficient heuristics, and index pages by apparent-topic instead of by keyword.

    Then there are places like anipike.com - if it's a web page about Anime, it's on anipike, or it may as well not exist. I would -never- search the web for anything anime-related; I go through anipike.

    I'm really, really hoping that linux.com will become that useful to the linux community, but I don't think they're quite there yet. They may never be. Anipike is generally very fast to load, especially compared to linux.com ... probably 'cause it's mostly static pages and there are not so many anime fans as there are linux users. But that isn't really relevant; if linux.com is going to become the search-engine alternative for linux-resources, they need to respond quickly at all times of the day and night, otherwise 'Joe's Linux Links' is a better option.
    (Apologies to any Joe out there who is proud of his links page. :))

    Anyway, currently I still use search engines for Linux-stuff, but as I keep getting more and more hits on rpm files cluttering up the informational content, that may change soon. (Especially since I'm a debian user! I'm looking for information when I search the web, I know where my package is. :))

    --Parity

    --
    --Parity
    'Card carrying' member of the EFF.
    1. Re:The answer is yes. by MikeBabcock · · Score: 2

      I think you've mentioned the key here. The solution is the use of specialised search engines. I find music with http://search.mp3.de, I find hacks with http://astalavista.box.sk, etc.

      I look for most of my information with HotBot just because its advanced search option lets me really weed out the bad hits.

      --
      - Michael T. Babcock (Yes, I blog)
  116. What might work... by Rob+Kaper · · Score: 1

    Things such as... www.domain.com/index.php3?q=page&dummy=yawn.html So I named my .html page a bit odd... wasn't there also an <isindex> tag for stuff like this?

  117. But do php listings rate well? by diediebinks · · Score: 1

    Php/dynamic pages may be indexed by search engines, but do they show up well in the search results? After all, it doesn't matter if your page is indexed or not when the listing is buried thirty pages deep.

    I think it a distinct possibility that php/dynamic pages may be penalized in relevance scoring. Does anybody know of a php page that is indexed in the top 10 results for any search term? (Yahoo doesn't count - that's a directory.)

  118. Bad ideas I have read so far. by segmond · · Score: 1

    I have read quite a few comments suggesting that the key to searching might lie in putting some of the load on the sites themselves. This includes ideas on how to write HTML, XML code or whatever, or including a search engine at the site itself. The problem with this is that people are lazy. If someone wants to write HTML code, they just want to write code, not worry about whether it can be searched. Basically, HTML writers are like programmers: lazy people who would prefer that a function handled the error instead of having to write two lines of code to handle it themselves. We will have a successful engine when we have one that can search any content; that is, you don't need to write your HTML code in a particular way for it to be searched. You don't have to use special tags or whatnot for it to be searched. It is true that using special tags would be nice, but only .00001% of sites will use them. Search engines should never depend on whatever they search to tell them how to search it; they need to determine the structure of the data themselves.

    --
    ------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
  119. Other extensions by cleopatra · · Score: 1

    One extension you are missing, which I personally feel is a worthwhile addition, is .cfm and .cfml. Those of you who are fellow Cold Fusion programmers will already know how wonderful and powerful a web application server it is... and its growing popularity and recent expansion to support Linux are not to be ignored!

  120. Make lemonaide .... by Anonymous Coward · · Score: 0
    If the page/site is fairly static except for the dynamic content (ie logos, layout etc), why not use the dynamic part to your advantage ...

    put a "meta" field in the database and then use a cron job (or equivalent) to generate the core HTML at intervals to incorporate the meta field into the search fields? You couldn't go overboard, but if the script ignores blank fields in the database and you just drove it with "significant" items, you could make this work.

    Just a thought

    Hank

  121. Trivial... by pkj · · Score: 1

    The solution to this problem for developers who want their site indexed is anywhere from simple to completely trivial. There are several common methods:

    1. A generic solution that works in many cases regardless of server software and scripting language is to generate (rip) static pages from the database on a regular basis (daily) and link them hierarchically. Not only will this allow the search engines to pick them up, but it may dramatically reduce the load on your web server if users start pulling these static pages rather than requiring cgi/database hits for the pages.

    2. When using PHP, just make .html the extension for php scripts. This will cause all pages to be parsed by PHP and therefore incur some additional processing overhead, but the newer PHP parsers are quite speedy.

      If you have a loaded site where speed is an issue, you could use .html for plain text and .htm for php scripts. I also believe that it is possible to specify what files should be processed by PHP and which ones should not. Of course, it has been over two years since I have played with PHP...

    3. If using Apache/Perl, it would not be difficult to hack the CGI module (I'm sure this has already been done...) to look at an alternate cgi encoding other than .cgi?xx or .pl?xx and then hand the page off to the proper module. I assume this is what is done on the many sites that show URLs in the form of:

      http://www.foo.com/fakecgi_23,5584,448.html
    These are, of course, only the simplest and most obvious solutions. There are no doubt many more.
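    To illustrate the third method, here is a rough sketch of decoding a static-looking URL back into a script name and arguments. The pattern is an assumption modelled on the example URL above, not how any particular site or server module actually does it.

```python
import re

# Hypothetical encoding: /<script>_<n>,<n>,...html
FAKE_STATIC = re.compile(r"^/(?P<script>[A-Za-z0-9]+)_(?P<args>\d+(?:,\d+)*)\.html$")


def decode_path(path):
    """Return (script, [int args]) for a fake-static path, or None if it does not match."""
    match = FAKE_STATIC.match(path)
    if match is None:
        return None
    args = [int(a) for a in match.group("args").split(",")]
    return match.group("script"), args
```

    With this pattern, /fakecgi_23,5584,448.html decodes to the script name fakecgi and the arguments 23, 5584 and 448, while an ordinary path such as /index.html is left alone.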

    -p.

    1. Re:Trivial... by warmi · · Score: 1

      Your assumption is wrong. 23,5584,448.html is most likely a StoryServer-based site ...
      Check out www.vignette.com ...

  122. moderation category based search engine? by termigan · · Score: 1

    What we need is a way to reliably categorize web pages that doesn't involve roving the whole net. If the author assigns a category, it's a starting point. Then you can search just a category. These are a start:

    • Current events
    • E-magazine
    • Forum
    • Personal interests page
    • Commercial vendor
    • Ecommerce

    There are more major categories of course, and there would obviously be subcategories to hone the topic down. Essentially what we need is the web equivalent of the Dewey Decimal System, where a page or group of pages can be categorized and subcategorized. With categorization, self-reporting is less prone to misdirection because authors have to choose one. Then you create a Slashdot-like moderation culture to rate and correct the categories as search results are returned.

    Maybe you do still have to crawl the web to provide finer granularity than just categories, but perhaps crawling over provided links will help reduce the work load. After all, the whole internet isn't new every night. You could potentially prioritize crawling in some way as well.

    Maybe I should have patented that before I revealed it...

    --

    Today is all we really have. We should all live it well: it is our stepping stone to all of our tomorrows.

  123. Search engines are definitely showing their age by ITShaman · · Score: 1
    Scientific American devoted an entire issue some time ago (1997) to the Internet, and a significant part of it was on information classification. One article discusses, among other things, how librarians (who'd have ever thought us geeks would get along with librarians :-) and computer scientists should work together to bring some semblance of order to the chaos.

    All of this made a lot of sense even back in 1997, and I think that the issue is even more relevant now. Computer scientists know how to generate the content, and library science is very good at organizing and categorizing information, as well as indexing it for the easiest way to look it up. I read this issue from front to back twice, and it's a permanent part of my library.

    --
    I can no longer read Dilbert. It's too depressing, because it is too real. -- Hyperhaplo
  124. ColdFusion bites (sorry, but its true) by MagicMike · · Score: 1
    You're kidding right? Coldfusion has some of the lamest control-flow I've ever seen in a "programming language". Where's my switch? Nesting is a pain. Defining and calling functions (cfmodules, whatever) is a pain.

    I recommend a real language, like Perl or PHP or JSP or anything but ColdFusion. I had to do ColdFusion for 6 months and I just about went nuts.

  125. We're working on fixing this! by egnor · · Score: 1

    I work for a startup company, "XYZ Find" (www.xyzfind.com). We have nothing available yet, but we are developing an XML-based "search engine" that allows parametric search of data (in any XML schema!) on the Web. This will combine the advantages of full-text search engines (broad coverage, simple interface) with database query (precise parametric search, highly relevant results).

    This does require that database-driven sites expose their data as XML, but this is starting to happen already (look at RSS), and we believe (and hope!) that it is an increasing trend -- and one that will take off once XML search engines like ours are available. (We're not the only ones doing this, though our solution will of course be the best.)

  126. Search Protocol. by Trifthen · · Score: 1

    This may sound odd, but I think it would work. Why don't the web developers and programmers work together on this one, and create a draft standard protocol (rfc) that can handle searches. What do we know about searching the web?

    • There are static pages that can be changed daily, making indexed content practically useless in the long run. Dynamic pages are just a more obvious version of these.
    • Spider traps can kill a spider, or make a site unsearchable.
    • The internet is slow.

    These are really enough to get a handle on how this could work. By problem 3, it's obvious we can't send a keylist request to each server in the world, and get their response (though this would be the best solution for maximum search depth.) What we can do, however, is present servers the ability to contact whatever search engine is the main hub(s) and send a keyword based tree. This will allow a search engine to grab information instantly and give a list of sites with that keyword or description.

    Most likely, though, the answer will have to be two-fold. What about people who just send in infinite key words? The first search the engine finishes is the "domain->keyword list" search. From that point on, either by querying each individual server on the original match to get more extensive information, or hitting a previously cached crawl, context and relevance numbers can be fleshed out.

    The final result, then, would be this:

    Domain->keylist->internal|external lookup

    The structure would still allow results like what current engines give us, but much more up to date. The protocol could also include a "last update" kind of field, so internal data doesn't have to be updated for x days/months/years. I think if we work on it, it could happen. But it's the only real alternative. Indexing the entire internet just isn't possible.
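    A toy sketch of the hub side of such a protocol, purely to make the Domain->keylist idea concrete. Nothing like this is specified anywhere; the class and method names are invented, and a real draft would have to deal with abuse such as the "infinite key words" case.

```python
from collections import defaultdict


class KeywordHub:
    """Hypothetical hub that accepts keyword lists pushed by servers."""

    def __init__(self):
        self.by_keyword = defaultdict(set)  # keyword -> set of domains
        self.last_update = {}               # domain -> timestamp of last push

    def submit(self, domain, keywords, updated):
        # A new push replaces whatever the domain sent before.
        for domains in self.by_keyword.values():
            domains.discard(domain)
        for keyword in keywords:
            self.by_keyword[keyword.lower()].add(domain)
        self.last_update[domain] = updated

    def lookup(self, keyword):
        # First-stage answer: domains claiming the keyword, freshest first.
        domains = self.by_keyword.get(keyword.lower(), set())
        return sorted(domains, key=lambda d: self.last_update[d], reverse=True)
```

    The second stage (querying each matching server, or a cached crawl, for context and relevance) is left out here.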

    Just a thought.

    -- Shaun
    --
    Read: Rabbit Rue - Free serial nove
  127. Thoughts on searching.. by segfault_ · · Score: 1

    Well, this is not entirely on the topic of indexing dynamic content, but bear with me.. the increasing difficulty of getting relevant search results has long been a pet peeve of mine. There are several factors that make good results hard to find:

    1. sites that abuse meta tags and include pages of keywords just to get hit more - this makes results of keyword searches less relevant.
    2. the explosive growth of the web - making 'quality' sites more difficult to find amidst the deluge of junk
    3. dead links, outdated information, changed dynamic content indexed by search engines - the web changes too quickly for most engines to keep up

    So what can we do to get high quality, relevant results without weeding through pages of URLs? It's not easy, but I've been playing with an interesting approach. First off, different search engines use different indexing/ranking methods: keywords, meta tags, link count, traffic stats, user recommendations, human categorizing. By combining the results of several engines using different index methods, you can cross reference the results and see who appears on all the engines. This gives you at least *some* degree of assurance that the URL matches your query. These results are ordered by number of engines that reported the link.
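    The cross-referencing step can be sketched roughly as below. This is not the poster's code, just one way to do it in Python, assuming each engine's results have already been fetched into a list of URLs.

```python
from collections import Counter


def cross_reference(results_by_engine):
    """Order URLs by how many engines reported them.

    results_by_engine maps an engine name to its list of result URLs.
    Returns (url, engine_count) pairs, most widely reported first.
    """
    counts = Counter()
    for urls in results_by_engine.values():
        counts.update(set(urls))  # an engine counts once per URL
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

    Ties are broken alphabetically only to keep the output stable; any other tie-breaker would do.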

    Now that we have maybe 20 or 30 semi-relevant URLs, the next step (which I have not coded yet) is to retrieve these pages and parse them using natural language processing techniques. This should give a good idea what kind of actual content the page holds - i.e., is it an order form? a page full of pictures? a magazine article? a threaded discussion? etc. From that, and from stored or learned user preferences, a better list of results can be shown to the user.

    OK, so the easy part of this is done, and just automates some manual searching :) The drawbacks? Well, it still doesn't solve the problem of dead links very well, and it's slow (approx 30 seconds).. it has to hit 6 engines and collate and analyze results before you see anything. Adding the language parsing will make it even slower. Caching results in a database could speed up common searches, expiring them periodically and refreshing in the background..

    Is this where searching is headed? Maybe.. I don't pretend to know, and only started messing with it out of frustration. In any case, it seems to work pretty well already, and could probably be expanded into a pretty decent agent/search tool (open source, of course!) .. If anyone is interested in helping to develop such a beast, let me know.

    --segfault at netwinder dot org

    --
    "640k ought to be enough for anybody." -- Bill Gates ca. 1981
  128. How to get your dynamic pages indexed. by weave · · Score: 2
    Do search-engine spiders avoid you because your page addresses end in "forbidden" extensions like .cgi, .php3, etc? Do they ignore anything with a ? or & in the URL?

    The solution is easy. Don't use them in your URLs.

    Do not use GET args in dynamically built links, but hide your args in a longer plain ole URL. For example, a script at http://www/x/y can actually interpret http://www/x/y/z/ just fine and you can then parse off z as an argument.

    First, alias a directory that runs your CGIs, PHPs, etc. Like you would cgi-bin but don't call it that!

    Then, plant your cgi program(s) in there. The "arguments" further down would be in the PATH_INFO variable (which you'd have to parse out manually).

    So, in the case of http://www/aa/xx/yy/zz/ the script is in the aliased /aa directory. The script is named xx and the PATH_INFO passed to it, in the above example, would be /yy/zz/
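    For anyone whose CGI is written in Python rather than Perl or PHP, parsing the arguments out of PATH_INFO might look like this minimal sketch (the function names are made up):

```python
import os


def path_args(path_info):
    """Split a PATH_INFO value such as '/yy/zz/' into ['yy', 'zz']."""
    return [part for part in path_info.split("/") if part]


def args_from_environ(environ=os.environ):
    # A CGI script receives PATH_INFO through its environment.
    return path_args(environ.get("PATH_INFO", ""))
```

    For the example above, the script xx would see the arguments yy and zz.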

    This works with Apache. Don't have Apache? Upgrade today at www.apache.org :-)

    1. Re:How to get your dynamic pages indexed. by Anonymous Coward · · Score: 0
      "You follow established standards, yet search engines still ignore you? This is your problem, not the search engine's problem."

      I'm amazed to see that so many people on /. have this attitude. Why bother to have standards at all?

    2. Re:How to get your dynamic pages indexed. by drago · · Score: 1

      I think the simplest way to achieve the goal is to make a static frameset with appropriate META tags. This will get indexed, no matter what content the sub-pages have.

  129. How To Create A Indexeable DB Driven Site by carbon60 · · Score: 1
    We simply build the site so that all content is accessed by virtual URLs.

    So http://www.somecompany.com/en/products/doors/aluminium is actually passed to a script which separates the arguments from the URI and then builds the aluminium door product page with the English template. It works rather well, and everything can be configured to be text/html and .html without problem.

    An example in PHP is available at http://www.phpbuilder.com/columns/tim19990117.php3.

    A.

    --
    Adam Sherman
    Freelance Geek
    1. Re:How To Create A Indexeable DB Driven Site by sysop · · Score: 1

      You can go one step further with the Apache ForceType directive: there's no reason why you can't end such a URL with .html, so it looks just like a normal page.

      We did this with a newspaper site that wanted to keep all the old articles online, and have them indexed, yet still run fresh ads and links on them. We also used an intermediate 'virtual' directory, which allowed different URL's for different entry points, to the same content. Many Thanks to the authors of PHP.

      Another hint is to set an Expires: header in the future so that the search engine thinks that the content is static.
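      As a rough illustration of that last hint, this is one way to build such a header line in Python. The helper name and the number of days are arbitrary; whether any given search engine pays attention to the header is not something this sketch can promise.

```python
import time
from email.utils import formatdate


def expires_header(days_ahead, now=None):
    """Return an Expires header line set days_ahead days in the future."""
    now = time.time() if now is None else now
    # usegmt=True produces the "GMT" form of the date that HTTP expects.
    return "Expires: " + formatdate(now + days_ahead * 86400, usegmt=True)
```

      A script would print this along with its Content-Type line before the page body.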

  130. It's harder than you think... by Anonymous Coward · · Score: 0

    ...because one person's "fluff and spam" is another person's "real information".

  131. An open-source style solution by Ped+Xing · · Score: 1

    Several years ago it became clear that the net was growing too fast for search engines to be able to keep up. At that time I came up with a design to solve the problem of scaling that reflects the open-source solution: through volunteers.

    You have two categories of volunteers, the Spinners and the Weavers. The Spinners each voluntarily search some small part of the web via a spider each night. The Weavers each publish to the Spinners a list of queries that they are interested in. When a Spinner's spider hits a new page that matches a query, or receives a new query that matches a previously indexed page, it sends an email to the Weaver. The Weaver can look over the web pages coming in and create web sites that provide easy access to those pages, as they apply to the particular subject of the site.
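    The matching a Spinner would do can be sketched like this. It is a simplification under several assumptions: one query per Weaver, a query matches when all of its words appear on the page, and notifications are returned as a list instead of being emailed.

```python
class Spinner:
    """Hypothetical Spinner: indexes pages and matches Weavers' standing queries."""

    def __init__(self):
        self.queries = {}  # weaver address -> set of query words
        self.pages = {}    # url -> set of words found on the page

    @staticmethod
    def _matches(query_words, page_words):
        return query_words <= page_words  # every query word is on the page

    def add_query(self, weaver, query):
        """Register a query; return notifications for already indexed pages."""
        words = set(query.lower().split())
        self.queries[weaver] = words
        return [(weaver, url) for url, page_words in sorted(self.pages.items())
                if self._matches(words, page_words)]

    def index_page(self, url, text):
        """Index a newly spidered page; return notifications for matching queries."""
        page_words = set(text.lower().split())
        self.pages[url] = page_words
        return [(weaver, url) for weaver, words in sorted(self.queries.items())
                if self._matches(words, page_words)]
```

    Both directions described above are covered: a new page is checked against existing queries, and a new query is checked against previously indexed pages.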

    I sent this suggestion to the Open Directory people, since I think it is a perfect tie-in to their concept. The editors in the Open Directory project would be the Weavers who could separate the wheat from the chaff. Unfortunately, I never heard from them and let the idea die.

    I'd be willing to resurrect it if others are interested. Feel free to send me an email.

  132. It's easy to make your PHP pages indexable! by Anonymous Coward · · Score: 0

    From the PHP Knowledge Base:

    How can I pass variables in a form that won't scare off search engines?

    Mailing List, Nathan Wallace
    Jun 28th, 1999 06:04

    It's easy. Just use:

    /local.php3/var1/var2

    Then in your PHP code parse the url and extract the variables. This works since Apache finds the local.php3 script and ignores the rest of the url.

    For example:

    http://www.server.com/page.html/wilma/betty

    and then in the script:

    $res = explode("page.html/", $REQUEST_URI);
    $vars = explode("/", $res[1]);

    $fred = $vars[0];
    $barney = $vars[1];

    If you want to get rid of the ugly .php3 in the middle of your url take a look at the Apache ForceType directive.

    http://www.apache.org/docs/mod/mod_mime.html#forcetype

  133. Re:Shouldn't we use the right extension for the fi by Neil · · Score: 2

    The client shouldn't infer the type of the object based on an "extension" in the URL at all ... that is what the Content-Type header is for!

  134. What exactly do you expect the search engines to do? Okay, they can't instantly index new dynamic content: I don't expect them to.

    If I'm looking for news about the plane crash in Yagadoodlestan that killed 8,000 people, I'm not going to go to a search engine and type in "Yagdoodlestan plane crash", I'm going to go to the New York Times and see if they have an article, or an AP story in the margin. If I'm looking for a review for the G88-superdooper motherboard that I'm thinking about buying, I don't go to a search engine and type "G88-superdooper motherboard review", but I might type "computer hardware motherboard review" and expect to get links to a bunch of hardware review sites, any decent one of which I'd expect to have a review of the product I'm looking for.

    Conversely (inversely? whatever...) no one is going to type "international news" into a search engine, even though that might be the best way to find the NYT, they're going to go there because they heard about it from someplace else.

    An example from my own recent web-browsing life: I heard about some site called bluesnews from a Quake'n friend a couple years ago, so I check it out. They have links to articles at some place called /. ("/.," I say, "what the $&*# is that?") so I check it out. I bookmark it, and now I'm happy. No search engine required. Last night I see the ad for Man on the Moon and start talking with my mother about Andy Kaufman "You didn't hear Mom? He died." "When," she says, "and from what?" I dunno, so I go to google and I type "Andy Kaufman" and find the answer.

    So what, exactly, do you expect these search engines to do? Sites like the New York Times, BluesNews and SlashDot serve one purpose (bringing me news about topics I care about on constantly updated dynamic pages) and sites like AndyKaufmanFansOfIdaho.com bring the occasional bit of static trivia goodness when I need it.

    Works for me, what are all of you doing???

    "God does not play dice with the universe." -Albert Einstein

    --
    Those who fail to understand communication protocols, are doomed to repeat them over port 80.
    1. Re:??? by jo44 · · Score: 1

      ...and that's the difference between people who know how to find what they're looking for on the web and those who don't.

      That's the way I use the web too. Currently that's probably the only intelligent way to do it. But the real question is, does it have to be that way? Why can't we come up with a better way, so that I can type in "Yagdoodlestan plane crash" on the day after it happened, and find links to news articles, from around the world, that cover the event? I'd be lying if I said that the above mentioned search technique always worked for me, especially when trying to look for something more obscure. I'm also tired of getting 100 links to the same damn site when I search for things.

  135. Vast Content and Distributed Indexing by ReadParse · · Score: 1

    I saw somewhere a few months ago (don't you love it when people really back up their information like that?) that the growth of the web vs. the available indexing technology meant that only about 4% of the web was being indexed. Goodness, that's a surprisingly low number, isn't it? I've heard it mentioned, and have often mulled over the idea myself, that some sort of distributed indexing is probably the next logical step. With the apparent successes of distributed.net and SETI@home, this is at the very least intriguing.

    So let's just say, for the sake of discussion, that I had some time on my hands and the motivation to see a monster search database project through (these are both very hypothetical points). I could create the central database and write some client code. My dedicated AOV (Army of Volunteers) could come in veritable droves to download the client code and join the team (which they certainly would, right? Right?!?!?)

    Anyway, their client would initialize and get one of the starting URLs from the root database and go to town indexing and spidering. Shouldn't be too much of a bandwidth hog, since it will be just text, but it would be constant. Maybe not a good idea to do this with an analog modem. When the client has "eaten its fill", or once a day, or something like that, it would slam its content my way.... ah, there's a potential problem. That's a lot of content. Well, it's all text, so my client could maybe get some decent compression out of it by gzipping it up. Still, it wouldn't exactly be trivial.

    Thinking on about this, why will it help me to have my AOV doing these HTTP transactions for me when my server could do them its own damn self? Surely my server would have a big enough pipe that the bandwidth wouldn't be a problem, and I could start any number of processes. What's the big difference between web indexing and projects like distributed.net and SETI@home? Ah, it's processing power. That's what's required for the "traditional" distributed application. That kind of number crunching isn't helped by bandwidth... you need such a honkin' processor to do all that chewing that it's not cost-effective to create the system... it makes a whole lot more sense to distribute the work to any number of "normal" machines, thereby simulating a "super computer". That's right... it's all coming back to me now.

    So, what would be helped by distributing the web indexing process? And wouldn't the smart fellas at Google or AltaVista have thought this through by now and come out with some sort of beta? Hmmmmmmmmmmm......

    What? You actually read this whole thing? Sheesh. That's impressive. Oh well, might as well moderate me up :)

    RP

    1. Re:Vast Content and Distributed Indexing by Anonymous Coward · · Score: 0

      The problem as I see it is to somehow track the attention of users, and do it in an automated fashion. One solution could be to integrate an internet browser and a search engine into an information gathering and dissemination tool. Netscape has already taken the first baby step towards this with its "what's related" button, which tracks the link-hopping of Netscape users, distills it, and spits out a short list of sites that are related to the site you are currently browsing.

      I envision doing this in a much more systematic and relevant fashion and feeding the results back to a search engine. The browser would keep track of your progress through the internet, and most importantly keep track of the attention you give to the sites you link-hop to. The easiest way to do this would be to measure the time spent on a link that is followed. If you follow a link and then hit the back button after 2 seconds, then obviously the site was not too relevant. But if you spend time on the site (presumably reading the text) and dig deeper into the site, the browser can measure that as well. The browser could check that the mouse is moving to figure out whether you are actually paying attention or have just left the page up during an interruption. (DirectHit seems to have implemented a crude version of this idea.)

      Then, the next step would be for the browser to feed this information back to a central search engine site, where it could be combined with Google or AltaVista type information. There the history of pages viewed in the browser, and the links followed, could be combined with the pages gathered and indexed with web-search 'bots. One function would be to "tell" the search engine about obscure sites for further follow-up - every site gets viewed by somebody's browser eventually. The other function would be to provide a "roadmap of significance" - measured by the aggregated activity of millions of users.

      If Google can weight significance by measuring the weblinks "upstream" of any given site, the time and attention data gathered from the "downstream" of any given site should be able to be evaluated as well. While in the long run expert systems and neural nets may be able to take up some of the slack, in the medium run, tapping into the aggregated individual interest in the internet could be a way to fill the gap.
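      A small sketch of the aggregation step, assuming the browser reports each visit as a URL, the seconds spent on it, and whether the user seemed active (mouse movement, say). The 2-second cutoff is taken from the back-button example above; everything else is invented for illustration.

```python
from collections import defaultdict

MIN_DWELL_SECONDS = 2.0  # assumed cutoff for "hit the back button right away"


def attention_scores(visits):
    """Sum attended time per URL.

    visits is an iterable of (url, seconds_on_page, was_active) tuples.
    Immediate bounces and idle pages are ignored.
    """
    totals = defaultdict(float)
    for url, seconds, was_active in visits:
        if not was_active or seconds <= MIN_DWELL_SECONDS:
            continue
        totals[url] += seconds
    return dict(totals)
```

      A central site could then merge such totals from many users into the "roadmap of significance".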

  136. Content-Free Weighting by quux26 · · Score: 1

    Why not just have an index that weights the site using a text:images ratio. If it has 120k of images and 2k of text, assume it's content-free.
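    As a quick sketch of that check: the threshold below is an arbitrary assumption, picked only so that the 2k-of-text, 120k-of-images example counts as content-free.

```python
def looks_content_free(text_bytes, image_bytes, min_ratio=0.05):
    """Guess whether a page is mostly pictures, using a text:images byte ratio."""
    if image_bytes == 0:
        return False  # no images at all, so the ratio says nothing
    return text_bytes / image_bytes < min_ratio
```

    A page with 2k of text and 120k of images has a ratio of about 0.017 and is flagged; one with 20k of text and 30k of images is not.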

    My .02
    Quux26

    --

    My .02
    Quux26
    www.crashspace.net
  137. Re:Spider traps ... been there, accidentaly by duplicateAccount · · Score: 1

    I remember building a similar trap by accident.
    It's much simpler to do than your recipe:

    I did a content management system which did not impose an inherent hierarchy upon the data (it had a net instead).

    The nodes were presented to browsers like directories in web servers.

    Poor search engine - did not believe in circular links between directories. :-(

    As for the handling: after retrieving several hundred M (out of my 2M test data) it stopped and came back the next day. I could not afford that...
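    For what it's worth, here is a minimal sketch of how a spider can protect itself: remember what it has already fetched, and cap the number of pages per site. The get_links callable stands in for fetching and parsing a page; it and the budget are assumptions of the sketch.

```python
from collections import deque


def crawl(start, get_links, max_pages=1000):
    """Breadth-first crawl that survives circular links and endless sites."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # circular links are skipped here
                seen.add(link)
                queue.append(link)
    return order
```

    The visited set only helps when a cycle leads back to the same URL. If the same node shows up under ever longer paths, as with a net presented as directories, only the page budget stops the crawl.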

  138. Indexing by number of links to a URL by Anonymous Coward · · Score: 0

    Where was I reading this? Maybe it was someone's thesis paper.

    Basically, the idea is that you collect all links and some basic keyword scanning (could be meta tags) and then you build an index based on the fact that many pages that claim to be about keyword X all link to page Y (any URL extension, dynamic or not) which is also about X. The more links there, the more likely that content is something valuable. It sort of polices itself and results in the top search item coming back as the most linked-to page.
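    A rough sketch of that counting scheme, assuming the keyword claims and outgoing links of every page have already been collected. It leaves out the check that the target page itself claims the keyword, and it is not meant to describe how any real search engine ranks pages.

```python
from collections import defaultdict


def rank_by_inbound_links(pages):
    """Rank link targets per keyword by how many claiming pages link to them.

    pages maps a URL to (set of keywords it claims, list of outgoing links).
    Returns keyword -> list of target URLs, most linked-to first.
    """
    votes = defaultdict(lambda: defaultdict(int))
    for url, (keywords, links) in pages.items():
        for target in set(links):  # a page votes once per target
            if target == url:
                continue           # self-links do not count
            for keyword in keywords:
                votes[keyword][target] += 1
    return {keyword: sorted(targets, key=lambda u: (-targets[u], u))
            for keyword, targets in votes.items()}
```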

    Very slick, but I haven't seen anyone implement it yet. --ds

    1. Re:Indexing by number of links to a URL by Mike+A. · · Score: 1
      You're probably thinking of Google - a search engine which uses a technique that's more or less what you describe.

      Of course, one of the effects of this indexing technique is to encode rather interesting ideas about who is, say, more evil than Satan...

      --

      --
      Do I look like I speak for my employer?
  139. This is why we need Open Source indexing engines by mparaz · · Score: 1

    I haven't found a recent Open Source indexing engine that could do 1/10th the scale of Google assuming you had the hardware to spare. If there were, then folk can run Open Source indexing engines on small parts of the net (distributed by network topology or geographically) and a meta-index can handle those. Then we have local customizations for dealing with dynamic content.

  140. Google, with a twist would do it by jovlinger · · Score: 2

    Think Google.

    Google works on the idea that pages that have a lot of incoming links are authorities on what they discuss, so they should be ranked highly.

    A modification of this is to not only rank a site's authoritativeness (eh?) this way, but also what kind of content it has. So if 10K geeks all have homepages that include the words "geek" and "computer" and also point to /., it is reasonable to assume that regardless of actual content today, /. typically is a good result to return for the search "geek sites".

    Of course, some of those homepages will also have the words "tennis" and "knitting", that will be spuriously attributed to /., but the idea is that they will be outliers, and drown in the noise.

    This basically is keyword indexing, but the keywords are dynamically determined, rather than using the broken meta tags.

    The big problem with this approach is implementation; the association tables are likely to be huge.

    Also, you assume a large sample size, so that the outliers will cancel.

    Johan

  141. What problem? by Anonymous Coward · · Score: 0

    Actually I don't see a problem. Stuff which most people are looking for is getting more plentiful - so what if you can't find all of them - they're usually highly redundant - 1000 people putting out the same info. Think of these as backup copies.

    For the rest of us, we usually know how to find what we are looking for... If it really is there but it can't be found it probably wasn't that good anyway- if the page isn't linked to by anyone, and the author can't figure out how to get it indexed or hasn't bothered, then either the content is not worth reading, or the author doesn't want it to be read (doh).

    Anyway, if they want they can always decentralise things:

    For example every site could have a standardised search engine helper which will allow a keyword-url list to be dumped out, possibly in a compressed format.
    e.g.
    list of URLs=
    urla,urlb,urlc,urld
    list of Site specific keywords
    (nonstandard keywords)
    keyword1,keyword2,keyword3
    Site specific keyword to URLs
    X1=a,b
    X2=a,b,c
    X3=d
    Standard keywords to URL list
    S1=b,c,d
    S2=d,e
    (don't need to list standard keywords - understood)
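    A sketch of how a search engine might read the keyword lines of such a dump. The format is only loosely described above, so this assumes one "key=value,value" pair per line and skips everything else; the function names are made up.

```python
def parse_keyword_map(lines):
    """Parse lines such as 'X1=a,b' into {'X1': ['a', 'b']}."""
    index = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if not sep or not key or not value:
            continue  # headings, comments and empty assignments
        index[key.strip()] = [item for item in value.split(",") if item]
    return index


def invert(index):
    """Turn keyword -> URLs into URL -> keywords."""
    by_url = {}
    for keyword, urls in index.items():
        for url in urls:
            by_url.setdefault(url, []).append(keyword)
    return by_url
```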

    A more sophisticated version would list categories which the site is about. Then when the search engine is searching for some unlisted keyword but in that category, it can actively query the site's search engine on that keyword.

    e.g.
    cat.scanner.medical
    (not cat.pets)

    You could have something like DNS, but by category and keyword.

    If idiots want to be in every category fine. People could always put -dumbsite in their search.

    Cheerio,

    Link.

  142. Fear of the unknown by jafac · · Score: 1

    There seem to be two schools of thought here: the folks who do searches and are satisfied with what they get, and the folks who KNOW how searching works, and the breadth of information that exists, and KNOW that there's technically no way all of that information is going to be included in any search.

    It's unsettling to these "second school" people, because it's like looking for something in a library, knowing that you can't even go into 90% of the rooms of books and scrolls and papers.

    Computers are supposed to be our security blanket that no information is out of our reach, or ever becomes lost. Unfortunately, this is "capital-R Reality", and even with the great equalizer of the internet, you just plain can't have it all. Steven Wright said, you can't have it all, where would you put it? The computer answers, digitize it, and put it online. All you need to do is build enough disk drives.

    It IS a noble goal. And perhaps even realistic. But not with our current technology, and system of management (ad hoc/capitalist survival of the fittest standard) of that technology.

    I wish I had a nickel for every time someone said "Information wants to be free".

    --

    These are my friends, See how they glisten. See this one shine, how he smiles in the light.
  143. And How Do You Proxy Them? by try67 · · Score: 1
    An additional problem that originates in the move to dynamic, query-created web pages is that it makes proxying obsolete.
    How will a proxy server treat PHP/PHTML/PL/etc. files? Probably ignore them, or simply download them (which is worse...)
    If I load /. I get all of my slashboxes just the way I set them to appear, but if someone else tries to access /. from my LAN, the proxy will know that a whole different page (according to his prefs) needs to be fetched, and will therefore re-get the entire HTML code...
    This also can't be solved by traditional methods such as telling the proxy to refresh the page it has every X minutes, since every user demands a different page (either by query or by cookie prefs).

    I have given this subject some thought and came up with an idea:

    Have the proxy store the cookies and then download the pages according to them: User accesses site via proxy server -> proxy server loads user cookie file -> server checks current stored page -> server downloads requested page to a dedicated \user\site dir, if necessary -> server updates latest time of page load -> server sends page to the user's browser.
    If this is implemented, proxies will be much more efficient and could be used to further reduce bandwidth load.
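    The flow above amounts to a cache keyed by user and page. A minimal sketch, assuming the cookie can be reduced to a string and that fetch(url, cookie) does the actual download; both are placeholders, not part of any real proxy.

```python
import time


class PerUserCache:
    """Hypothetical proxy cache keyed by (cookie, url) with a maximum age."""

    def __init__(self, fetch, max_age=300.0, clock=time.time):
        self.fetch = fetch      # callable(url, cookie) -> page body
        self.max_age = max_age  # seconds before a stored page is refetched
        self.clock = clock
        self.store = {}         # (cookie, url) -> (time fetched, body)

    def get(self, cookie, url):
        key = (cookie, url)
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]     # fresh copy for this user
        body = self.fetch(url, cookie)
        self.store[key] = (now, body)
        return body
```

    Two users asking for the same URL get separate entries, so the saving only shows up when the same user comes back to a page.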

    --

    To the fool, he who speaks wisdom will sound foolish. ---Euripides
  144. What problem? by Anonymous Coward · · Score: 0

    Actually I don't see a problem. Stuff which most people are looking for is getting more plentiful - so what if you can't find all of them - they're usually highly redundant - 1000 people putting out the same info. Think of these as backup copies.

    For the rest of us, we usually know how to find what we are looking for... If it really is there but it can't be found it probably wasn't that good anyway- if the page isn't linked to by anyone, and the author can't figure out how to get it indexed or hasn't bothered, then either the content is not worth reading, or the author doesn't want it to be read (doh).

    Anyway, if they want they can always decentralise things:

    For example every site could have a standardised search engine helper which will allow a keyword-url list to be dumped out, possibly in a compressed format.
    e.g.
    list of URLs=
    urla,urlb,urlc,urld
    list of Site specific keywords
    (nonstandard keywords)
    keyword1,keyword2,keyword3
    Site specific keyword to URLs
    X1=a,b
    X2=a,b,c
    X3=d
    Standard keywords to URL list
    S1=b,c,d
    S2=d,e
    (don't need to list standard keywords - understood)

    A more sophisticated version would list categories which the site is about. Then when the search engine is searching for some unlisted keyword but in that category, it can actively query the site's search engine on that keyword.

    e.g.
    cat.scanner.medical
    (not cat.pets)

    You could have something like DNS, but by category and keyword.

    If idiots want to be in every category fine. People could always put -dumbsite in their search.

    Search services could always put "filter out this site" links in the results screen.

    Cheerio,

    Link.

  145. Oh my! (Was:Re:Spider traps ...) by tagish · · Score: 1
    Since opening fifteen short minutes ago, the spider trap at this little Servlet has taken several hundred hits and the hit rate seems to be climbing pretty steadily. Advertising anyone? ;-)

    --
    Andy Armstrong
  146. Re:Spider traps ... already exist -- wpoison by shub · · Score: 1

    They've been around for a while. Ron Guilmette created wpoison a while back. There's even a Wired story about it.

    Unfortunately, wpoison appears to have since disappeared, although Ron never mentioned this to me.


    Interestingly, I found out all this information doing a simple Google search on "wpoison". ;-)

    --
    Brad Knowles
    http://daily.daemonnews.org/ -- if you're not
  147. I just... by Anonymous Coward · · Score: 0

    trained an infinite number of monkeys to browse for what I'm looking for, and they always find it in no time (literally!)

  148. Are Meta-tags dead? by Royster · · Score: 2

    I have a few (very pertinent) meta-tags on the information page for a mailing list that I run. The tags are designed to get hits from people looking for my list. But it seems that the meta tags don't work in some of the major search engines. Perhaps the engines have caught on to the practice of embedding superfluous tags in order to get hits. I think I'll have to rework my page to make sure that the key phrases I'm trying to get hits on actually appear in the text.

    --
    I have discovered a truly marvelous sig, unfortunately the sig limit is too small to contain i
  149. Dammit, I Previewed this Post! by ReadParse · · Score: 1
    Well ain't that a kick in the head! I had two links in this post, both were checked and previewed. Now I see the post on the site and my last link is one of those incredibly annoying endless links that was never closed.

    I promise, I previewed and it was fine. Any further critiquing of the link problem in my post will be superfluous :) Thanks.

    RP

    1. Re:Dammit, I Previewed this Post! by Anonymous Coward · · Score: 0

      Feel not bad... there are several bugs in the Slashdot posting system which I have reported but which The D00DZ refuse to deign with even a denial, never mind a repair. Perhaps you hit one of them. Basically, it is unsafe to use the form that is returned to you in a Preview because the TEXTAREA is improperly initialised. If your browser lets you, you should always go back to the initial form.

  150. Re:Shouldn't we use the right extension for the fi by smileyy · · Score: 2

    Yes, but many file systems, which may be the destination of the results of the HTTP request, *do* make use of extensions to determine file type. Though, perhaps, storing MIME-type meta-information would be better, we're stuck with what we've got.

    Also, I mean the URL to also be used as a user-interface. For example:

    http://slashdot.org/99/12/14/1154243/comments

    would generate your browsers's preferred format, whereas requests to:

    http://slashdot.org/99/12/14/1154243/comments.pdf
    and
    http://slashdot.org/99/12/14/1154243/comments.scml

    would return the PDF and the Slashdot Comment Markup Language (an XML app) respectively. This could be done with content-type markers, but the interface is much poorer than simply using file extensions.
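
    A rough Python sketch of the dispatch described above: strip a known extension off the URL path and use it to pick the output format, falling back to the browser's preferred type. The extension table, the MIME type chosen for .scml and the function name resolve are assumptions for illustration only.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension table; ".scml" stands in for the imagined
# Slashdot Comment Markup Language mentioned in the post.
FORMATS = {
    ".html": "text/html",
    ".pdf": "application/pdf",
    ".scml": "application/xml",
}


def resolve(url, preferred="text/html"):
    """Return (resource path, content type) for a requested URL."""
    path = urlparse(url).path
    base, ext = posixpath.splitext(path)
    if ext in FORMATS:
        return base, FORMATS[ext]
    # No recognised extension: serve whatever the client prefers.
    return path, preferred


print(resolve("http://slashdot.org/99/12/14/1154243/comments.pdf"))
print(resolve("http://slashdot.org/99/12/14/1154243/comments"))
```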

    --
    pooptruck
  151. This is no problem for mod_perl by Anonymous Coward · · Score: 0

    Anyone seriously intending to build professional web domains should consider these tools: mod_perl (http://perl.apache.org) and HTML::Mason (http://www.masonhq.com). With those I very, VERY easily build dynamic .html content, even creating virtual directories and files from databases.

    1. Re:This is no problem for mod_perl by eAndroid · · Score: 1

      yup it is true. i agree.

      --

      I can't spell or type, but that doesn't mean I'm unusually stupid.
  152. Another reason to only index the splash page. by maney · · Score: 1

    It seems that everyone and their brother is out there trying to get every single page on their site indexed in the engines. This is the wrong approach for many reasons: it helps proliferate link rot and search engine database bloat, it increases the time it takes for a spider to "index the web", it decreases the effectiveness of the search engines, etc. The better way to resolve this is to only index the "splash", or first, page of a site. Index that page completely. That page should contain all of the keywords and such necessary for a search to find every relevant item or page on your entire site. That way the spiders only have to index a MUCH smaller portion of the web and will still return all of the relevant information in a much quicker time with much smaller databases, while at the same time allowing the restrictions on number of keywords and size of descriptions to be greatly increased. Of course this requires web designers to actually have some sort of interest in the public good so that they provide good, valid keywords and information as well as decent navigation internally to the site. It also creates problems for people who are not on their own domain, but are merely someone's ~. I think though that these things can be ironed out in a re-write of the robots.txt file format.

  153. Self-generated index by www.thefish.com · · Score: 1

    It seems to me, after all the discussion, that webmasters/hosts should have to generate their own index of a site based on certain criteria (of course).

    For instance, I run a little program called swish-e (which some of you may have heard of, if not, check it out) to run small search engines on several of my sites. What if every host/domain/site had to run a "swish-e" index of their space and post a "spiderindex.txt" file in their main directory. Generate the index for the spider before it even gets there.
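
    To make the idea concrete, here is a small Python sketch that builds the text of such a file from a site's pages. The one-line-per-word layout is invented for this example; it is not swish-e's index format, and "spiderindex.txt" is only the name proposed above.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Return index text with one 'word path path ...' line per word.

    pages maps a URL path to that page's plain text.
    """
    index = defaultdict(set)
    for path, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path)
    lines = []
    for word in sorted(index):
        lines.append(word + " " + " ".join(sorted(index[word])))
    return "\n".join(lines)


pages = {
    "/a.html": "Linux search",
    "/b.html": "search engines",
}
spiderindex = build_index(pages)
print(spiderindex)
```

    A spider could then fetch this one file instead of crawling every page, at the cost of trusting the site to describe itself honestly.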

    That would open up a whole new can of worms, probably, and add another layer of complexity to creating your own site, which is a good thing IMHO. Professionally produced sites, or those produced by the web-wise at least, with "spiderindex.txt" in them get indexed better than Joe Smith's personal home page with a few meta tags.

    -Mark

    --
    -- I lived through the IPO Rush of '99
  154. OK, I've Gotta Test It by ReadParse · · Score: 1
    Go ahead and moderate this down... it doesn't matter to me. I just looked at the source of my previous post and not one, but BOTH of my links are missing the closing anchor tag, and it's just not possible that I would miss both of them... besides, they were right in the preview.

    So here we go, two links: One with a capital A in the closing tag and one with a lowercase a:

    slashdot test | slashdot test

    And now some additional text to see if the link turned off.

    OK, I just previewed it and it's perfect. Now submitting...

    Sorry for the test, but you gotta relieve curiosity.

    RP

  155. Oingo.com and Ontology based searches by winterstorm · · Score: 1
    There are technologies that will let us search very large databases effectively. Oingo uses what might be called an ontology-based approach to searching; however, its knowledge base is pretty small right now.

    I'm surprised nobody has licensed the Cyc software/ontology for use in web indexing. Actually I could be out of date and someone might have already!

    The key to good indexing and search lies in scanning for knowledge and not "words". Unfortunately, more and more webpages are designed to be as noisy as possible and contain little information. For example, millions of webpages contain navigation menus, but the "knowledge" of what can be navigated is stored as images, which is completely useless... the "knowledge" is completely lost and indexing is difficult.

    There needs to be more use of meta-data in web pages if we want to index them for the knowledge they contain. Until we can index them we can't search them.

  156. Distributed databases: Add to Apache Web server by gbnewby · · Score: 1
    I'll make this quick - I didn't see this type of suggestion in other postings, so hope I didn't miss it.
    • Consider: The process now is searching for PAGES.
    • Consider: The process for the future will probably need to be searching for COLLECTIONS first, then pages within the collection.

    Possible scenario: Apache httpd gets a couple of add-ons that speak Z39.50 (protocol for distributed searching). The search engines build a database of what these Web servers say is on their site (could be multiple collections; could be dynamic content....).

    An information seeker would use a search engine to determine which Web servers (aka, "collections") might be appropriate for the query. Then, the query could be delivered to the best servers (for searching on their own sites).... The main benefits are:

    • Scaling: search millions of servers, instead of billions of pages
    • Customized search methods for each collection. Some might assign particular key words, have methods for searching images or multimedia, etc. - gone is the "generic" keyword search
    • Currency: instead of searching on a centralized database that's never current, you'd be searching at the Web server level, which could be as current as that site's maintainer wants it.

    Drawbacks: slow speed of the net & slow remote servers; porn and other misrepresented content; need to integrate and rank results from multiple Web servers...

    We already have the protocol for this type of searching (Z39.50 - remember WAIS?); the next logical phase is to integrate it into our most common tools, especially Apache.
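
    The two-step flow (pick collections first, then let each collection search itself) can be sketched in Python as below. This is only a toy: in-memory dictionaries stand in for remote servers, nothing here speaks Z39.50, and the host names, topic sets and function names are invented for the example.

```python
# Each "server" advertises its topics and can search its own pages.
COLLECTIONS = {
    "library.example.org": {
        "topics": {"books", "history"},
        "pages": {
            "/cat/1": "history of rome",
            "/cat/2": "history of printing books",
        },
    },
    "astro.example.org": {
        "topics": {"astronomy", "galaxy"},
        "pages": {"/m31": "andromeda galaxy radial velocity"},
    },
}


def pick_collections(terms):
    """Step 1: choose servers whose advertised topics overlap the query."""
    return [host for host, c in COLLECTIONS.items() if c["topics"] & terms]


def search_collection(host, terms):
    """Step 2: the server searches its own pages (here: simple term overlap)."""
    hits = []
    for path, text in COLLECTIONS[host]["pages"].items():
        score = len(terms & set(text.split()))
        if score:
            hits.append((score, host + path))
    return hits


def federated_search(query):
    terms = set(query.lower().split())
    hits = []
    for host in pick_collections(terms):
        hits.extend(search_collection(host, terms))
    # Merging and ranking across servers is the hard part; this just sorts by score.
    return [url for score, url in sorted(hits, key=lambda h: (-h[0], h[1]))]


print(federated_search("history books"))
```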

    1. Re:Distributed databases: Add to Apache Web server by otisg · · Score: 1

      The hard part is persuading Apache team to include something like this in the Apache package....

      --
      Simpy
  157. Databases pages useful? by hawkfish · · Score: 1

    I was just wondering how useful database pages are. I suppose some of them are (e.g. the local library) but most of them are BarnyCorp's list of useless widgets and I'm rather glad that the engines don't index that stuff...

    --
    You will not drink with us, but you would taste our steel? - Walter Matthau, The Pirates
  158. I never metadata I didn't like... by meta4 · · Score: 1

    Anyone ever heard of metadata? Instead of indexing every word in a document we should be capturing accurate, relevant metadata about it and facilitating the searching of that. As non-text content increases (like mp3s, videos, images, audio streams, animations, etc.) on the internet, the need for a new search paradigm increases as well. Of course there's the Dublin Core, but much more interesting to me is the IEEE LTSC's work. Their metadata standard, currently at version 3.8, is very close to being finalized. In addition to providing general fields, it also includes some that supposedly facilitate the instructional use of the object of the metadata.

  159. Let's start by making slashdot searchable! by Anonymous Coward · · Score: 0

    The nifty-dandy slashdot search engine only works for *non-archived* posts --- that means past two weeks *only*! The rest are rendered as static HTML, but you'd lose all the special slashdot searching features that way (search in story, filter by topic, etc) even if a major search engine is indexing them -- which I can't find any evidence to support. So, techno-idealists, let's start *here*. How should *slashdot* be indexed?

  160. Just add "and not sex". This excludes porn sites! by Anonymous Coward · · Score: 0

    It works stunningly well!

  161. true content vs. commercialism by The+Queen · · Score: 1

    Another thing the search engines could do is figure out how to ignore "trolling" pages. i.e. those which are nothing but index spam, a catchy title, and a refresh tag to ship your browser off to fetch their actual main page
    Right on! Nothing burns me more than to see the 'enter here' page when I go to a site. Some flash or other animation or some huge graphic that loads up, and then either sits there, forcing you to 'click to enter' or redirects you to the rest of the site, which is where I wanted to be in the first place...what purpose does that serve? Oh, sorry, it impressed the client you built the page for. (I will admit I have done this once or twice, but not until after trying to talk them out of it.)

    As far as searches, I agree that we need a new standard, one that is not only intelligent and dynamic, but that can outwit those who try to trick it. I believe that's quite a way off... until then I'll keep reading /. and Search Engine Watch.

    The Divine Creatrix in a Mortal Shell that stays Crunchy in Milk

    --

    The House Between - Original Sci-Fi Series
  162. Is the Internet Becoming = WWW? by kerberos · · Score: 1

    Why ask if the Internet is becoming unsearchable and then only talk about World Wide Web issues?

    The Internet actually consists of many things, of which WWW is only a part, but I'm beginning to realise that more and more people, in particular those who are new to "The Net", have a hard time understanding that.

  163. Searching by Anonymous Coward · · Score: 0

    1. Have a second domain name system based on topic rather than location, the way things are organized on usenet. E.g. sci.astro.cosmology.inflation

    2. Create a legal equivalent of a self-reproducing worm, which requires the cooperation of a site in order to gain access. Give each copy a reasonably short lifetime: say, 15 minutes.

    3. Have a number of well-known, moderated, top-level sites for various topics, with links to other sites dealing with the same topics.

    I agree that the current system needs work; I once searched for "Andromeda galaxy" and "radial velocity" and of the sites that the search turned up, one was a lesbian site and another was a neonazi site.

  164. the end of keyword searching by eries · · Score: 1

    I work for a company whose primary product is a search engine. However, in our case we allow searching not of web pages but of student "profiles" (which are basically like super resumes). We made a conscious design decision at day one to not even allow keyword searching because of its incredibly imprecise nature. Rather, we collect a lot of meta-information about our clients and then use sophisticated sorting techniques to do high-precision searching. If anyone's interested, please feel free to email/post a comment. I'd be curious to know if anyone has suggestions about how this should/could work better.

    Thanks

    Eric

  165. Untangling the Massive Confusion Over XML by Baldrson · · Score: 1
    XML is really an alphabet with a virtually infinite number of named types of parentheses, with a syntactic constraint that parentheses must be matched.

    To say "We've written the system in XML." is about the equivalent of saying "We've written the system in ASCII and matched the parentheses."

    In the vast majority of applications, when people say "XML" they really mean something like RDF, BRML, RELML, etc.

    The best use of XML currently is to simply dump existing relational databases to the web and index them with XML-oriented search engines like XSearch for things like RELML.

    One of the pitfalls of "XML-oriented search engines" is that they fail to provide basic query capabilities such as numeric comparison on the indexed fields. This is really unnecessary since all they need to do is put the XML data back into a relational database on their end and index appropriately on the numeric fields. If they don't like the schema checking, they can always use LDAP and turn off schema checking.

    An example of good use of XML via RELML is at www.nmre.net. Check it out.

    Beyond this simple "dump the legacy rows to web pages" approach to E-commerce searching, there are the inferential systems that are more or less the equivalent of inferential databases. In these schemes, rather than storing literal values for the database rows in XML fields, a set of rules of derivation is included along with the XML data fields, and the actual index values that are not explicitly specified are derived prior to indexing. One might think of these as method-derived attributes as opposed to stored attributes. This is the direction Guha et al. were trying to take things with RDF, but IMHO they failed to find the "sweet spot" of simplicity and power required for a new standard.

  166. Separation of commercial and non-commercial by jfunk · · Score: 2
    I'm involved in Dizz-Net and I was thinking of a way to separate purely commercial websites from websites that may or may not be somewhat commercial, yet contain useful information (Slashdot is commercial, yet contains useful information).

    Example: Suppose you're looking for information on a Zip drive. You already have the drive but are having trouble with it (problems with zip drives? really? :-)* ). You do a search and you get a million sites that "guarantee lowest prices!" You could go to the manufacturer's site but they may downplay and misinform about *ahem* "certain problems" (1st zip drive gets click death. Get replacement. Feel ripped off because replacement is refurb. Vow to never buy anything Iomega ever again when 2nd drive starts to click. Use harddrive to backup zip drive. Hope that it falls into some black hole in the bottom of closet).

    Of course I don't even bother, I just go straight for the LDP, but Windows users don't have that option.

    It would be interesting to be able to search only engineering sites for engineering information. I once did a search for "wheatstone bridge" and got tons of $cientology links. If the engine was able to determine if a site was, in fact, an engineering information site, that wouldn't have happened.

    How about a "no pr0n" checkbox. That would be sweet.

    Of course that would require a herculean effort in changing the standards and getting site owners to be honest.

    But maybe not. Here are some ideas I was thinking of:

    • How about manually specifying the categories for some major sites (say, a few bona fide Linux sites) and letting the search engine crawl the links from them, categorising the sites it reaches as Linux sites too. If you can get to a site within a reasonable number of links from a bona fide Linux site, it will also be a Linux site. It may also be a gardening site, and can be reached from gardening sites as well. Tada, two categories. etc.
    • Search result moderation. Registered users of the search engine can moderate search results. If you're searching recipe sites for a recipe (using that category thing above, remember?) and a pr0n site comes up, you can select "irrelevant, it's a porn site" and press the moderate button at the bottom.
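
    The first idea, spreading categories outward from hand-labelled sites, could look roughly like this in Python. It is a sketch under simple assumptions: the link graph is already known, "a reasonable number of links" is a fixed max_hops, and the site names are invented.

```python
from collections import deque


def propagate(links, seeds, max_hops=2):
    """Breadth-first spread of categories from manually labelled seed sites.

    links: {site: [sites it links to]}
    seeds: {site: category}
    Returns {site: set of categories}.
    """
    categories = {}
    for seed, category in seeds.items():
        queue = deque([(seed, 0)])
        seen = {seed}
        while queue:
            site, hops = queue.popleft()
            categories.setdefault(site, set()).add(category)
            if hops == max_hops:
                continue  # too far from the seed to keep following links
            for nxt in links.get(site, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return categories


links = {
    "linux.example": ["kernel.example"],
    "kernel.example": ["tomatoes.example"],
    "garden.example": ["tomatoes.example"],
    "tomatoes.example": ["far.example"],
}
seeds = {"linux.example": "linux", "garden.example": "gardening"}
cats = propagate(links, seeds)
print(sorted(cats["tomatoes.example"]))
```

    A site reachable from both seeds ends up in both categories, which is the "Tada, two categories" case. Moderation would still be needed to prune sites that get linked to for the wrong reasons.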

    It's not perfect, but it's gotta be better than the garbage we put up with now.
    1. Re:Separation of commercial and non-commercial by steffl · · Score: 1

      "I once did a search for "wheatstone bridge" and got tons of $cientology links."

      you can avoid a lot of 'wrong' sites in search results by using - (like "wheatstone bridge" -scientology). I am not sure how many search engines support it but altavista definitely does.

      "How about a "no pr0n" checkbox. That would be sweet."

      I bet "free pr0n" would be much more popular:-))

      erik

      --
      ...all excited, don't know why...
  167. $0.02 from a restricted access site's Net Druid... by rdewalt · · Score: 1

    The sites I maintain and append to in my day job are all "must register/login before proceeding" sites. (All of them are training/web based education sites, targeted audience specific) Each of these sites has its robots.txt disallowing anything but the root page to be spidered across. In the root page I have stuck the primary "Mission Statement" (sans phb speak) in the keywords meta tag. In my case, I have no intention to let a search engine wander around the site, it would get lost in all the forms and frames. (no content flames please, I merely build what they pay me to build.)

    I often do quite a bit of searching, often for external reference verification. I never run a singular search, but run over several engines. What I would recommend is a way of grepping multiple sites (hotbot, altavista, yahoo... ) and presenting a scored list of which sites appeared in the first $X returns. I won't/can't limit myself to searching in just one location. I don't have 'site loyalty' to any one place. (I have not extensively used Google, so I am not sure if their methodology is just this.) "Specialization breeds disaster"
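
    A small Python sketch of that scored list: count how many engines return each URL within their first top_n results. The result lists are hand-written placeholders; actually fetching results from hotbot, altavista or yahoo is left out, and the function name metasearch is made up for this example.

```python
from collections import Counter


def metasearch(results_by_engine, top_n=10):
    """Score each URL by the number of engines listing it in their first top_n hits."""
    votes = Counter()
    for results in results_by_engine.values():
        # dict.fromkeys drops duplicates within one engine's list, keeping order.
        for url in dict.fromkeys(results[:top_n]):
            votes[url] += 1
    # Highest score first; ties broken alphabetically for a stable listing.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


results = {
    "hotbot": ["a.example", "b.example", "c.example"],
    "altavista": ["b.example", "a.example", "d.example"],
    "yahoo": ["b.example", "e.example"],
}
scored = metasearch(results, top_n=2)
print(scored)
```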

    'Spamdexing' (the term has been heard plenty before, so I doubt it needs redefining) is, in my "Righteous and Elitistic" opinion (I was called just that today, which may make this post a borderline rant), one of the three "Punishable by death" crimes on the internet. (Spamming and Domain Squatting are the other two.)

    But then, I believe there should be a competency test before being allowed to use "The Internet". I remember gopher. I know there is more to "The Internet" than "The Web". I'm sick and tired of people who think that -just because- Geocities will give them free space, they -MUST- make a page. I'm tired of "target spam" that 'guarantees' to get me in the top 30 of a chosen search engine for $149.95, just because I have a .com domain.

    So at the end of a day, what do I know? Apparently nothing to my PHB's who want me to use "onUnload" to "make sure" that people don't accidentally leave the site... Next they'll ask me to alter back and forward histories and flood screens with popups for the other training sites.

  168. Is this proven? by Anonymous Coward · · Score: 0

    Are you sure? I don't find this to be the case at all with any of the searches I've conducted including those from open directory. And, I consider myself to be a professional power searcher. My biggest complaint about open directory is that many of the links are bad. I've had more success with askjeeves.com.

  169. Why databases? by Anonymous Coward · · Score: 0

    I think many people are missing the point of why databases are used by sites in the first place--to keep webcrawler/indexers out. Most sites have the robots.txt file to exclude these intruders. The webcrawler developers will have to find other means of indexing this information if possible. The direct approach isn't working nor will you convince most sites that restrict robots to let you in. Some will even use tactics like scrambling the URL of your search based on your cookie session, so if you bookmark a hit from your search, you won't be able to get that same hit from your bookmark in your next session. Devious, isn't it?

  170. How about Universal Decimal Classification? by jukervin · · Score: 1

    I've been wondering why none of the library classification systems have emerged on the net. Back in the good old days when I relied on the library for information, the Universal Decimal Classification system was extremely handy. Even if you didn't know the name of the book you could browse through a certain category that interested you.

    The idea is that a book can belong to a single class that is marked by a decimal schema. Top categories are:
    0 Generalities. Information. Organization.
    1 Philosophy. Psychology.
    2 Religion. Theology.
    3 Social Sciences. Economics. Law.Government. Education.
    4 (vacant)
    5 Mathematics and Natural Sciences.
    6 Applied Sciences. Technology. Medicine.
    7 The Arts. Recreation. Entertainment. Sport.
    8 Language. Linguistics. Literature.
    9 Geography. Biography. History.

    The main categories are defined further down:
    61 Medical Sciences. Health.
    62 Engineering and Technology Generally.
    63 Agriculture, Forestry, Stockbreeding,Fisheries.
    64 Domestic Science; Household Economics.
    .....

    and further and further:
    631 AGRICULTURE
    631.1 Farm Management
    631.15 Planning

    The classification would be used like the KEYWORDS meta tag in HTML and search engines would index it. This would enable users to specify a word as well as the topic they are looking for information on.

    To prevent the misuse of the classification, only one or two classes should be allowed per page. Like
    "Marketing of agricultural products" -> 380.13:631
    (38 = Trade. Commerce. Communication. Transport.)
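
    As a sketch of what the "wizard" might spit out, the Python below checks a code against the one-or-two-classes rule and emits a meta tag. The tag name "udc" is invented here, and the validation is deliberately simplified: it accepts only dotted digit groups joined by ":", while real UDC notation has more symbols than that.

```python
import re
from html import escape

# Simplified pattern: digit groups separated by dots, e.g. "631.15".
UDC_PART = re.compile(r"^\d+(\.\d+)*$")


def udc_meta_tag(code, max_classes=2):
    """Build a hypothetical <meta name="udc"> tag for a code like '380.13:631'."""
    parts = code.split(":")
    if len(parts) > max_classes:
        raise ValueError("at most %d classes allowed per page" % max_classes)
    for part in parts:
        if not UDC_PART.match(part):
            raise ValueError("not a simple UDC class: %r" % part)
    return '<meta name="udc" content="%s">' % escape(code)


print(udc_meta_tag("380.13:631"))
```

    Capping the number of classes in the tool is the cheap part; nothing stops a page author from writing a misleading tag by hand, so engines would still have to police it.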

    UDC is language independent and it has already been translated into numerous languages. Also most libraries use some kind of numerical classification, so many people are familiar with the concept.

    To help page authors to classify their pages a special website could be created. It should contain at least

    • Information about UDC and why it should be used
    • Complete browsable UDC listing in various languages
    • Easy to use "wizard" that guides you thru the classification and spits out the correct HTML-tag.
    • UDC aware search engine
    • Petition list for other search engines to enable UDC classification

    How about it? Is it a good idea?

    One major problem in the matter is that the UDC classification is copyrighted. I couldn't find more than a skeleton listing on the web! So the first step would be to negotiate a licence for it or for the competing Dewey Decimal Classification. I don't think it would be wise to start building our own scheme without negotiations since both UDC and DDC are in extensive use. But if everything else fails, Gnu Decimal Classification to the rescue!

    For more information about classification on the internet, see: The role of classification schemes in Internet resource description and discovery

    1. Re:How about Universal Decimal Classification? by otisg · · Score: 1

      The problem with this is spreading the word and making all web page creators adhere to this. A large percentage of web pages have no META tags and they've been in existence for years. A laaaarge percentage of web sites don't have robots.txt (not because they don't need them but rather because their webmasters don't know about robots.txt), etc. If you could make people (and companies selling web page design tools like HomeSite and Dreamweaver, etc.) agree that this is a good idea and then use it, this would be nice, but people can't agree on much simpler things, so...

      --
      Simpy
  171. Patentable Re:Why not use ScriptAlias by DeadSea · · Score: 2
    How clever. When somebody on that site requests robots.txt, it adds the IP to a robots list and the next time that you get the home page, it returns it to you without a session ID. In robots.txt there is a link that you can go to to remove yourself from this list.



    I think that would qualify for a patent. Go for it. It's a great idea.



    I just made the entire site unusable by my entire company by viewing the robots.txt. How proxy server friendly. I hope nobody tries to look at the robots.txt file through an AOL connection.

  172. Re:ColdFusion bites - but get yer facts right by Anonymous Coward · · Score: 0

    ColdFusion may have some shortcomings, but lack of a switch equivalent isn't one of them.

    The main problem with CF is that it is targeted at Programming for Dummies^H^H^H^H^H^H^HWeb Designers, but that's also the biggest strength.

    rodgerd
  173. Usability Failure: Domain names and URLs by Ixany · · Score: 1

    As I see it, the only way to really make a planet full of data accessible to everyone sensibly is to take a step back, take a long deep breath, and take a closer look at our wobbling usability practices.

    We've abandoned some of the most important elements of our user interface in our rush to splatter the world wide web with our content: The domain name, and the sensible URL

    Domain names -- unlike hierarchically organized things (Usenet is a good example) -- no longer really mean anything at all. They've been smashed flat into just a few hierarchies, largely so NSI could maintain fascist control of the few TLDs. It reminds me of the MS-DOS days, when everything wanted to install itself as C:\SOMETHING. Companies rush to register their word or words in several TLDs, fearful that their competitors may soon take away their opportunity to hold even a teeny slice of the narrow internet domain pie. Domains don't mean anything anymore. Not in terms of content anyway -- they simply don't help us find what we're looking for. Doesn't it seem obvious, or at least worthwhile, that our position, or "place" in the big information avalanche that is the internet should at least be related somehow to our content? Don't you wish people could find you directly that way?

    URLs have a place in this big scheme, too -- as a continuation of this structure. That is, URLs should represent information structure on a site in a manner simple enough not only for a person who is browsing to know and understand where they are and what they're doing there, but also to actually use as a user interface. When did we forget this crucial and human factor?

    The more we tailor our information to only be useful by machines, the more we ruin our ability as humans to traverse the internet in a way that's sensible and seems natural to us.

  174. Well congratulations by Anonymous Coward · · Score: 0
  175. An internet conspiracy? by Anonymous Coward · · Score: 0

    This thread kinda hit me... I've been thinking along the same lines for a long time. When the internet started, I used to be able to go onto the internet and do a search and get quite good results. Often a lot of information, but within it, there were always useful links. Then you started to hear on the news, and in advertisements, about people suggesting "better" search engines. More accurate, giving only precise information... these projects got boosted with billion dollar support, as the internet was the new medium of the millennium... and every IT company with some "new" search type came up with a record high index on the nasdaq or whatever :-)

    But me, the average user... can't for the life of me see how these business deals come about. Because "I" don't get better results, I get worse results. Currently, when I do a free search... I don't get any useful information at all. Most of the links I get from a free search are links to emails and discussion forums. And going through the structural index that these newly developed search engines have reveals nothing more than advertisements from companies that have zero to do with the actual search being made. Despite the money being poured into it, and the "promise" of better results, the user gets fewer results, and companies apparently get a better stronghold on monopolizing what is and is not allowed to be on the internet.

    A person can no longer make a "link" to someone's material... imagine the era where people will be arrested and fined for talking about some material... because someone has a copyright on it. Or the time when inventors are arrested for trying to "improve" a concept or invention, because they are illegally utilizing methods on copyrighted material. Since all material is copyrighted and patented, who's going to "legalize" what is being taught in schools, and who is going to ensure that this material is correct and accurate? The students themselves will be arrested if they try to discover it on their own, since after all they might be breaking a patent or a copyright by poking into it :-) I find it amazing how far people can actually go in robbing the people. Man isn't intelligent, man is a monkey... and not even a very smart monkey.

  176. Meta-Searching content providers by luserSPAZ · · Score: 1
    As much as I like Google, I tend to stay with Altavista, because I can easily exclude sex/porno terms from any search, using the minus sign operator. I have to do this with almost all web searches I do.

    Well, I think you should give Google some credit for its page-ranking system. A porn page is unlikely to come up at the top of your search for "3d game quake" just because it has those words (and every other one in the dictionary) in its META tag, unless a lot of other pages containing those terms link back to it, which would tend to make it a reliable site for that kind of information.

    I think Google has the right idea. Because of the proliferation of Porn/Business sites that will stop at nothing to get visitors, you can't quite trust a site to represent itself correctly for searching. Your best hope may be to keyword search, and then do a "background check" on the site to see if it really does provide that content. Maybe some sort of Meta-search which knows popular sites for different categories, and asks them to search for the results? i.e. you search for "linux program mp3" and the search engine knows that freshmeat.net knows a lot about "linux program" and mp3.com knows a lot about "mp3", so it asks those sites for search results, and displays those. It would put a lot more focus on providers of content anyway.
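
    A rough sketch of that meta-search routing, in Python. The table of which site "knows" which terms is made up for illustration; a real service would have to learn it somehow, and would then forward the query to each site's own search instead of just printing the routes.

```python
# Hypothetical table: which specialist site is trusted for which terms.
SPECIALISTS = {
    "freshmeat.net": {"linux", "program"},
    "mp3.com": {"mp3"},
}

def route_query(query, specialists=SPECIALISTS):
    """Return {site: [query terms that site should be asked about]}."""
    terms = query.lower().split()
    routes = {}
    for site, topics in specialists.items():
        matched = [t for t in terms if t in topics]
        if matched:
            routes[site] = matched
    return routes

print(route_query("linux program mp3"))
```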

    Just my ideas,
    -Ted

  177. Hyper link Please.... by Anonymous Coward · · Score: 0

    Would you mind hyper linking that for us couch potatoes... thanks. www.npsis.net

  178. Export your database records by shermozle · · Score: 1

    This is a problem I came up against a few years ago and tried to solve at the time. I never got far enough down the path to release a standard or anything, but I'll explain the thinking.

    Basically the problem is that search robots can get stuck in loops on your site if it's a database-driven one. Equally if the database contains something like stock quotes or postcodes it would just succeed in filling the engines with contextless gibberish.

    So instead, the plan was to get people to manually export their database into a flat text file referenced from the robots.txt file. The text file would in some way have a data field and an address field. So the data field has the content itself as plain text and the address tells the search engines where they should send people, rather than referring them to that text file.
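
    To make the export idea concrete, here is a minimal Python sketch. The tab-separated layout (address first, then the data) is an assumption, since no format was ever settled on, and the example.com URLs are placeholders. How robots.txt would point at the file is left out, because robots.txt has no such directive.

```python
def export_records(records):
    """Serialise (address, data) pairs, one record per line, tab-separated."""
    lines = []
    for address, data in records:
        # Collapse whitespace so each record stays on a single line.
        clean = " ".join(data.split())
        lines.append(f"{address}\t{clean}")
    return "\n".join(lines) + "\n"

def parse_export(text):
    """What an indexing robot would do: recover (address, data) pairs."""
    return [tuple(line.split("\t", 1)) for line in text.splitlines() if line]

records = [
    ("http://example.com/item?id=1", "Blue widget,\n  small"),
    ("http://example.com/item?id=2", "Red widget, large"),
]
dump = export_records(records)
print(dump, end="")
```

    The robot indexes the data field but sends searchers to the address field, so the flat file itself never shows up as a result.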

    Now there's another big problem that the author hasn't mentioned. How do you find real-time information?

    Here's the scenario: You hear from a friend that a school in Dublin, Ireland has just been closed down due to sexual harassment by the principal. Since this is your field, you want to find out more. The major sources (CNN, BBC, ABC, etc.) aren't covering it. You know that some local news site will cover it, though. So how do you find it?

    Right now, the only way would be to find the Brand in a subject index like Yahoo, then hope they cover it. Looking for Dublin Time might work. But why can't you search for sexual harassment school dublin?

    The answer lies in a real-time database of news, requiring the news services to either update a file with all the news in it or perhaps in some way push information into the search engines.

    One approach to this problem is Moreover who index news themselves without the benefit of metadata. These guys are very clued in about metadata though.

  179. new HREF attribute(s) needed by peterw · · Score: 1
    "robots.txt" is terrible; not only does it not support regular expressions, it doesn't even glob well.

    Instead of asking robots to parse based on URL, we should have a new attribute for <A> to indicate that the link could/should be followed. At the simplest level, this could look like INDEX="yes", but this could be extended in various ways, e.g. telling the spider if it needs to accept/send cookies, indicating a range of hours (in GMT) that the spider should restrict its queries to, etc.
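
    A spider honouring such an attribute could look roughly like this in Python. INDEX is only the proposal above, not part of HTML, and the class name and sample page are made up for the sketch.

```python
from html.parser import HTMLParser

class OptInLinkParser(HTMLParser):
    """Collect hrefs only from <a> tags carrying the proposed index="yes"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)  # HTMLParser lowercases attribute names
        href = attrs.get("href")
        if href and (attrs.get("index") or "").lower() == "yes":
            self.links.append(href)

page = '''
<a href="/catalog?id=7" INDEX="yes">Widget</a>
<a href="/cart?add=7">Add to cart</a>
<a href="/search?q=a" index="no">Search</a>
'''
parser = OptInLinkParser()
parser.feed(page)
print(parser.links)
```

    Only the first link is collected; links without the attribute are treated as "do not follow", which is the opt-in reading of the proposal.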

  180. A Bigger Problem by Anonymous Coward · · Score: 0

    I think a bigger problem for search engines these days is the number of sites that actively try to spam the search engines (e.g. porn sites). A friend at Inktomi says this is a huge problem.

  181. human search engines seems interesting solution by nutsaq · · Score: 1

    incredibly inefficient, somewhat annoying, but it seems to work well...
    webhelp.com
    -nutsaq

  182. MARC and APA Style by ksuhr · · Score: 1

    I wonder if anyone has considered making an XML version of the MARC system that libraries use. Most library catalogs will let you search for items by author, title, publisher, standard number (ISBN and so forth), date, type of material (book, CD, video, etc.) and various other parameters, driven by the descriptive capabilities of MARC tags.


    Also, has anyone gone to including URLS & search strategies in works cited for papers and such? Will this become necessary?

  183. What Problem? by Anonymous Coward · · Score: 0

    If someone wants to create a site that isn't searchable, then that's pretty much their problem. I have very little difficulty finding what I'm looking for using any of several search engines. But I'll admit that I may just not be searching for the "wrong" things. The real problem is spam-dexers: those jerks that put every word in the dictionary in their metatags in order to get you to their site, which has little or nothing to do with what you are looking for, and consists of nothing but banner ads (if they don't spawn a bunch of windows for you). It's more than a little annoying having to skip the first 5-20 items in a search to get to the real meat.

  184. Searchability should be the site's responsibility by Kenbo · · Score: 1

    Given the chaos that is the net, it is going to be tough for the search engine creators/programmers to deal with all the badly coded dynamic pages and properly index them.


    Why not have a standard (something like meta tags to enhance the search hits of your page content) to identify keywords for the page? Why not have a standard (say a robots.txt file, http://info.webcrawler.com/mak/projects/robots/robots.html) to indicate which parts of your site not to index? Put these and any other hints/standards in a public place and make it widely known that if you want the traffic a search engine can generate for your site, adhere to these guidelines.

  185. search engines, ? in URL, ISAPI filter solution by AHREF · · Score: 1
    wergild asks: "With more and more sites going to a database driven design, and most search engines not indexing anything that contains a query string in it, we're missing alot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this?

    Warning, this is an informational, on-topic, product mention.

    HREF Tools Corp. came out with an ISAPI filter to do something about this, yes. We called it the Coolness Layer because it makes a dynamic, even database-driven, site use "cool" URLs that omit the path, program name and ?. The filter redirects the incoming URL according to some rules and your application can keep working normally. Of course, it helps if you update your application to create "cool" URLs so that the created links maintain the illusion. We have the defaults all worked out so that this is easy for WebHub programmers (WebHub is our core technology), but it applies to anyone running IIS. And the idea applies to any web server.

    More info: http://www.href.com/coolness

    Version 2.1 supports multiple domains on the same machine, each with their own redirection rules.

    Enjoy.

  186. It is a matter of searching for the right thing by arikb · · Score: 1
    What the question asks is, actually, what if I want to look for words from the context of my search subject, and from there go up to the entire subject. For instance, I might be looking for songs by "Eagles", and specify "Welcome to the hotel California" as my search string. Database driven lyrics server would not reveal this information to the random robot.

    But, on the other hand, if I look for a subject using words that describe the subject (for instance "lyrics", "song", "band"), I would find the content search engine itself rather than the song, because the search engine should (and will) contain such words in its static parts.

    So IMHO there are two complementing and distinct solutions to the problem presented:

    1. Making sure our database driven content search engine describes itself to a good extent and with sufficient keywords for the indexing search engine spider to index efficiently for a typical query, and
    2. exporting an index of keywords in an HTML which redirects to the search engine, letting the spider pick it up and index it to its liking. Many xxx sites do that to increase their popularity.

    The first solution is obvious and should be widely in use today. The second solution puts the load on the spider, and seems "un-nettic" (ethic).


    Just my 2e-2$.

  187. A possible solution? by cr0sh · · Score: 1

    I don't know if this is an optimum solution, but here goes...

    I don't think we need search engines, but rather "search sites" - in many ways they would work like a search engine, however, they would lack the one property of a search engine that clearly cannot keep up with the web - the spider.

    The solution to the problem: Rather than having a spider go out and crawl the web to build the database, those sites wishing to be represented in the database should submit their site to the database. How would this work? Well...

    1. The site would submit their URL or "root" directory (in the case of personal sites) to the database for inclusion.

    2. Each site would only be allowed one URL/directory in the database. At that root level, would be the "index.html" page that should have its own search engine or links to the various parts of the site. It should have META tag info for the search site to use to generate descriptions for the page. Maybe this might be controlled with a "free" membership type thing, so that owners could change info about the submission.

    3. The site owner would have to categorize the entry himself - in other words, it would be the responsibility of the site owner to properly locate the link in its proper hierarchical context in the search database.

    4. Each link would be given ratings points - which users (maybe registered as members as well?) can use to "moderate" the site - so that sites that are in the proper spot and present good information get moderated up (to appear higher in the search results), while those in the wrong area, or those that have bad content (purposefully misplaced adult sites, commercial sites that have no good content) would get moderated down.

    5. Those sites with a consistent moderation rating less than 0 would, after a period of 30 days, be deleted from the database (maybe with an email to the site owner, so that he is warned).

    6. Searching could be done via a normal keyword interface or a hierarchical "click-n-choose®" interface (like Yahoo uses). Results and ordering for either method come from the moderation points each site has (so that top sites filter to the top).

    7. Use of a natural language interface for the searching would make searching optimal, but depending on available technology, may or may not be needed.
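
    Points 4 to 6 could be sketched like this in Python. The record layout (a rating plus the date it first dropped below 0) and the sample sites are assumptions made up for the example; the 30-day grace period is the one from point 5.

```python
from datetime import date, timedelta

def prune(sites, today, grace=timedelta(days=30)):
    """Drop sites whose rating has been below 0 for the whole grace period."""
    kept = []
    for site in sites:
        since = site["below_zero_since"]  # None while the rating is >= 0
        if since is not None and today - since >= grace:
            continue
        kept.append(site)
    return kept

def rank(sites):
    """Order results so the best-moderated sites come first."""
    return sorted(sites, key=lambda s: s["rating"], reverse=True)

today = date(2000, 1, 31)
sites = [
    {"url": "good.example", "rating": 5, "below_zero_since": None},
    {"url": "spam.example", "rating": -3, "below_zero_since": date(1999, 12, 1)},
    {"url": "new.example", "rating": -1, "below_zero_since": date(2000, 1, 20)},
    {"url": "ok.example", "rating": 2, "below_zero_since": None},
]
for site in rank(prune(sites, today)):
    print(site["url"], site["rating"])
```

    Here spam.example has been negative for more than 30 days and is dropped, while new.example is still inside its grace period and simply sorts last.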

    I am sure I missed a few things here - please add on to the idea if you can. I think such a site could be run like /. is right now - with users/moderators and posters (of sites). I think such a thing could work - and would allow for a more complete indexing of the web (and allow sites that wish to be anonymous to stay somewhat anonymous), while giving a high QOS (due to the moderation - so you won't see a bunch of crap adverts, wrong info, or dead links).

    Does this sound reasonable?

    --
    Reason is the Path to God - Anon
  188. Searchability should be the site's responsibility by Kenbo · · Score: 1

    Given the chaos that is the net, it is going to be tough for the search engine creators/programmers to deal with all the badly coded dynamic pages and properly index them.


    Why not have a standard (something like using meta tags) to increase the relevance of keyword searches on your site. Also, why not have a standard (say a robots.txt file http://info.webcrawler.com/mak/projects/robots/robots.html) to indicate which parts of your site to not index. Put these and any other hints/standards in a public place and make it widely known that if you want the traffic a search engine can generate for your site, adhere to these guidelines.


    I would hope that people that would take the time to build a database backed dynamic web site would do the small amount of extra work to make sure that people could actually find the information through a search engine.

  189. rank links according to use by other searchers by decomp · · Score: 1
    One thing that might possibly improve search engines is a new layer of link-relevance ranking. Most search engines rank links based on how many of your keywords appear in the text associated with that link. Google has added a nice new wrinkle with their "how many sites link to this site" ranking. It seems that another layer of ranking would be possible and useful, one based on the behavior of other link searchers: how many times a given link was chosen by someone else who looked for the same (or similar) keywords as you did. This won't work, of course, if most search queries are unique, but maybe most aren't, I don't really know. Regardless, it sure is amusing watching what people are searching for at search-voyeur sites.

    An aside about the changing nature of the web-wandering public:

    Wow, I've just come back from checking out the unfiltered (i.e. allows porn-associated searches to appear) metaspy voyeur site. Folks, I think the internet public may be changing. When I first checked this out for several weeks ~1 yr ago, most of the searches were porn related. This time, out of ~100 search queries, I saw only a few sex-related ones. Are things changing? That would be nice.


    ______________________(
    // ///#\)

    1. Re:rank links according to use by other searchers by crashdavis · · Score: 1

      I went out and watched it for a while (the unfiltered one, of course ;-), and I agree that the porn-oriented ones were only about 20% of the searches. The funny-in-a-sad-way thing was that about 10 or 15% of the searches were MISSPELLED! It's like no matter how freaking easy we (geeks) make it for people to find stuff on this 8 Billion Terabyte database we call the Internet, people will still find a way to NOT be able to do it. This whole discussion on search engines is going to be irrelevant if the planet continues its slide towards idiocy and illiteracy...

      --
      "The difference between theory and practice is small in theory and large in practice..."
    2. Re:rank links according to use by other searchers by haral · · Score: 1


      Ranking by how many users clicked on the link is meaningless... I usually try a lot of links, because I cannot decide just by the link if this is the correct site or not.

    3. Re:rank links according to use by other searchers by decomp · · Score: 1
      Meaningless? Are you sure? That's a pretty strong blanket statement. The way I see it is that the run-of-the-mill search engine (i.e., not google) returns 100s if not Ks of links to most of our queries. You say that you "usually try a lot of links." Yeah, I understand, I do too. But I don't try 100s, and I suspect that you don't either. Let's say you try 30 out of 100; that's a lot, but even so, you've winnowed out a huge amount of trash. Right away I see this as way more useful than simple keyword counting. Add to that the impact of other searchers coming along and further narrowing down among your searches, and we might be getting somewhere.

      A slightly more invasive approach could yield a lot, I suspect:

      1. user searches for keywords
      2. lots of links are returned
      3. user checks out 10% of the links (thereby registering votes for those links)
      4. when user clicks on a link, a small window pops up, created by the search site, and sits on top of the destination site. The window has a set of two radio buttons allowing the user to vote: relevant or not-relevant. These votes would count a lot more than the mere clicked-on votes.

      Sure, pop-up windows and voting are a pain in the ass. This approach would have to be done on an opt-in basis. But I suspect that the benefit of increasing the signal/noise ratio by orders of magnitude would make it seem worthwhile.
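
      The scoring behind steps 3 and 4 might look like this in Python. It is only a sketch: the weights are arbitrary assumptions (an explicit vote counting five times a click), and the class, query and site names are made up.

```python
from collections import defaultdict

CLICK_WEIGHT = 1   # implicit vote: the user followed the link
VOTE_WEIGHT = 5    # explicit vote; the factor 5 is an arbitrary assumption

class FeedbackRanker:
    def __init__(self):
        # scores[query][url] -> accumulated score
        self.scores = defaultdict(lambda: defaultdict(int))

    def click(self, query, url):
        self.scores[query][url] += CLICK_WEIGHT

    def vote(self, query, url, relevant):
        self.scores[query][url] += VOTE_WEIGHT if relevant else -VOTE_WEIGHT

    def rerank(self, query, urls):
        """Stable sort: keyword order is kept among equally scored links."""
        q = self.scores[query]
        return sorted(urls, key=lambda u: -q[u])

ranker = FeedbackRanker()
ranker.click("quake maps", "b.example")
ranker.click("quake maps", "c.example")
ranker.vote("quake maps", "c.example", relevant=True)
ranker.click("quake maps", "a.example")
ranker.vote("quake maps", "a.example", relevant=False)
print(ranker.rerank("quake maps", ["a.example", "b.example", "c.example"]))
```

      Scores are kept per query string, so this only helps when other people have searched for the same keywords, which is exactly the open question above.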

      So, I have to disagree with you that this approach is "meaningless." In fact, it might be one of the easier ways to add meaning without a major restructuring of the searching process.


      ______________________(
      // ///#\)

  190. Reintermediation by Phrogz · · Score: 1
    The Mining Co. (apparently now called "About.com") is the wave of the future, IMO--using intelligent people to try and distill down the data.

    I find it telling how recently this post and the one on saving /. from Natalie Portman/Trolls have come up together--both are the result of Too Much Information needing to be distilled down. /. turned to moderators. About.com is all about that, too.

    I first learned the term "reintermediation" from an article by Nicholas Negroponte in WIRED magazine. In another article by him (which I can't find right now) he says that while some people believe that librarians will be out of jobs, he predicts that there will be a new form of librarian. The old librarian could help you find what book you were looking for in the library. The new librarian will help you find the content you need from the 'net.

    As a personal aside, I'm shocked and dismayed that search engines don't index database queries--I *just* finished rolling out my new personal web site which is totally database-driven, and thoroughly meant to be indexed. Now you're telling me that because all my URLs look like this "content.asp?nodeid=dejavu" that search engines won't find all the delicious content I'm creating? Botheration!

    I suppose *a* workaround here is to create an application which traverses the site and builds some mirrored hierarchy of it in static pages for the search engine to index, which uses JavaScript to bounce the user to the *right* page once they get there.

    *sigh*

  191. Altavista results are manipulated.. by smash_phase · · Score: 1

    Is it me, or are Altavista's results deliberately modified?
    I just noticed some changes in behaviour recently; it seems like some content is filtered out.. I can't really explain, I notice it with queries for various subjects...
    From guru techtalk to pr0n.. :)
    Maybe one of the indications is the "related searches" item.. Though this must be made up of frequently searched items, using those searches quite often gives few results..
    I take it the most-searched-for items are better hunted for by the spiders, or...?
    Well, it seems like search engines in general are degrading... Where did the times go when you didn't have to scroll through a search engine page looking for that much too tiny white box (not much bigger than the average picture for its commercials), with not enough room to type in your query without scrolling?

    Just my 2 eurocents...

    --
    /* Be the change you wish to see in this world - Mohandas Karamchand "Mahatma" Gandhi */
  192. Database/dynamic isn't problem, FORMS are. by xanth · · Score: 1

    The problem is NOT with database/PHP/Zope/etc... driven sites. They pose no problem to decent crawlers, since they behave like standard pages. Any dynamic html generation is done at the server, and is invisible to the client:

    You GET a url, and back comes HTML.

    The problem is with forms and javascript, which happen to co-occur fairly often, for obvious reasons, with database driven sites. As long as URLs (it doesn't matter if they end in .php or whatever) are provided as links, the crawler will have no problem in traversing them. But what about the target of a FORM 'action'?

    How can a crawler deal with forms and javascript? Javascript may not always be so bad, if the crawler can execute the javascript. But how does a bot fill out a form (in other words how will the bot generate an appropriate query string)? If a form is the only entry point to a collection of information, that information is currently inaccessible to crawlers.
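
    A small Python illustration of the difference, under the assumption of a very simple crawler that does no JavaScript and no form filling. The class name and sample page are made up.

```python
from html.parser import HTMLParser

class CrawlParser(HTMLParser):
    """Separate what a simple crawler can follow from what it cannot."""

    def __init__(self):
        super().__init__()
        self.followable = []   # plain links: just GET them
        self.opaque = []       # form targets: no way to guess the inputs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.followable.append(attrs["href"])
        elif tag == "form":
            self.opaque.append(attrs.get("action", ""))

page = '''
<a href="/lyrics.php?song=42">Hotel California</a>
<form action="/search.php" method="get"><input name="q"></form>
'''
crawl = CrawlParser()
crawl.feed(page)
print("can follow:", crawl.followable)
print("stuck at:  ", crawl.opaque)
```

    The link with a query string is as followable as any static page. The form target is known, but without a value for q the crawler has nothing sensible to request.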

  193. MERGING Domain and URL schemes by Ixany · · Score: 1

    I can't think of a single good reason NOT to merge the domain and URL structures.

    Pick a URL on this site, say /about_us. Use the new inverted-node notation.

    meta/web/design/companies/antistatic/about_us/ixany

    See how the URL blends into the hierarchy? That's *GOOD*. A given server somewhere should have control over a certain region of the hierarchy (I might serve from "antistatic" down in this example. I might even delegate some of it!)

    In addition, a redirect would make meta/web/design/com/ synonymous with meta/web/design/companies/ as an abbreviation. My email would be smtp://meta/web/design/companies/antistatic/ixany. My web page would be at http://meta/web/design/companies/antistatic/ixany. My resume would be at http://meta/web/design/companies/antistatic/ixany/resume/. So if you typed telnet://meta/web/design/companies/antistatic/oblique into your browser, you'd know where you're headed. I'd serve DNS for meta/web/design/companies/antistatic/ on down, and delegate everything for meta/web/design/companies/antistatic/oblique on down to oblique, so that you could mail an oblique.antistatic.com user at smtp://meta/web/design/companies/antistatic/oblique/username.

    That is, I'd have control over a certain node of a big tree structure. I'd *give* control of sub-trees to actual branches and leaves that make sense in an information sort of way.

    Our current URL scheme wants to specify a hierarchy inside a hierarchy. But the problem is it must make obvious the fact that the outside hierarchy takes one kind of query to provide, and the inside hierarchy takes another. But it's no longer useful to separate these -- it's just a way to organize trees of information, after all, in which one tree is rooted in another. It seems more and more that should be transparent, and that brings to the table other issues, such as the current lack-of-information provided by current use of the domain name system.

    This looks like a lot of typing -- but look how you can get to information directly! That's a huge win. Also, the hierarchy could be browsed from the top on down, getting closer to where you want to be in sequential steps, rather than the search engine paradigm where you're often getting a lot further away as you go.

  194. Site Design and Content searching by BitS · · Score: 1

    Perhaps this is not a search engine problem at all? Sounds more like bad site design. A database is meant to store data that is changing or needs to be searched often. Anyone who designs a static site using a complete database backend with cgi,php3,whatthefsckever is acting a lot like Microsoft: adding fluff for absolutely NO reason, and only getting back more bugs and slower performance while increasing the required resources.

    Pages with dynamic content, pulled from a database, should not be indexed in the first place. They are not static; they may change the INSTANT the search engine is done with the page, so how is the search engine supposed to return predictable results?

    Finally, if you can't figure out how to change your extensions so that index.html is interpreted as a php3 script, well, that's your own problem; become a real admin and that won't affect you. If you're not the admin, your ISP needs to get a clue and help you.

    If you expect to use content searching, you're suggesting that the page has some content metatags or the like... this is fine and dandy, except I doubt you'll agree with what everyone else considers a "content" type... for instance, you search for "Adult Art", possibly expecting back some ART (not porn), and instead you get 2.6 million entries for www.hardporn.com... that just doesn't work. That, and the fact that businesses will just put every everyday keyword they can think of in their page, so you find it in searches that are completely unrelated.

    Perhaps I'm acting as an eleetist, but this IMO is what happens when you have 20,000 MCSEs that THINK they know how the internet works, go grab FrontPage and ColdFusion and write database based websites all day long, with completely static content. All because they are too lazy to index the site themselves. I'm GUILTY of this myself: my website is entirely database driven, and most of the content in the database will NEVER change. If I put a little more time into it, it would be rather easy to write out old information to static html on a regular basis, allowing those pages to be RELIABLY searched by the global search engines.

    Just my $0.02

    --
    http://www.schizo.com/
  195. Here's my approach by Anonymous Coward · · Score: 1

    I've found this technique very effective:

    I use apache and mod_rewrite so I can serve URLs that look like static web pages to everyone (search bots included), but are really database backed dynamic data.

    Nightly I generate a hierarchical static index of all the dynamic pages on the site that I want indexed. A link to this index is located on every page on the site by an invisible HREF (one that bots will pick up on, but people with browsers will never see). These faux-indexes contain just href's (invisible) to all the pages on the site I want indexed, but the indexes contain absolutely no text because I don't want the bot index to be indexed. (I also use the ROBOTS META tag on these bot indexes to keep them from being indexed themselves.)
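
    The nightly job isn't shown above, so this is just a guess at its shape in Python: a page of empty anchors plus a ROBOTS META tag saying "follow the links, but don't index this page". The function name and URLs are made up, and where the list of URLs comes from (presumably the database) is left out.

```python
from html import escape

def build_link_index(urls):
    """Return an HTML page that bots may follow but should not index."""
    links = "\n".join(f'<a href="{escape(u, quote=True)}"></a>' for u in urls)
    return (
        "<html><head>\n"
        '<meta name="robots" content="noindex,follow">\n'
        "</head><body>\n"
        f"{links}\n"
        "</body></html>\n"
    )

page = build_link_index(["/articles/1", "/articles/2?page=1&sort=date"])
print(page)
```

    A cron job could write the returned string to a static file each night; the anchors carry no text, so a person who lands there sees a blank page.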

  196. One tiny piece of advice that's never followed by Anonymous Coward · · Score: 0

    You can take care of filtering by file extension very easily: don't put the file extension in the URI. It's a very bad idea; it effectively puts the type of the object in its name, so if you ever go to change the type (as GIF to PNG) or the server handling (as .html to .shtml to .asp to .php) you have to change all references to it. And Cool URIs don't change.

  197. Google does a fine job with this... by Anonymous Coward · · Score: 0

    Searching for device driver on Google gives many fine results.

  198. Don't Like It? Hey, build it yourself. by HerrNewton · · Score: 2

    Standard disclaimer: IANALG (I am not a Linux geek.) Rather, I'm a web design geek. So please, be nice.

    From what I understand, what a lot of the OpenSource movement is about is doing it yourself if you don't like how it's being done now. Don't like commercial Unix, Linus? Make your own fscking Linux and let everyone contribute. Oh yeah: Give it away free to really piss people off.

    In this discussion, there are a ton of excellent ideas for how search engines should operate. Yet no one, to my knowledge, has put forward the next logical step: Build our own search engine. Google is a good start but hey, I know you guys could build it better. Worried about hardware and bandwidth costs? Venture capital.

    As I said, do it yourself. :-)

    --

    ----
    Am I the only one who thinks Microsoft is a misnomer? Perhaps Macrosoft would be a better fit?
  199. Web of trust: Bibliography by rho · · Score: 1

    This is more a concept than an answer, but my thinking on search engines and their ilk is that they are becoming (if not already are) useless. 800 million web sites? Two billion? Twelve trillion? How far will it go? Who knows? We could eat up every MIP of processing and every bit of bandwidth trying to keep current in search engine indexing... and in the end, you'll have a mess.

    The answer? I don't know, but I have an idea. Berners-Lee talked in his book about a "web of trust" -- mostly talking about security and e-commerce and such -- but the concept can be expanded to apply here.

    For example, I trust /. to provide me with useful, timely information, and act as a great resource for all things nerdly. If /. provided a search engine for a few specific sites that the /. content owners felt were worthy of inclusion, I'd use it quite a bit. /. becomes responsible for maintaining those connections, and monitoring the output to ensure relevancy. The outside content owners provide hooks into their data, tailored to the idiosyncrasies of the /. community (plenty of RMS, no MSG).

    Censorship? Depends on your definition. If you trust /. to provide good info, you also trust (implicitly, if not overtly) their editorial judgements. It's a human-to-human connection, facilitated, not replaced, by the computer.

    I liken it to a bibliography. When I do dead-trees library research, I like to find the appropriate section, pull a book down and skim it. If it looks appropriate, I'll then check the bibliography to see what other books the authors found appropriate. Hey, they've just done research for me! Neat! Go to those books, the books under those, etc. I now have a web of sources, all culled from a (basically) random book pulled from the shelves.

    Expanding that to the Web, /. trusts The Onion to provide the latest in useful headlines around the world (I know I do...). The Onion provides, through a "Bibliomatic" link (TM, (c) me me me me ... are you reading this Amazon? :), hooks into their data with published calls to pull appropriate, timely keywords relating to their content, with the ability to search archived content as well.

    Everything goes swimmingly, until The Onion IPOs and starts to be run by MBAs, and the content-o-meter drops to zero. Mr. Taco gets inundated with a million emails complaining about how the /. results for The Onion all return "Make $$$ Fast with 18 year old transvestites having anal sex with dogs". Taco dumps all calls to The Onion's content, fires off a letter threatening Armageddon, informs them that they are off the list, and starts using somebody else.

    Some things we got right the first time around. Car doors that open forward (not up), radios with volume dials (not tiny, fiddly buttons) with a real potentiometer behind them, and bibliographies. Those search engines that don't incorporate at least some aspects of this become obsolete, or relegated to searching for obscure content.

    What about mailing list archives? Those are GREAT resources -- better than the FAQ usually. Getting to that data is more problematic, but not impossible. You can index the subject lines and provide a hook to that -- if /. chooses to use that hook, great. You can index the whole mess and force content owners to search it -- which will put you on the blacklist pretty quickly when people get a jillion results that all have "Im hafing problems wif Winblows" as their title. Or you can send a link with each possibly appropriate query to your own search engine that will locally search the mess.

    I think the One True Search Engine is a pipe dream. As for myself, I try Google, Yahoo, and AltaVista in that order. Most of the time, I'm looking for someplace I've already been, and can't remember the URL, so I can customize my query to bring that particular site up in the rankings. I've tried doing pure general searches, but I'm always daunted by "Your query returned 12,486 results." Yeah, right.

    For more on-topic information, try Philip's book. He had the same problem as discussed here, and he solved it with a few lines of TCL code inside AOLserver. That's the short answer...

    --
    Potato chips are a by-yourself food.
  200. XML-RPC by DHartung · · Score: 3

    Don't overlook XML-RPC, which builds on the XML spec to provide a way of serving data over the web to remote clients.
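
    To make that concrete, here is a rough sketch of what an XML-RPC exchange looks like on the wire, using Python's standard xmlrpc.client module purely to marshal and unmarshal the messages. The method name catalog.search, its arguments, and the result set are all made up for illustration; no real service is implied.

```python
import xmlrpc.client

# Hypothetical method name and arguments; no real service is implied.
request_xml = xmlrpc.client.dumps(("dynamic content", 10),
                                  methodname="catalog.search")

# What the server side would recover from the request body.
params, method = xmlrpc.client.loads(request_xml)

# A made-up result set, marshalled as an XML-RPC response.
results = [{"title": "Story 7", "url": "http://db.example/story?id=7"}]
response_xml = xmlrpc.client.dumps((results,), methodresponse=True)

# What the client would recover from the response body.
(decoded_results,), _ = xmlrpc.client.loads(response_xml)
```

    Nothing here touches the network; in practice the request XML would be POSTed over HTTP and the response read back the same way.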

    Then there's RSS, which is a way of serving up a news channel or other changing data. These applications are here and in use. Together, these XML-based technologies will someday provide the data layer for the software agents of the future. Read lately about that new "price-checker" technology? Imagine being the one business that doesn't serve up your product list and pricing to that agent.

    An interface from XML to these "hidden" databases is only a matter of time. We're just caught right now at a moment between technologies: the authoring tools don't really exist.

    --
    lake effect weblog
    {Network engineer in Chicago--looking for work!}
  201. Apache Directive Work Around by BlueTooth · · Score: 2

    PHP Builder ran an article describing how you can have the Apache web server treat a certain "directory" as a script, using the Location directive. So if I had a script called www.mydomain.com/foo, I could request www.mydomain.com/foo/param1/param2 and the foo script would run, using environment variables to find the "path" foo/param1/param2. I tried it, and it works quite well. This hides GET parameters as "paths" so that search engines don't think the pages are dynamic (this is how Amazon.com works).
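
    The script side of this can be sketched roughly as follows. This is Python rather than PHP, and it assumes the script runs under the usual CGI convention, where the part of the URL after the script name arrives in the PATH_INFO environment variable; the function names are mine, not from the article.

```python
import os


def params_from_path(path_info):
    """Split '/param1/param2' into ['param1', 'param2'], ignoring empty segments."""
    return [segment for segment in path_info.split("/") if segment]


def current_request_params(environ=os.environ):
    """Read the extra path from PATH_INFO (empty when the script is hit directly)."""
    return params_from_path(environ.get("PATH_INFO", ""))
```

    For a request to /foo/param1/param2, PATH_INFO would be /param1/param2, so the script gets its two parameters without a query string ever appearing in the URL.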

    --
    SPAM
  202. Re:Specialized Engines - an example by RobRuminant · · Score: 1

    It's already started, in some categories.
    Mostly media searches so far, but it will expand. An example I found the other day: Sourcebank, for code and research papers.

    The next problem will be finding all the different specialized search engines. But surely someone will make a search engine for that. :)

  203. Don't worry about this by Rainy · · Score: 1
    This article implies that there's a tendency to move all sites over onto databases. That's simply not true. The vast majority of sites don't change often enough to justify that. There's a certain threshold after which it makes sense to spend money and effort on converting to a database-driven site, but only ~0.5% of sites, or far fewer, lie beyond that threshold. For the rest of them, going database-driven would be a waste of resources.

    On the other hand, you don't want to index database driven sites at all. First of all, that'd be impossible technically. The best search engines currently index something like 25% of the web if not less, and are able to re-check these pages only once a month or so.

    The practical solution is to have a database site (if necessary) that uses the database only for dynamic content. I.e., if it's a /. FAQ page or ABOUT page or something else you want to be searchable, make it a static page, while articles/comments should be dynamic.

    --
    -- ATTENTION: do not read this sig. It doesn't say much.
    1. Re:Don't worry about this by BlueTooth · · Score: 1

      Yes, but within that ~0.5% (I don't even know how accurate this is) you have some of the largest sites with the most information. I agree that it is useless to index dynamic pages (like a poll, or shopping cart), but a lot of database driven sites just use a database to ease the management of a whole lot of essentially static content (as web servers have sped up, you see less and less pre-generation, and more and more pages created on the fly). The intermediate step is to get search engines to try to handle "dynamic" content better, but in the long run, it would be nice if a database driven site could provide hooks to its own search engine. Then, you could categorize a whole bunch of dynamic content with keywords. If a user searches for something matching those keywords, then the original search engine passes the search to the database-driven site's search engine, and displays the results as if they came from the main search engine in the first place.
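
      A minimal sketch of that hook idea, with everything in memory. The class and function names, the example URLs, and the naive substring matching are all assumptions for illustration; a real engine would call out to the site over the network.

```python
def make_site_search(documents):
    """Build a search function over one site's own {url: text} store."""
    def site_search(query):
        words = query.lower().split()
        return [url for url, text in documents.items()
                if all(word in text.lower() for word in words)]
    return site_search


class MainEngine:
    """A crawler-built index plus hooks into sites' own search engines."""

    def __init__(self, static_index):
        self.static_index = static_index  # {url: text} gathered by crawling
        self.hooks = []                   # (keyword set, site search function)

    def register_hook(self, keywords, site_search):
        self.hooks.append(({k.lower() for k in keywords}, site_search))

    def search(self, query):
        words = set(query.lower().split())
        results = [url for url, text in self.static_index.items()
                   if all(word in text.lower() for word in words)]
        # Pass the query on to any site whose keywords it matches,
        # and present those hits alongside our own.
        for keywords, site_search in self.hooks:
            if words & keywords:
                results.extend(site_search(query))
        return results
```

      The main engine never has to crawl the database-driven pages; it only needs to know which keywords should trigger a call to the site's hook.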

      --
      SPAM
  204. PHBs vs Web Developers by JamesKPolk · · Score: 1

    A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.

    Oh, I wouldn't blame that solely on PHB types... Every time on slashdot that someone points out that HTML was meant for a logical or content-based tagging system, 3 people pipe up and say "But you can't get a good looking site that way!"

    It's the other way around. If nobody had ever abused HTML, and Netscape and Microsoft extensions of HTML, no PHB would ever have known that a web page could be a graphical monster.

  205. Re:Shouldn't we use the right extension for the fi by Mike+A. · · Score: 1

    This is a bit offtopic now, but d*mn, I'd love to be able to retrieve Slashdot comments via XML. Then I could reformat them to taste, and e.g. lose those horrible ugly colors they're now using for Your Rights Online stories...

    --
    Do I look like I speak for my employer?
  206. Some comments by wheezy · · Score: 1

    Okay, I know a bunch about this, since it's an important part of what I'm studying... Pay attention.

    First of all, the Internet -- porn or not -- is growing at an absurd rate. Any single search engine, regardless of how good its ranking algorithm is, will not be able to keep up with either new or more difficult-to-use technologies (such as databases, as the post mentions). Some of my research is directed towards the idea of distributed indexing. I can't get into it now, but imagine Napster except with metadata instead of MP3s. These distributed mini-engines would know how to answer very specific queries (some would know how to deal with databases, some with PHP, some with other mini-engines, and so forth). It's a pretty complicated idea, and has some problems (such as response time for searches), but it is the only really scalable solution for the growing Internet.

    Second, XML is becoming more prevalent on the Internet in general (see Apache XML), but unfortunately is not quite there yet. However, as a poster alluded to, RDF (an XML flavor used to describe site metadata) is usable today. The state of RDF, however, is that it's currently used more for things like Slashboxes than for web spidering.

    Anyway, be sure to keep an eye out. Expect things to change dramatically in the next year or so. The Internet is still a baby, and it's just now learning to walk...

  207. Minor cheapshot, admittedly by Chris+Johnson · · Score: 2

    ...but you made an elaborate site entirely dependent on Microsoft Active Server Pages, and you're expecting it to work with _any_ web standards, much less be indexable and spiderable as if it was proper HTML? I'm afraid that you stepped right into that one. Look on the bright side: were it not for this Ask Slashdot article, you might never have known you weren't indexable, as this seems to be a little-known fact! That alone is rather shocking.

  208. Re:Searchability should be the site's responsibili by medcalf · · Score: 1

    While I agree with your central point, I think that it goes a little beyond this.


    Any scheme which relies upon a central authority or site to manage or index web content will fail because the web gets too big too fast. This means that sites must be responsible for identifying themselves, and in a distributed way.


    One way to do this would be to have a DNS-like system of distributed "web-index" servers, which describe the sites and their content as known to that server. Then, each of the web-index servers could gain information from local web pages, and report it up some hierarchy with a known root. You would then be able to find sites by content type, by specific knowledge areas (Dewey decimal website classification?) or whatever, depending on how the standard is defined.
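
    As a rough sketch of how that might hang together (the class and method names are hypothetical, and everything lives in one process here rather than on separate servers): each server keeps its own sites, tells its parents only which categories it knows about, and lookups are delegated back down the tree.

```python
from collections import defaultdict


class WebIndexServer:
    """One node in a DNS-like tree of index servers."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.local_sites = defaultdict(set)    # category -> sites indexed here
        self.children_with = defaultdict(set)  # category -> children to ask

    def index_site(self, category, url):
        """Record a local site and report the category up to the root."""
        self.local_sites[category].add(url)
        child, node = self, self.parent
        while node is not None:
            node.children_with[category].add(child)
            child, node = node, node.parent

    def find(self, category):
        """Collect sites for a category from this server and those below it."""
        found = set(self.local_sites.get(category, ()))
        for child in self.children_with.get(category, ()):
            found |= child.find(category)
        return found
```

    Only a summary travels upward, so the root stays small; the actual site lists remain with the servers closest to the content.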


    This has the advantage of distributing the load, increasing the likelihood of finding what you want quickly, and allowing easy site hiding (by obscurity) for non-general-use sites.

    --
    -- Two men say they're Jesus. One of them must be wrong. - Dire Straits
  209. Hard to control by spuk · · Score: 1

    So maybe this means that it is very difficult for anyone to ever have some sort of control over it. Which I think is good.

    --

    "Video bona proboque; deteriora sequor." -- Ovid
  210. News search engines (Re:Database driven web pages) by MattJ · · Score: 1
    "Just to illustrate this take any given news site. Today they might have articles about Clinton, tomorrow it might be news about a big fire. Search engines can't just direct you to those sites based on queries because who knows what data they have."

    Well, what you need is a search engine for news. One that is constantly crawling news sites so you can search on "Clinton" or "fire" and get current or recent results.

    Shameless plug: My own site, NewsBlip.com, is just coming out of beta now. Fast searching now, more features coming. Built on Open Source (Apache, PHP, etc.). End of shameless plug.

  211. A combination of three ideas... by The+Viking · · Score: 1

    I'm no expert, but three things come to mind:

    1. An open standard that defines ways that content is indexed, and defines a standard interface to the indexed content, no matter how that data is stored (i.e. relational database, XML, text, etc...).
    2. A new language for searching the indexed content (in much the same way that SQL was developed to access relational databases).
    3. A distributed system which allows each site to be authoritative for the content of the site (much like DNS). Each site could be responsible for providing a "search server" which would expose the standard interface to the indexes mentioned in my first idea. There could be "root servers" that are specific to certain types of content, where each root server could refer clients to the "search servers" that expose the type of data the client is seeking. This has the advantage of distributing the processing load, which should allow it to scale well.

    Am I totally out in left field or what? It just seems like we need a basic paradigm shift from the current [clunky] search methods. I see parallels between the problems that led to DNS, and the problems we face dealing with the rapidly growing quantity of available content on the Net.
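
    Here is a very rough sketch of the first and third ideas, just to show the shape. All of the class names are hypothetical, a bare search term stands in for the query language of the second idea, and the two storage backends are toy stand-ins for a relational table and a pile of text files.

```python
from abc import ABC, abstractmethod


class SearchServer(ABC):
    """One interface, whatever the storage behind it."""
    content_type = "unknown"

    @abstractmethod
    def search(self, term):
        """Return a list of URLs matching term."""


class TableSearchServer(SearchServer):
    """Stands in for a site backed by a table of (url, title) rows."""

    def __init__(self, content_type, rows):
        self.content_type = content_type
        self.rows = rows

    def search(self, term):
        return [url for url, title in self.rows if term.lower() in title.lower()]


class TextSearchServer(SearchServer):
    """Stands in for a site that keeps plain text pages, keyed by URL."""

    def __init__(self, content_type, pages):
        self.content_type = content_type
        self.pages = pages

    def search(self, term):
        return [url for url, text in self.pages.items() if term.lower() in text.lower()]


class RootServer:
    """Refers clients to the search servers for a given content type."""

    def __init__(self):
        self.by_type = {}

    def register(self, server):
        self.by_type.setdefault(server.content_type, []).append(server)

    def refer(self, content_type):
        return list(self.by_type.get(content_type, []))


def federated_search(root, content_type, term):
    """Ask the root for referrals, then query each referred server."""
    hits = []
    for server in root.refer(content_type):
        hits.extend(server.search(term))
    return hits
```

    The client never needs to know how a site stores its data; it only needs a referral from the root and the common search call.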

  212. Re:Distributed Databases? -> Mobile Agents by Anonymous Coward · · Score: 0

    Or get the info from everybody.
    With agent-technology, you can provide your agent(s) with information you like. Your agent will negotiate with other agents and come back with results. According to your rating, the provider of those results is rated higher or lower, resulting in some sort of social context.
    The advantage of this is that you don't rely on the contents of pages (which can easily be modified to provide a maximum of hits, while not being related), but on the opinion of others.

    The beta has just been released on www.tryllian.com

  213. Index by type of content by Syberghost · · Score: 1

    Much as I hate to start a new topic here, I should point out that the original poster's suggestion that we index sites by the type of content they provide, instead of the actual content, is called "Yahoo".

    Been there, done that, and it's occasionally useful, but usually not.

  214. Dynamic content isn't the only problem by Darth+Null · · Score: 1

    Ultimately, Web-wide searching will fail, and not just because of database-driven Web sites. There are a number of reasons why:

    • The Web is too big and it may be growing faster than new search technologies can be developed to keep up with it. Last I heard, even the largest index had only catalogued about 20% or so of existing sites, and about 25% of the Web was unindexed by any engine.
    • The Web is too diverse in terms of subject matter. There exist bibliographic databases containing hundreds of thousands or even millions of citations on a topic as narrow as AIDS; Web search engines index every subject imaginable. This means that keywords that have meaning in multiple subject areas lose a lot of their value.
    • Most Web pages are noise. They aren't properly marked up to denote their structure, so an indexing program can't differentiate the main headings from body text, so a term mentioned in passing is given as much weight as a term in the main heading of the document. Also, most authors fail to include any meta information, let alone adhere to a common standard like Dublin Core.
    • Most documents aren't static anyway. They appear, disappear and change. Few people use the robots exclusion protocol to identify these. Also, many if not most documents on the Web are navigational documents; indexers don't really differentiate between navigational pages and content-bearing pages. Sometimes, you want to find someone's Web site and other times you want to find a specific piece of information or the location of a resource like "today's headlines", but in a search engine, they're all intermixed.
    • Many useful resources are in formats like Adobe PDF or JPEG that are not keyword indexable. To some extent, replacing these formats with XML will solve this problem, but only if XML is widely adopted and properly used.
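
    On the markup point in the list above: when a page is properly structured, an indexer can weight a term by where it appears. A minimal sketch using Python's standard html.parser follows; the tag weights are arbitrary assumptions, not anything a real engine is known to use.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights; text outside these tags counts as 1.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3}


class WeightedTermCounter(HTMLParser):
    """Score each term by the heaviest enclosing tag it appears in."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed ones.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.open_tags), default=1)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[word] += weight


def score_terms(html):
    counter = WeightedTermCounter()
    counter.feed(html)
    return counter.scores
```

    On a page with no headings at all, every term scores the same, which is exactly the problem described.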

    Attempts to index the Web can best be described as an attempt to index and catalogue the largest, most diverse and most frequently-changing collection of documents, which adhere to no common standards of self-identification or description whatsoever, by people who generally have no training or experience in cataloguing or indexing, for people who generally have no training or experience in database searching, and hoping that somehow, everything will work out.

    That it has worked this well up to now is a testament to the creativity and ingenuity of the developers of indexing and search engine technology. Still, when compared to a professionally-organized database like a library catalogue or bibliographic subject database, the Web's search facilities are incredibly primitive and while it's easy enough to find a known item, it's basically impossible to do an extensive subject search or even to find a few of the most relevant resources on a particular topic without already knowing what those resources are.

    Eventually, we're going to have to rethink how we index the Web, and this will involve making decisions about breaking the Web into manageable pieces and deciding exactly what kinds of things we want to catalogue/index in the first place.

  215. Junky commercial sites screwing up search engines by BeanThere · · Score: 1

    Personally I think the biggest problem working towards making search engines useless is junky commercial sites that offer nothing worth my while. Type in a few keywords to find something and end up with a few hundred:
    (a) dead links
    (b) rubbishy commercial sites that aren't related to what you're looking for
    (c) home pages that look like exactly what you're looking for - at a glance - but turn out to contain less than a few scraps of useful stuff.

  216. web search misspellings by decomp · · Score: 1
    The frequency of misspellings in searches is a large part of this amusing (old) article on the Magellan Voyeur...

    "A mere half-hour's perusal of the Voyeur turned up one sad goof after another:

    super modles
    sex weied
    streaptease
    sesamie street
    necked asain women
    wallsreet journals

    and my favorite --

    christian boardcasting network"

    I suspect that the percentage of sex-related searches is correlated to time of day and day of week. The low percentage I witnessed probably had something to do with the fact that I was checking mid-day Tuesday. I bet it goes up a lot on Fri & Sat, especially at night.


    ______________________(
    // ///#\)

  217. simple confusion and corruption by serialk · · Score: 1

    The programs and indexers get confused and are corrupted by junky, crappy sites. The web has, in a certain respect, become unsearchable because of sites that claim to have "unbiased" links when many of them are paid ads. The web was built on quality links, and mostly still is. You can find what you want, but with paid searches you are limited to what they give you, which is horrible. Every time you are given paid results you should be told so specifically, or else they are deceiving you, and this has been going on for years.

  218. Obsolete Search Engines by olkeith · · Score: 1

    Traditional search engines don't work for many reasons. First of all, they can't keep up with the staggering expansion of the web. Second, they seem to be more interested in ad revenue than performing a service. I for one am sick of banners and animated GIFs, so I did something about it and designed a business card engine using Linux, PHP and MySQL that is user supported and cheap! Now I need the support of the Linux community to help kick it off and provide meaningful feedback/comments. www.cards411.com

  219. Re: count every clicked link? - NOT by Anonymous Coward · · Score: 0

    Count the *last* link the heaviest.

    It's where either the person got fed-up, or found the information they were looking for.
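
    A quick sketch of that weighting, assuming the engine logs the ordered list of results each searcher clicked. The function name and the 3-to-1 weighting are made up for illustration.

```python
from collections import defaultdict


def score_clicks(sessions, last_click_weight=3.0, other_click_weight=1.0):
    """Score URLs from click logs, counting each session's final click heaviest.

    sessions is a list of click sequences, one per search, in click order.
    """
    scores = defaultdict(float)
    for clicks in sessions:
        for position, url in enumerate(clicks):
            is_last = position == len(clicks) - 1
            scores[url] += last_click_weight if is_last else other_click_weight
    return dict(scores)
```

    Note that the log alone can't tell the two cases apart: a last click from someone who gave up scores the same as one from someone who found the answer.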

    -- Ender Duke_of_URL