Slashdot Mirror


Is the Internet Becoming Unsearchable?

wergild asks: "With more and more sites going to a database-driven design, and most search engines not indexing anything that contains a query string, we're missing a lot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this? How can you use dynamic, database-driven content and still get it indexed by the major search engines?" Is keyword searching obsolete? Do you think it's time to index sites by the type of content they carry rather than the content itself? Will larger indexing databases (or a series of smaller, decentralized ones) help?

1 of 313 comments

  1. Spider traps ... by charlie · · Score: 4
    Many years ago (1994? 1993?) I wrote a web spider. (Crap back end, though, so I dropped it. The bones are on my website.)

    Some time later, it occurred to me to try to monitor the efficiency of web indexing tools using a spider trap.

    The methodology is like this:

    • Write a perl module (or equivalent) that generates realistic-looking text using Markov chaining based off a database. Text generated should be deterministic when seeded with a URL.
    • Write a CGI program that uses PATH_INFO to encode additional metainformation. Have it eat the output from the text generator and insert URLs that point back to itself, with additional pathname components appended.
    • If the spider follows a link, it will be presented with another page generated by the same CGI script, whose text differs from the text of the original page in a repeatable manner.
    • Child pages should contain links that point inside the web site; you could do this by making the CGI program the root of your "document tree". Better yet, run multiple virtual servers and include URLs bouncing between the domains -- all of which are mapped onto the same script.
    • Stick this thing up on the web and wait for the crawlers to come. They will see a tree of realistic-looking HTML with internal links, digest, and index it.
    • You can now analyse your logs and monitor the robot's behaviour (e.g. by changing the type, frequency, and destination of links your text includes). You can also search the search engines for references back into your document tree and work up some metrics to measure just how accurately it's been indexed (e.g. by re-generating the text of a page and feeding it to the search engine and seeing what comes back -- which words are indexed and which are ignored).
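
    The first two steps above can be sketched roughly as follows, in Python rather than Perl. This is only an illustration, not the spider trap described in the post: the tiny corpus, the `/trap.cgi` script name, and the function names `build_chain`, `rng_for`, `generate_text`, and `render_page` are all made up for the example. It builds a word-level Markov chain, seeds a random generator from the request path so the same URL always yields the same page, and appends extra path components to form links back into the script.

```python
import hashlib
import random

# Hypothetical toy corpus; a real trap would draw on a much larger text database.
CORPUS = (
    "the spider follows every link on the page and the index grows "
    "while the server logs every request and the crawler reads the page"
).split()


def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain


def rng_for(path):
    """Derive a repeatable random generator from the request path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def generate_text(chain, rng, length=40):
    """Walk the chain for `length` words, restarting at a random key on dead ends."""
    keys = sorted(chain)
    word = rng.choice(keys)
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        word = rng.choice(followers) if followers else rng.choice(keys)
        out.append(word)
    return " ".join(out)


def render_page(path_info, chain, script="/trap.cgi", n_links=3):
    """Build an HTML page whose text and links depend only on path_info."""
    rng = rng_for(path_info)
    text = generate_text(chain, rng)
    base = path_info.rstrip("/")
    # Each link appends one more path component, pointing back at the script.
    links = [f"{script}{base}/{rng.randrange(10**6):06d}" for _ in range(n_links)]
    items = "\n".join(f'<li><a href="{url}">{url}</a></li>' for url in links)
    return f"<html><body><p>{text}</p>\n<ul>\n{items}\n</ul></body></html>"


if __name__ == "__main__":
    print(render_page("/start", build_chain(CORPUS)))
```

    Because the page depends only on the path, any page a crawler fetched can be regenerated later and compared against what a search engine reports for it. A real CGI wrapper would read the path from the `PATH_INFO` environment variable and print a `Content-Type` header before the page.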

    Anyone done this? I'm particularly interested in knowing how spiders handle large websites -- have been ever since I was doing a contract job on Hampshire County Council's Hantsweb site a few years ago and caught AltaVista's spider scanning through a 250,000 document web that at the time had only a 64K connection to the outside world. (Do the math! :)
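
    For anyone who wants the math spelled out, here is a back-of-the-envelope sketch. Only the 250,000 documents and the "64K" figure come from the post; reading "64K" as 64 kbit/s and the 10 KB average document size are assumptions, and protocol overhead and other traffic on the line are ignored.

```python
DOCUMENTS = 250_000            # from the post
LINK_BITS_PER_SECOND = 64_000  # assumption: "64K" means 64 kbit/s
AVG_DOC_BYTES = 10 * 1024      # assumption: the post gives no document size


def crawl_seconds(documents, avg_doc_bytes, link_bps):
    """Time to transfer every document once over a fully saturated link."""
    return documents * avg_doc_bytes * 8 / link_bps


seconds = crawl_seconds(DOCUMENTS, AVG_DOC_BYTES, LINK_BITS_PER_SECOND)
print(f"{seconds:.0f} seconds, or about {seconds / 86_400:.1f} days")
```

    Under those assumptions a single full crawl would monopolise the link for 320,000 seconds, a little under four days.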