Search Engines Can't Keep Up
joshwa writes "The Boston Globe today reported on a study in Nature saying that search engines barely index one-sixth of the pages on the net. To a certain extent it's a plug for the Northern Light search engine, which claims to be the most comprehensive (at a staggering 16 percent of the web), but it's an interesting read nonetheless."
The percentage of indexed web pages is small, but the amount of data that represents is pretty staggering. An encyclopedia or other reference book can cross-reference between a concept in the index and the places that concept appears in the body of the text. A web search engine has a much harder job (as do the people trying to use it). For an encyclopedia, some person does the indexing with an understanding of context, so for instance 'green' in the index would point to the entries on 'colours' and 'the spectrum' but not 'grass'. The web search engine blindly returns every instance of the word 'green' with no regard to context. So if the person was actually wondering how to make 'green' with his box of Crayolas (since his sister ate every shade of green in his box of 64), he'd either have to wade through each site until he found what he was after or choose a better search term.
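To make the 'green' problem concrete, here is a minimal sketch of that kind of blind keyword matching. The three pages, their file names, and the function names are all made up for illustration; a real engine is far more elaborate, but the context-free lookup is the point.

```python
from collections import defaultdict

# Hypothetical three-page "web", invented for this example.
PAGES = {
    "crayons.html": "mix blue and yellow crayons to make green",
    "lawn.html": "keep your grass green all summer",
    "spectrum.html": "green sits between blue and yellow in the spectrum",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, term):
    """Return every page containing the term, with no notion of context."""
    return sorted(index.get(term.lower(), set()))

index = build_index(PAGES)

# Every page mentioning 'green' comes back, grass included.
hits = search(index, "green")

# The only remedy the user has is a better search term: intersect two lookups.
narrower = sorted(set(search(index, "green")) & set(search(index, "crayons")))
```

Here `hits` holds all three pages, while `narrower` holds only the crayon page, which is the "choose a better search term" escape hatch.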
Machines aren't very good at being intelligent in this manner, so suppose a new search engine were created. You type in a search term and it comes back with a list of matching pages. You again wade through the list, but now you can also award relevance points to the pages that match most closely. This would work well for a while but would break down in the long run: as the web continues to expand, new pages start out unranked, so they never appear near the top of the ranked lists (at least for popular search terms), and so nobody sees them to rank them.
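A rough sketch of the relevance-points idea, and of how it starves new pages, might look like the following. The `VotedSearch` class, its method names, and the two-result "first page" are assumptions made up for the example, not a description of any existing engine.

```python
class VotedSearch:
    """Toy engine: keyword match, then order by user-awarded points."""

    def __init__(self):
        self.pages = {}   # url -> lowercased page text
        self.points = {}  # (term, url) -> total relevance points awarded

    def add_page(self, url, text):
        self.pages[url] = text.lower()

    def award(self, term, url, points):
        """Record relevance points a user gave a page for a search term."""
        key = (term.lower(), url)
        self.points[key] = self.points.get(key, 0) + points

    def search(self, term, limit=2):
        """Return up to `limit` matching pages, highest-voted first."""
        term = term.lower()
        matches = [u for u, text in self.pages.items() if term in text.split()]
        # Sort by points (descending), then by URL so ties are deterministic.
        matches.sort(key=lambda u: (-self.points.get((term, u), 0), u))
        return matches[:limit]

engine = VotedSearch()
engine.add_page("a.html", "green crayons for sale")
engine.add_page("b.html", "green grass seed")
engine.award("green", "a.html", 5)
engine.award("green", "b.html", 3)

# A newer, arguably better page arrives with zero points.
engine.add_page("c.html", "how to make green from blue and yellow")

first_page = engine.search("green", limit=2)
everything = engine.search("green", limit=3)
```

With these made-up votes, `first_page` contains only the two older pages; `c.html` shows up only when the limit is raised, so a user who stops at the first page never gets the chance to vote for it.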
What might work better would be a search by reduction. Type in some overgeneralized search term, and the text on each matching page is distilled down to a brief outline. There are already packages which can create fairly decent summaries of documents. You click a button that says "I like this, find me more like it", meaning that something in the summary appeals to you; the engine then generates a number of new, more specific search terms from the summary and comes up with a new list.
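As a sketch of how search by reduction could hang together, the code below uses a deliberately crude frequency-based summarizer as a stand-in for a real summarization package. The function names, the stopword list, and the sample page are all assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a much longer one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "are", "it", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def summarize(text, max_sentences=1):
    """Crude summary: keep the sentences whose words are most frequent on the page."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for w in tokenize(text) if w not in STOPWORDS)
    ranked = sorted(sentences, key=lambda s: -sum(freq[w] for w in tokenize(s)))
    return " ".join(ranked[:max_sentences])

def more_like_this(summary, original_query, max_terms=3):
    """Turn a summary the user liked into new, more specific search terms."""
    already_used = set(tokenize(original_query))
    counts = Counter(
        w for w in tokenize(summary)
        if w not in STOPWORDS and w not in already_used
    )
    return [word for word, _ in counts.most_common(max_terms)]

# Hypothetical page text returned for the overgeneralized query 'green'.
page = ("Mixing crayons is easy. "
        "Layer a blue crayon over a yellow crayon to make green. "
        "Press lightly.")

summary = summarize(page)
terms = more_like_this(summary, "green")

# The refined query the engine would run after "find me more like it".
new_query = " ".join(["green"] + terms)
```

The user only ever reads the one-sentence `summary`; clicking "find me more like it" would feed `new_query` back into the engine, so each round narrows the results without the user having to think up better terms.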