Vector Space Search Engines in Perl
cascadefx writes "There is a great article on writing a Vector Space Search Engine in Perl over at the Orielly Network's Perl.com site. Most search engines, such as Google, use a reverse keyword index algorithm. The article, complete with code, proves yet again that there is (definitely) more than one way to do it. Try your hand. Add improvements and you may have a handy search engine for your site or... the next Google-killer."
hmm i suggest you all have a read of this laymen's guide to latent semantic indexing, the vector model on steriods as the author puts it.
one killer feature of this is...
Quote "Looking for articles about Tiger Woods in the same database brings up many stories about the golfer, followed by articles about major golf tournaments that don't mention his name. Constraining the search to days when no articles were written about Tiger Woods still brings up stories about golf tournaments and well-known players." -- bloody amazing and relativly simple to do.
and it seems to be scaling very well at the moment..
I hadn't heard of either technique - if you have some links, I'd love to look at them.
;-)
You're completely right about web-size collections being too big to hold in RAM. It's a question of defining the problem space. My article was targeted for the hobbyist, who is far more likely to be building a search engine for a website (thousands of pages) than for the entire web.
Presumably, if you're trying to index the web, you won't be writing your search code in Perl
That's not to say you can't scale vector-space search engines. But it makes things complex. There is a lot of handwaving in the article, to try and keep things simple. One example, as you point out, is neglecting to mention practical ways of reducing vector size (number of keywords). That problem goes away if you use latent semantic analysis, which reduces the dimensionality of the vectors by several orders of magnitude. And as a nice side effect, you get a search engine with better recall, that can give you relevant results without a keyword match.
But LSI has its own scaling issues, and those scaling issues have their own workarounds, and so on and so on. Put it all in an introductory article, and it gets overwhelming.
Still, I think there's room for novel algorithmic approaches in the 100-100K document range. To give you one example, we run an LSI search engine on a database of old Civil War articles for the University of Virginia. There's no way the search could scale to ten million documents, but it doesn't have to. The collection is bounded.
I suspect other people have interesting collections in that kind of size range. Not everything has to scale to the billions of documents to be useful.
A customer service representative will be with me shortly.