Slashdot Mirror


Vector Space Search Engines in Perl

cascadefx writes "There is a great article on writing a Vector Space Search Engine in Perl over at the Orielly Network's Perl.com site. Most search engines, such as Google, use a reverse keyword index algorithm. The article, complete with code, proves yet again that there is (definitely) more than one way to do it. Try your hand. Add improvements and you may have a handy search engine for your site or... the next Google-killer."

4 of 20 comments (clear)

  1. no good for large collections of documents by rpeppe · · Score: 4, Insightful
    As far as I can see, this algorithm runs in time proportional to the number of documents, which might be fine for a small number of documents, but would be abysmal in large environments.

    It's a pity the article doesn't make this clearer (instead, it just says "Searches take place in RAM, so there's no disk or database access", which makes it sound as if it would be faster!)

    There's a reason people invented indexing schemes!

    1. Re:no good for large collections of documents by FamousLongAgo · · Score: 4, Insightful

      [disclaimer: I wrote the article]

      Since when is linear scaling abysmal? Keep in mind that for very large collections, you can use clustering or other techniques to reduce the number of comparisons. You essentially layer an indexing step on top of things, so that you only search the most promising vectors.

      Keep in mind also that vector comparisons are very, very, very fast. People have run vector space search engines on collections in the millions without running into performance issues. And the claim that vector search is faster than a database search is true, for as long as you can fit all the document vectors into RAM.

      --

      A customer service representative will be with me shortly.
    2. Re:no good for large collections of documents by t · · Score: 3, Insightful
      Skimming the article, it seems to be quite similar to Self-Organizing Maps or Vector Quantizers. Care to comment?

      Also, I've worked on SOMs previously and there is no way you can hold all the document vectors in RAM. Think google, when you do a search on 3 million documents. The part that really hurts is the number of possible keywords.

    3. Re:no good for large collections of documents by FamousLongAgo · · Score: 3, Informative

      I hadn't heard of either technique - if you have some links, I'd love to look at them.

      You're completely right about web-size collections being too big to hold in RAM. It's a question of defining the problem space. My article was targeted for the hobbyist, who is far more likely to be building a search engine for a website (thousands of pages) than for the entire web.

      Presumably, if you're trying to index the web, you won't be writing your search code in Perl ;-)

      That's not to say you can't scale vector-space search engines. But it makes things complex. There is a lot of handwaving in the article, to try and keep things simple. One example, as you point out, is neglecting to mention practical ways of reducing vector size (number of keywords). That problem goes away if you use latent semantic analysis, which reduces the dimensionality of the vectors by several orders of magnitude. And as a nice side effect, you get a search engine with better recall, that can give you relevant results without a keyword match.

      But LSI has its own scaling issues, and those scaling issues have their own workarounds, and so on and so on. Put it all in an introductory article, and it gets overwhelming.

      Still, I think there's room for novel algorithmic approaches in the 100-100K document range. To give you one example, we run an LSI search engine on a database of old Civil War articles for the University of Virginia. There's no way the search could scale to ten million documents, but it doesn't have to. The collection is bounded.

      I suspect other people have interesting collections in that kind of size range. Not everything has to scale to the billions of documents to be useful.

      --

      A customer service representative will be with me shortly.