Slashdot Mirror


Vector Space Search Engines in Perl

cascadefx writes "There is a great article on writing a Vector Space Search Engine in Perl over at the Orielly Network's Perl.com site. Most search engines, such as Google, use a reverse keyword index algorithm. The article, complete with code, proves yet again that there is (definitely) more than one way to do it. Try your hand. Add improvements and you may have a handy search engine for your site or... the next Google-killer."

4 of 20 comments (clear)

  1. no good for large collections of documents by rpeppe · · Score: 4, Insightful
    As far as I can see, this algorithm runs in time proportional to the number of documents, which might be fine for a small number of documents, but would be abysmal in large environments.

    It's a pity the article doesn't make this clearer (instead, it just says "Searches take place in RAM, so there's no disk or database access", which makes it sound as if it would be faster!)

    There's a reason people invented indexing schemes!

    1. Re:no good for large collections of documents by FamousLongAgo · · Score: 4, Insightful

      [disclaimer: I wrote the article]

      Since when is linear scaling abysmal? Keep in mind that for very large collections, you can use clustering or other techniques to reduce the number of comparisons. You essentially layer an indexing step on top of things, so that you only search the most promising vectors.

      Keep in mind also that vector comparisons are very, very, very fast. People have run vector space search engines on collections in the millions without running into performance issues. And the claim that vector search is faster than a database search is true, for as long as you can fit all the document vectors into RAM.

      --

      A customer service representative will be with me shortly.
    2. Re:no good for large collections of documents by t · · Score: 3, Insightful
      Skimming the article, it seems to be quite similar to Self-Organizing Maps or Vector Quantizers. Care to comment?

      Also, I've worked on SOMs previously and there is no way you can hold all the document vectors in RAM. Think google, when you do a search on 3 million documents. The part that really hurts is the number of possible keywords.

    3. Re:no good for large collections of documents by t · · Score: 2, Insightful
      I hadn't heard of either technique
      Sadly that is often the case. History is riddled with examples of mathematicians duplicating discoveries. That is my motivation for studying these types of techniques, a way to better compare scientific papers, to root out similarities. The history of the wavelet is a good example to discuss. The theory was almost hit upon several times, in slightly different fashions but the researchers involved never knew that others were making similar discoveries.

      As far as SOM stuff, Kohonen is the man.

      There are indeed many things that can be done with SOMs, but to get really good, human-like, results is quite difficult. The problem expands when you need to consider different words for the same thing (e.g., cat/feline), and exclude words that appear identical, but aren't. Then the need to consider context becomes increasingly important (e.g., this article is not about cats). The complexity builds quite quickly.