Slashdot Mirror

Vector Space Search Engines in Perl

Posted by michael on Wednesday February 26, 2003 @04:35AM from the never-thought-linear-algebra-would-come-in-handy dept.

cascadefx writes "There is a great article on writing a Vector Space Search Engine in Perl over at the O'Reilly Network's Perl.com site. Most search engines, such as Google, use an inverted keyword index. The article, complete with code, proves yet again that there is (definitely) more than one way to do it. Try your hand. Add improvements and you may have a handy search engine for your site or... the next Google-killer."

no good for large collections of documents by rpeppe · 2003-02-26 04:57 · Score: 4, Insightful

As far as I can see, each query runs in time proportional to the number of documents, which might be fine for a small collection, but would be abysmal for a large one.
It's a pity the article doesn't make this clearer (instead, it just says "Searches take place in RAM, so there's no disk or database access", which makes it sound as if it would be faster!)
There's a reason people invented indexing schemes!
1. Re:no good for large collections of documents by FamousLongAgo · 2003-02-26 05:09 · Score: 4, Insightful
  
  [disclaimer: I wrote the article]
  
  Since when is linear scaling abysmal? Keep in mind that for very large collections, you can use clustering or other techniques to reduce the number of comparisons. You essentially layer an indexing step on top of things, so that you only search the most promising vectors.
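To make the clustering idea concrete, here is a minimal sketch in Python (not the article's Perl). It assumes documents are sparse term-weight dictionaries and that the grouping into clusters has already been computed by some other means. The names `build_centroids` and `search_clustered` are hypothetical and do not come from the article. The query is first compared against one centroid per cluster, and only the documents in the best-matching clusters are scored.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_centroids(clusters, docs):
    """Average the member vectors of each cluster (hypothetical helper).

    clusters: {cluster_id: [doc_id, ...]}, assumed precomputed and non-empty.
    docs:     {doc_id: {term: weight}}.
    """
    centroids = {}
    for cid, doc_ids in clusters.items():
        total = {}
        for doc_id in doc_ids:
            for term, weight in docs[doc_id].items():
                total[term] = total.get(term, 0.0) + weight
        centroids[cid] = {t: w / len(doc_ids) for t, w in total.items()}
    return centroids

def search_clustered(query, centroids, clusters, docs, top_clusters=1):
    """Score only the documents in the clusters whose centroids best match the query."""
    ranked = sorted(centroids, key=lambda cid: cosine(query, centroids[cid]),
                    reverse=True)
    candidates = [d for cid in ranked[:top_clusters] for d in clusters[cid]]
    scored = [(cosine(query, docs[d]), d) for d in candidates]
    return sorted(scored, reverse=True)

# Toy data: two clusters of two documents each.
docs = {
    "a": {"perl": 1.0, "search": 1.0},
    "b": {"perl": 1.0, "vector": 1.0},
    "c": {"cat": 1.0, "dog": 1.0},
    "d": {"dog": 1.0, "fish": 1.0},
}
clusters = {0: ["a", "b"], 1: ["c", "d"]}
centroids = build_centroids(clusters, docs)
results = search_clustered({"perl": 1.0, "vector": 1.0}, centroids, clusters, docs)
```

With `top_clusters=1` the query is compared against two centroids and two documents instead of all four documents. The trade-off is that a relevant document sitting in a cluster that was not selected is never scored, so this is an approximation rather than an exact search.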
  
  Keep in mind also that vector comparisons are very, very, very fast. People have run vector space search engines on collections in the millions without running into performance issues. And the claim that vector search is faster than a database search is true, for as long as you can fit all the document vectors into RAM.
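As an illustration of why each comparison is cheap, the sketch below (again Python, not the article's Perl) normalizes every document vector once at index time, so that a query-time comparison reduces to a sparse dot product. The function names `normalize`, `dot`, and `linear_search` are hypothetical, and the sparse-dictionary representation is an assumption made for the example. The scan is still linear in the number of documents, as discussed above.

```python
import math

def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length; empty if all-zero."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return {}
    return {t: w / norm for t, w in vec.items()}

def dot(a, b):
    """Sparse dot product, iterating over the shorter vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def linear_search(query, index, threshold=0.0):
    """Compare the query against every unit-length vector held in memory.

    index: {doc_id: unit-length vector}, built once ahead of time.
    Returns (score, doc_id) pairs above the threshold, best first.
    """
    q = normalize(query)
    hits = [(dot(q, vec), doc_id) for doc_id, vec in index.items()]
    return sorted((h for h in hits if h[0] > threshold), reverse=True)

# Toy in-memory index; document vectors are normalized once, up front.
raw_docs = {
    "x": {"perl": 2.0},
    "y": {"perl": 1.0, "python": 1.0},
    "z": {"ruby": 1.0},
}
index = {doc_id: normalize(vec) for doc_id, vec in raw_docs.items()}
hits = linear_search({"perl": 1.0}, index)
```

Because the vectors are unit length, the dot product equals the cosine of the angle between query and document, and documents sharing no terms with the query score zero and are dropped by the threshold.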
  
  --
  
  A customer service representative will be with me shortly.