Vector Space Search Engines in Perl
cascadefx writes "There is a great article on writing a Vector Space Search Engine in Perl over at the Orielly Network's Perl.com site. Most search engines, such as Google, use a reverse keyword index algorithm. The article, complete with code, proves yet again that there is (definitely) more than one way to do it. Try your hand. Add improvements and you may have a handy search engine for your site or... the next Google-killer."
... Quantum(tm) Computers become available! They can search through Hilbert space in polytime! Beat that with your puny Turing machines, ha!
It's a pity the article doesn't make this clearer (instead, it just says "Searches take place in RAM, so there's no disk or database access", which makes it sound as if it would be faster!)
There's a reason people invented indexing schemes!
if you have visited www.kartoo.com you would notice that it is based on the same vector concept when searching for data, ofcourse it has the 3D gui that makes it easier to pick what you are looking for, personally i cant wait for google to implement and provide an option of being able to search the same way as kartoo.com does, in some incidents when i was cross searching for information, kartoos interface sped things up, so three cheers for vector based search engines ...
(vector based mapping is used in voice recognition technologies... many 2.5G and 3G small companies that are attempting to provide the next must have mobile voice services are using vector based mapping for recognition, but ofcourse mapping doesnt do the whole job by itself ;)
..... Ive quit sex, drugs and rocknroll but not lieing
can it find the wavefunction in problem 12.3 in Merzbacher? Goddamn Hilbert spaces!
I should hang out in the developers area of Slash more often. Sure, there were only 15 posts, but they were all good. Even the disagreements were interesting. Reminds me of the old Chips and Dips.
For all those spanish-readers interested, here is a very interesting article about VSM theory and interesting addings to search algorithms.
I thought of something similar to this when trying to come up with a good spellchecking algorithm back in high school. To provide a simple example, we can start with trying to spellcheck binary strings. We can start off naively, representing each "word" as a point in a two-dimensional vector space over the field Naturals (N x N). Each word occupies a point in the space, calculated by summing the 1s and 0s in the word. "101000110" would occupy point (5, 4) - five 0s, four 1s. So given a misspelled word (let's say "101001110").. we first calculate that word's position in the vector space (in this case (6, 4)), and do a geograhic search for neighboring words.
Just taking the sums of the different letters in the word for different domains keeps closely-spelled words together, but it also maps other, highly less likely words to the same (or close) point in the vector space. For example: "101111001000" and "010000110111" would map to the same point.. although it's unlikely that one would be trying to type one and accidentally type the other (unless the typer was dyslexic or something.. in which case the spellchecker doesn't really have to care).
So we need something that gives a bit more information than simply positioning the word on a 2-dimensional vector space. So we can notice that often, similar words have similar sequences of letters, as well as a similar sum total of letters. I.e., the number of times a particular number "0" or "1" occurs after either a "0" or "1". For example in "10011101" and "10011001". The number of "0"s followed by "0"s is 1 in the first, 2 in the second. The number of "0s" followed by "1s" is 2 in both. The number of "1s" followed by "0"s is 2 in both, and the number of "1"s followed by "1"s is 2 in the first, 1 in the second. So the words would map to points (1, 2, 2, 2) and (2, 2, 2, 1) respectively, in the 4-dimenstional vector space over the Naturals (N x N x N x N). The distance between these two points, is sqrt(2).. which is pretty close.
We can combine the first approach and the second, and for each misspelled word, do a search for other words in both the 2D space and the 4D space, and allow all words within a certain distance threshold that are present in both spaces near the misspelled word. Potentially you could extend the spaces out so that you consider 3 consecutive letters (an 8D space for binary strings).. but as the dimensions get higher, you have to put more work into optimizing the searches. If we move from binary strings to actual english words, it's not a 2D and 4D space anymore.. but a 26D and 676D space. But it's possible to come up with some nice sparse data structures that let you make searches reasonably efficient in both space and time.
I'm not exactly sure why I'm posting this.. but I blazed last night and the shit was good.. but my coat got stolen.. and I've got cobwebs this morning.. and I tend to think of CS things a lot when I have cobwebs in my mind after a night of battling the green ninja.
-Laxitive