Slashdot Mirror


Any Interest in a Regexp-Based Web Search Engine?

K-Man asks: "From time to time, I've seen people comment that they would be interested in searching the web with regular expressions, but I've seen very little research in this area. Over many months (as part of a project I call 'grepple'), I've gradually assembled some background on the idea (also some work-in-progress not noted in the link), and the idea seems to be approaching the realm of technical possibility. However, my expertise is not in marketing, so I have no idea whether anybody would use this capability. So I ask, if you could search the web for any regular pattern, including html, partial words or wildcards, long phrases, or anything you might grep out of an html file, would you do it? What types of searches would you do?"

8 of 51 comments (clear)

  1. too highly factorized by QX-Mat · · Score: 3, Interesting

    a real time regex engine would perform regexes on condensed byte code of a page rather than the actual page. this is bound to be lossy.

    the only way i can see it happening is an associated list of popular searches is entered into the db store, and regularly updated. sadly you're going up in factors, depending on how many expressions you have, so it'd be a huge db pull.

    maybe... it's a cute idea. I'm sure something client side would be easier, with the advent of broadband in most homes.

    Matt

  2. Full text indexing by K-Man · · Score: 3, Interesting

    The idea is that any character sequence in the source can be found in time only proportional to the pattern length, not the data size.

    The penalty is a bit of space for indexing, but methods for compressed indexing have been found which use only about 40% of the source text size to hold both the index and the source text.

    IMHO, much of the performance problem has already been solved, so the question is really whether people would use a tool if it were developed.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  3. Re:Been there, done that, won't work by K-Man · · Score: 3, Interesting

    Yes, I agree that pathological regexp's are easy to create, but limits on match length and counts are easy to impose.

    At the technical level, one indexing method I'm currently looking at (the FM index) has a couple of advantages. First, it is incremental, extending a match one character at a time, and allows backtracking etc. to probe different legs of a regexp. It's also very quick at counting hits at each step, making realtime pruning of query results very easy.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  4. Re:Fallible memory, etc by Johnny+Mnemonic · · Score: 2, Interesting


    I'd be interested, mostly to exclude search hits that were not related to the topic of interest by anything other than an accident of vocabulary.

    For example, if I wanted to search for the use of "Star Wars" in relation to the "Space Defense Initiative" and am not interested in the movie "Star Wars", I would very much like to have a search of "Star Wars !movie". I don't think Google can do this very well, although I haven't tried much either. Another example would be multiple operators, eg +(Apple AND/OR Mac AND/OR MAC) and (job AND/OR position). Most search engines can't seem to handle multiple part substitutions very well.

    --

    --
    $tar -xvf .sig.tar
  5. To the "It won't work" folks by K-Man · · Score: 2, Interesting

    Since I did that original writeup I've added considerably to what I know about indexing for this sort of thing, and in fact since I submitted this story I've done some work which looks quite encouraging. Rather than post a bunch of replies I'll round up what I can here:

    The most promising method for supporting this idea is a full-text index, one which allows any byte sequence in the source to be looked up quickly. That way, a regexp like /ab(le|ility)/ can be matched by finding matches for "ab", then "abl", "able", "abi", etc. An index which allows progressive refinement of the pattern, from "a" to "ab" to "ability", is a big help.

    It's also important to know when no longer matches exist, for instance if "ab" has no hits, then "abi" doesn't need to be checked.

    The big leap which makes this seem possible is Ferragina and Manzini's FM index. This method takes the size of a full-text index from somewhere around 10 times the source, to around 40%, including the text as well as the indexing. Their algorithm is described in relation to fixed-length patterns, but it's a trivial extension to handle regexp-generated sets of patterns as well.

    In the past few weeks I've been working on an implementation of a similar algorithm with possible performance improvements.

    So, the short answer is "yes, it's possible". There are a few hitches here and there, but in comparison to what I knew a year ago, it's much more workable than I would have guessed.

    --
    ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
    1. Re:To the "It won't work" folks by dirtydamo · · Score: 2, Interesting

      Still, there are some theoretical limitations, e.g.:

      This gives a worst-case linear lower bound on the size of an index structure for substring search, which is obviously necessary for "full" regexp power. Of course, I doubt anyone really wants full regexps; the challenge you face is constructing a powerful enough subset that is easy to implement.

      Personally, like other posters have mentioned, I am only really interested in stem searches such as stem*.

  6. Yes! by Jahf · · Score: 2, Interesting

    Very much interested in this. In fact, I've written letters to Google and Yahoo requesting this but never got much beyond a polite thanks for the suggestion.

    Actually, I'd be pretty satisfied if Google supported the advanced boolean search that Altavista has. When Altavista had one of, if not the, best databases I regularly used it. Take a look at:

    Altavista Special Search Terms

    I find that a combination of wildcards, AND, OR, NEAR, NOT, grouping via parentheses and being able to search specifically for anchors, images, etc meets 99% of my needs. Full regex would be nice, but not that much more useful. Plus, I would imagine regex would be a lot harder on the search server than the simplified advanced syntax.

    I -really- wish Google supported matching via parentheses ... they already support automatic ANDing and will understand OR as well, but grouping makes a big difference. That and I wish Google would allow more than 10 terms ... when you start using OR to describe something (like for my motorcycle it would be :

    V65 OR V-65 OR "Honda Magna" OR VF1100C

    and I've already eaten up 3 terms ...

    The problem with starting a new search engine is it won't get used, even if it has amazing features, until it has a HUGE number of pages indexed. You might want to target specific subjects at first. Or, depending on the legality, create a meta search engine ... I considered trying to create a meta-search script using Regex for Google (private use, so hopefully not illegal, but would probably still get banned if they caught on) but found it took too much time for my little machine on a narrowband connection to download each page and re-index it based on that additional regex processing.

    The key to success is to index a BUNCH of sites before wide announcement (possibly by using the method mentioned a few days ago of harnessing a distributed processing project to add indexing servers, I'd contribute to something like this) of the project and make sure that you don't limit the effectiveness by limiting the number of terms.

    --
    It is more productive to voice thoughtful opinions (reply) than to judge (moderate) others.
  7. google api by Anonymous Coward · · Score: 1, Interesting
    I created a google app in perl that did a search for a fixed string, then did a regexp search on the resulting websites. Slower and more limiting than if google did it, but I've got a T1 and a 4-way Xeon :)

    The results are fairly good. It's on sourceforge if anyone wants to use it. They seem to be down right now, or I'd give the url.