Any Interest in a Regexp-Based Web Search Engine?
K-Man asks: "From time to time, I've seen people comment that they would be interested in searching the web with regular expressions, but I've seen very little research in this area. Over many months (as part of a project I call 'grepple'), I've gradually assembled some background on the idea (also some work-in-progress not noted in the link), and the idea seems to be approaching the realm of technical possibility. However, my expertise is not in marketing, so I have no idea whether anybody would use this capability. So I ask, if you could search the web for any regular pattern, including html, partial words or wildcards, long phrases, or anything you might grep out of an html file, would you do it? What types of searches would you do?"
a real time regex engine would perform regexes on condensed byte code of a page rather than the actual page. this is bound to be lossy.
the only way i can see it happening is an associated list of popular searches is entered into the db store, and regularly updated. sadly you're going up in factors, depending on how many expressions you have, so it'd be a huge db pull.
maybe... it's a cute idea. I'm sure something client side would be easier, with the advent of broadband in most homes.
Matt
The idea is that any character sequence in the source can be found in time only proportional to the pattern length, not the data size.
The penalty is a bit of space for indexing, but methods for compressed indexing have been found which use only about 40% of the source text size to hold both the index and the source text.
IMHO, much of the performance problem has already been solved, so the question is really whether people would use a tool if it were developed.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
Yes, I agree that pathological regexp's are easy to create, but limits on match length and counts are easy to impose.
At the technical level, one indexing method I'm currently looking at (the FM index) has a couple of advantages. First, it is incremental, extending a match one character at a time, and allows backtracking etc. to probe different legs of a regexp. It's also very quick at counting hits at each step, making realtime pruning of query results very easy.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger