Any Interest in a Regexp-Based Web Search Engine?
K-Man asks: "From time to time, I've seen people comment that they would be interested in searching the web with regular expressions, but I've seen very little research in this area. Over many months (as part of a project I call 'grepple'), I've gradually assembled some background on the idea (also some work-in-progress not noted in the link), and the idea seems to be approaching the realm of technical possibility. However, my expertise is not in marketing, so I have no idea whether anybody would use this capability. So I ask, if you could search the web for any regular pattern, including html, partial words or wildcards, long phrases, or anything you might grep out of an html file, would you do it? What types of searches would you do?"
([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0 -9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[ 0-9]{1,3})(\]?)
I'd be interested. Probably not interested enough to pay for the service, but still.
But it seems that you'd have a huge performance problem you'd have to work around. Search engines work by indexing the words as-is. Since you can't do that with a regexp search, I can't see any way that you could have a regexp search engine that didn't have to scan every page for every new search.
I'd definitely use it a lot, for searches that Google couldn't handle. Some examples:
- the obvious one is 'stem*' to get all words that begin with a certain string, but sometimes I might want the opposite '*ending' as well
- if I'm unsure of the spelling, 'start?end' could come in handy
- most search-engines are useless for specifying punctuation or capitalization
- I'd like to be able to search for ranges of dates using '18??' or the equivalent
- phrases with gaps or alternate forms ("All your [x] belong to [y]")
My recommendation would be to start with strong-content sites (Project Gutenberg, Wired, etc) and see how computationally expensive it becomes, one step at a time.
You have a point, but I have no mod points at the moment, save the ones I coin myself. Any new ability will invite new abuses (or, at least, new forms of old ones).
-- MarkusQ
P.S. For the regexp challenged, the parent poster was showing how easy it would be to use a rexular expression search engine to harvest e-mail addresses which the Bad Guy could then send spam to.
a real time regex engine would perform regexes on condensed byte code of a page rather than the actual page. this is bound to be lossy.
the only way i can see it happening is an associated list of popular searches is entered into the db store, and regularly updated. sadly you're going up in factors, depending on how many expressions you have, so it'd be a huge db pull.
maybe... it's a cute idea. I'm sure something client side would be easier, with the advent of broadband in most homes.
Matt
While there are some cute tricks you can do with a regexp-based engine on the user's side, cute tricks do not a viable technology make. Along with the obvious computational issues, and the difficulty (though perhaps not impossibility) of a creating a caching scheme, I think there's the problem that most use cases where someone might really want to use your search engine, there are more promising ways to approach the problem other then regexps.
The two ones that come to mind are word stemming approaches and things trying to take advantage of processing that's closer to (though of course not necessarily reaching) natural language processing. Both of those improvements are really useful, and are at least possible to implement, though not easy.
Word stemming approaches eliminate the whole class of "I want every form of kill: kill(|ed|er|ing)" queries; plus you don't want a human to have to enumerate that.
Phrase alternations is already handle by existing syntax: "All your (base OR chili) is belong to (us OR them)." You don't need regexp for that.
Most of the rest of the examples of where a regexp might be useful are almost certainly toys, that sound like a cool hack but won't actually be useful.
Note that a counterexample requires not yet another probably-silly hack, but a plausible usecase where you have an example of something you were really searching for, that a regexp engine might have been able to solve, and that there was no good way of finding currently. In my experience the only searches that I can't do are the ones for things where there isn't a search term I can use that will unique identify what I'm looking for out of a sea of pages related to that term, but not what I'm looking for. One example I recall was looking for how poisonous a philodendren is to a cat; if the info is out there, it's swamped by pages saying simply that it is poisonous, with no indication of how much.
That's an example where a hypothetical search engine with better NLP might have helped me, where I could have asked it for only a page that included "how much" information about the poison level, and not its mere existance.
On the one hand, I'd take this with a grain of salt as I'm just a random Internet yahoo, and you'll always find someone who says "X won't work." You can't let that be a stopper. On the other hand, you might want to mull this over and be sure you are not being overoptimistic about the usefullness of this before committing much resources to it. In particular, I recommend scrutinizing your own usage of real search engines over the next few weeks, and ideally the usage of others, and make sure that you're sure your approach can beat Google in at least some useful domain. Overoptimistic assessments of one's own program is a very real danger of being a programmer and it has scuttled more then one project.
(1) users tend not to type as many regex as you would think
(2) it is too easy to create a query that matches half the words in the index, bogging down to a crawl your search
(3) in all likelihood what you want is a stemmer and something that allows typos, not a full fledged regular expression matcher
(4) the main problem with search engines is that they return too many results, not too few. Regex search capabilities further increase the size of the result set.
(5) let me repeat point (3). Regular expressions are not a natural operation when searching natural language.
You can't scale it. Indexing systems that could be used as a foundation for regexes (CDAWG structures or similar) don't scale to the level of the Web.
If you want to do searching of a small intranet, you might be able to get away with it. You might be able to do globbing, but currently using regexes won't work.
The main regex-related features I suspect people might want are:
* Phrases. Google and almost all other search engines can already do this, with quotes.
* NEAR. foo NEAR bar in the document requests documents where foo occurs "near" bar. This is of somewhat more dubious utility, but there are some searches that it's convenient for.
* Boolean NOT. Google and almost all other search engines can already do this.
May we never see th
The idea is that any character sequence in the source can be found in time only proportional to the pattern length, not the data size.
The penalty is a bit of space for indexing, but methods for compressed indexing have been found which use only about 40% of the source text size to hold both the index and the source text.
IMHO, much of the performance problem has already been solved, so the question is really whether people would use a tool if it were developed.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
an interesting use of this would be on top of the results from say google. Google already seems to give the best results. Now simply use an RE engine on top of that would enable a user to get better results.
I prefer the "u" in honour as it seems to be missing these days.
Since I did that original writeup I've added considerably to what I know about indexing for this sort of thing, and in fact since I submitted this story I've done some work which looks quite encouraging. Rather than post a bunch of replies I'll round up what I can here:
/ab(le|ility)/ can be matched by finding matches for "ab", then "abl", "able", "abi", etc. An index which allows progressive refinement of the pattern, from "a" to "ab" to "ability", is a big help.
The most promising method for supporting this idea is a full-text index, one which allows any byte sequence in the source to be looked up quickly. That way, a regexp like
It's also important to know when no longer matches exist, for instance if "ab" has no hits, then "abi" doesn't need to be checked.
The big leap which makes this seem possible is Ferragina and Manzini's FM index. This method takes the size of a full-text index from somewhere around 10 times the source, to around 40%, including the text as well as the indexing. Their algorithm is described in relation to fixed-length patterns, but it's a trivial extension to handle regexp-generated sets of patterns as well.
In the past few weeks I've been working on an implementation of a similar algorithm with possible performance improvements.
So, the short answer is "yes, it's possible". There are a few hitches here and there, but in comparison to what I knew a year ago, it's much more workable than I would have guessed.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
...already supports this (you most often see it in a free search engine called Webinator). It's the search db behind Dogpile, some (all?) of Ebay, parts of ZDNet, and a whole bunch of other stuff. Not cheap by any stretch but solid.
Check it out: http://www.thunderstone.com/
======================================
Writers get in shape by pumping irony.
Very much interested in this. In fact, I've written letters to Google and Yahoo requesting this but never got much beyond a polite thanks for the suggestion.
... they already support automatic ANDing and will understand OR as well, but grouping makes a big difference. That and I wish Google would allow more than 10 terms ... when you start using OR to describe something (like for my motorcycle it would be :
...
... I considered trying to create a meta-search script using Regex for Google (private use, so hopefully not illegal, but would probably still get banned if they caught on) but found it took too much time for my little machine on a narrowband connection to download each page and re-index it based on that additional regex processing.
Actually, I'd be pretty satisfied if Google supported the advanced boolean search that Altavista has. When Altavista had one of, if not the, best databases I regularly used it. Take a look at:
Altavista Special Search Terms
I find that a combination of wildcards, AND, OR, NEAR, NOT, grouping via parentheses and being able to search specifically for anchors, images, etc meets 99% of my needs. Full regex would be nice, but not that much more useful. Plus, I would imagine regex would be a lot harder on the search server than the simplified advanced syntax.
I -really- wish Google supported matching via parentheses
V65 OR V-65 OR "Honda Magna" OR VF1100C
and I've already eaten up 3 terms
The problem with starting a new search engine is it won't get used, even if it has amazing features, until it has a HUGE number of pages indexed. You might want to target specific subjects at first. Or, depending on the legality, create a meta search engine
The key to success is to index a BUNCH of sites before wide announcement (possibly by using the method mentioned a few days ago of harnessing a distributed processing project to add indexing servers, I'd contribute to something like this) of the project and make sure that you don't limit the effectiveness by limiting the number of terms.
It is more productive to voice thoughtful opinions (reply) than to judge (moderate) others.