Any Interest in a Regexp-Based Web Search Engine?

I'm a spammer - I'd use the following query: by Anonymous Coward · 2003-04-27 04:10 · Score: 5, Insightful

([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)

Re:I'm a spammer - I'd use the following query: by Anonymous Coward · 2003-04-27 04:25 · Score: 0

s/4/5/
for .museum addresses...
Re:I'm a spammer - I'd use the following query: by Anonymous Coward · 2003-04-27 04:48 · Score: 0

ahem - the word "museum" has 6 letters, not 5
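
To make the quantifier point concrete, here is a small Python sketch. The pattern is the one from the top of this thread, with only the upper bound on the TLD length turned into a parameter; the sample text and both addresses are invented for illustration.

```python
import re

def email_pattern(max_tld_len):
    """Build the thread's pattern with a configurable upper bound on TLD length."""
    return re.compile(
        r"([a-zA-Z0-9_\-\.]+)@"
        r"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))"
        r"([a-zA-Z]{2,%d}|[0-9]{1,3})(\]?)" % max_tld_len
    )

# Hypothetical sample text; both addresses are made up.
TEXT = "Write to curator@example.museum or info@example.com for details."

def harvest(max_tld_len):
    """Return every substring of TEXT that the pattern matches."""
    return [m.group(0) for m in email_pattern(max_tld_len).finditer(TEXT)]
```

With a bound of 4 or 5 the .museum address is cut short ("curator@example.muse", "curator@example.museu"); only a bound of 6 captures it whole.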
Re:I'm a spammer - I'd use the following query: by Anonymous Coward · 2003-04-27 07:57 · Score: 0

It's a little more complicated than that, if you really want to grab any valid email address. See this perl script, which carefully builds a huge regexp that almost satisfies the rfc822 definition of a valid email address. The uncommented version of the regexp is far more daunting, but I couldn't find a copy of it online. It's the finale of the book Mastering Regular Expressions, and it takes up a whole page. A page of solid regexp makes obfu perl look like well ordered data!
Re:I'm a spammer - I'd use the following query: by larry+bagina · 2003-04-27 13:21 · Score: 1

a spam harvesting robot doesn't care if the email address is rfc822 compliant. It only cares about name@host, which is what the regexp found, and is included in all (all common at least) of the rfc happy email addresses, and (more importantly) is the form used on most webpages.

--
Do you even lift?
These aren't the 'roids you're looking for.

Interest? Sure. by Violet+Null · 2003-04-27 04:14 · Score: 3, Insightful

I'd be interested. Probably not interested enough to pay for the service, but still.

But it seems that you'd have a huge performance problem you'd have to work around. Search engines work by indexing the words as-is. Since you can't do that with a regexp search, I can't see any way that you could have a regexp search engine that didn't have to scan every page for every new search.

Fallible memory, etc by RobotWisdom · 2003-04-27 04:26 · Score: 5, Insightful

I'd definitely use it a lot, for searches that Google couldn't handle. Some examples:

- the obvious one is 'stem*' to get all words that begin with a certain string, but sometimes I might want the opposite '*ending' as well

- if I'm unsure of the spelling, 'start?end' could come in handy

- most search-engines are useless for specifying punctuation or capitalization

- I'd like to be able to search for ranges of dates using '18??' or the equivalent

- phrases with gaps or alternate forms ("All your [x] belong to [y]")
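
A rough Python sketch of how wildcard queries like these could be translated into regular expressions. The semantics are an assumption made for illustration ('*' is any run of word characters, '?' is exactly one), not the behaviour of any real search engine, and the sample sentence is invented.

```python
import re

def wildcard_to_regex(query):
    """Translate a word-level wildcard query into a compiled regex.

    Assumed semantics: '*' matches any run of word characters,
    '?' matches exactly one word character, everything else is literal.
    """
    parts = []
    for ch in query:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    # \b anchors keep the match aligned to whole words.
    return re.compile(r"\b" + "".join(parts) + r"\b")

# Hypothetical sample text.
TEXT = "In 1848 the stemware and the stems arrived; the colour (or color) faded by 1901."

stem_hits = wildcard_to_regex("stem*").findall(TEXT)
year_hits = wildcard_to_regex("18??").findall(TEXT)
spelling_hits = wildcard_to_regex("colo*r").findall(TEXT)
```

Here `stem_hits` picks up both "stemware" and "stems", `year_hits` finds only "1848", and `spelling_hits` catches both spellings of "colour".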

My recommendation would be to start with strong-content sites (Project Gutenberg, Wired, etc) and see how computationally expensive it becomes, one step at a time.

Re:Fallible memory, etc by shooz · 2003-04-27 06:27 · Score: 1

For phrases with gaps, try Google's * operator, such as:

http://www.google.com/search?q=%22all+your+*+are+belong+to+*%22
Re:Fallible memory, etc by ptaff · 2003-04-27 07:44 · Score: 1

- the obvious one is 'stem*' to get all words that begin with a certain string, but sometimes I might want the opposite '*ending' as well

Very useful to look for a file or a set of files.

regex:/href=".*cowboyneal\.jpg"/i
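
One caveat worth sketching in Python: on a page where several links share a line, a greedy `.*` can run from the first href to the last matching file name. The variant below uses `[^"]*` to stay inside one attribute value; that variant and the sample HTML are my own illustration, not part of the query above.

```python
import re

# Hypothetical page fragment with three links on one line.
HTML = ('<a href="pics/CowboyNeal.jpg">photo</a> '
        '<a href="index.html">home</a> '
        '<a HREF="/img/cowboyneal.jpg">x</a>')

# The query as posted: greedy .* may span several attributes.
greedy = re.findall(r'href=".*cowboyneal\.jpg"', HTML, re.IGNORECASE)

# Variant: [^"]* cannot cross a closing quote, so each match is one attribute.
bounded = re.findall(r'href="([^"]*cowboyneal\.jpg)"', HTML, re.IGNORECASE)
```

On this fragment `greedy` holds a single oversized match, while `bounded` holds the two image paths.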
Re:Fallible memory, etc by Johnny+Mnemonic · 2003-04-27 08:38 · Score: 2, Interesting

I'd be interested, mostly to exclude search hits that were not related to the topic of interest by anything other than an accident of vocabulary.

For example, if I wanted to search for the use of "Star Wars" in relation to the "Strategic Defense Initiative" and am not interested in the movie "Star Wars", I would very much like to have a search of "Star Wars !movie". I don't think Google can do this very well, although I haven't tried much either. Another example would be multiple operators, eg +(Apple AND/OR Mac AND/OR MAC) and (job AND/OR position). Most search engines can't seem to handle multiple part substitutions very well.

--

--
$tar -xvf .sig.tar
Re:Fallible memory, etc by Anonymous Coward · 2003-04-27 09:05 · Score: 0

>- phrases with gaps or alternate forms ("All your [x] belong to [y]")

Don't you mean "All your [x] are belong to [y]"?
Re:Fallible memory, etc by larry+bagina · 2003-04-27 13:28 · Score: 1

I don't know if it's been fixed, but Mastering Regular Expressions (first edition) said you should avoid /i like the plague. It works by uppercasing your regex, then making a copy of whatever you're searching, and uppercasing that. For a short string, it's ok, but for large files (and if you search the web, that's a LOT of large files) it is ridiculously slower than case-sensitive.
Manually desensitizing ([Hh][Rr][Ee][Ff]) doesn't have a performance penalty to speak of, so that would have to be done behind the scenes.
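
The "behind the scenes" rewriting could look something like this Python sketch. It only shows the pattern transformation for a literal string; whether it actually beats a case-insensitive flag depends on the regex engine, and the function name is made up for this example.

```python
import re

def desensitize(literal):
    """Expand each cased letter of a literal into a two-case character class.

    Assumes plain ASCII input; non-letters are escaped and kept literal.
    """
    out = []
    for ch in literal:
        if ch.isalpha() and ch.lower() != ch.upper():
            out.append("[%s%s]" % (ch.upper(), ch.lower()))
        else:
            out.append(re.escape(ch))
    return "".join(out)

pattern = desensitize("href")  # "[Hh][Rr][Ee][Ff]"
```

The resulting pattern can then be compiled without any case-insensitivity flag and still matches "HREF", "href" or "hRef".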

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:Fallible memory, etc by Anonymous Coward · 2003-04-27 22:13 · Score: 0

Wow! Learn something new every day!
Re:Fallible memory, etc by khakipuce · 2003-04-27 23:37 · Score: 1

Most of your examples, and the majority of things I would want, would be met by splitting the problem into two parts: 1) do a normal search for the non-regex parts; 2) apply the regex to rank the results (maybe pick the top n sites to limit the subset for the regex search).
So for example, "Fred* Bloggs" would search for all pages with "Bloggs" and then regex for the Fred part.
To do a date search, e.g. find events on your birthday - 29/02/???? - find all pages with "29/02" then find 29/02/????.
Not perfect, but it seems like a good compromise. The real drawback is that it assumes there would always be a non-regex part to the search; however, as I suggested, this is very often the case.
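
A minimal Python sketch of the two-part idea, assuming a toy in-memory corpus. The pages, URLs and function names are invented; a real engine would use its existing index for phase 1. Because this toy index splits on non-word characters, the date example looks up "29" rather than "29/02".

```python
import re
from collections import defaultdict

# Hypothetical toy corpus; URLs and contents are invented.
PAGES = {
    "http://example.org/a": "Frederick Bloggs was born on 29/02/1972.",
    "http://example.org/b": "Fred Bloggs likes tea.",
    "http://example.org/c": "Joe Bloggs visited on 28/02/2000.",
}

def build_index(pages):
    """Ordinary inverted index: lowercase word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index

def two_phase_search(index, pages, keyword, regex):
    """Phase 1: index lookup on the literal part. Phase 2: regex over those candidates only."""
    candidates = index.get(keyword.lower(), set())
    pat = re.compile(regex)
    return sorted(url for url in candidates if pat.search(pages[url]))

INDEX = build_index(PAGES)
fred_hits = two_phase_search(INDEX, PAGES, "Bloggs", r"\bFred\w* Bloggs\b")
date_hits = two_phase_search(INDEX, PAGES, "29", r"\b29/02/\d{4}\b")
```

Only the pages that survive the cheap index lookup are ever scanned with the regex, which is where the saving comes from.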

--
Art is the mathematics of emotion

+1 Funny/insightful on the MQR standard by MarkusQ · 2003-04-27 04:41 · Score: 5, Informative

You have a point, but I have no mod points at the moment, save the ones I coin myself. Any new ability will invite new abuses (or, at least, new forms of old ones).

-- MarkusQ

P.S. For the regexp challenged, the parent poster was showing how easy it would be to use a regular expression search engine to harvest e-mail addresses which the Bad Guy could then send spam to.

Re:+1 Funny/insightful on the MQR standard by Lord+Omlette · 2003-04-27 09:08 · Score: 1

P.S. For the regexp challenged, the parent poster was showing how easy it would be to use a regular expression search engine to harvest e-mail addresses which the Bad Guy could then send spam to.
I would imagine that this is proof that such a search engine already exists...

--
[o]_O
Re:+1 Funny/insightful on the MQR standard by SN74S181 · 2003-04-27 09:24 · Score: 1

Elcomsoft, the firm where one of our little poster boys works, produces commercially available search tools specifically for harvesting email addresses out of webspace. I say 'commercially available' because if people like them didn't produce said tools, only net savvy geeks who understand a little about netiquette would have said abilities.

Thanks Dmitry. And keep defending folks like him, EFF.
Re:+1 Funny/insightful on the MQR standard by WindBourne · 2003-04-27 09:35 · Score: 1

There will always be abuses of spam. If the spammers do not buy their addresses, they will simply find them via a robot (just about as easy to set-up as a search engine search).
I would worry less about finding addresses and more about how the e-mail is being sent. The new virus hitting MS XP, W2K, and NT systems is far more of a problem than this is. If ppl would shut down the few open Unix boxes and the many Exchange servers, then the spammers would be in trouble.

--
I prefer the "u" in honour as it seems to be missing these days.
Re:+1 Funny/insightful on the MQR standard by larry+bagina · 2003-04-27 13:13 · Score: 1

"DEG DED {DE}F ED CBCA..." (George Gershwin)
OK I give up. What is it?

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:+1 Funny/insightful on the MQR standard by MarkusQ · 2003-04-28 04:02 · Score: 1

Rhapsody in blue (try it on a piano some time).
-- MarkusQ

too highly factorized by QX-Mat · 2003-04-27 04:53 · Score: 3, Interesting

a real time regex engine would perform regexes on condensed byte code of a page rather than the actual page. this is bound to be lossy.

the only way i can see it happening is an associated list of popular searches is entered into the db store, and regularly updated. sadly you're going up in factors, depending on how many expressions you have, so it'd be a huge db pull.

maybe... it's a cute idea. I'm sure something client side would be easier, with the advent of broadband in most homes.

Matt

Probably not... by Jerf · 2003-04-27 05:02 · Score: 5, Insightful

While there are some cute tricks you can do with a regexp-based engine on the user's side, cute tricks do not a viable technology make. Along with the obvious computational issues, and the difficulty (though perhaps not impossibility) of creating a caching scheme, I think there's the problem that for most use cases where someone might really want to use your search engine, there are more promising ways to approach the problem than regexps.

The two ones that come to mind are word stemming approaches and things trying to take advantage of processing that's closer to (though of course not necessarily reaching) natural language processing. Both of those improvements are really useful, and are at least possible to implement, though not easy.

Word stemming approaches eliminate the whole class of "I want every form of kill: kill(|ed|er|ing)" queries; plus you don't want a human to have to enumerate that.

Phrase alternation is already handled by existing syntax: "All your (base OR chili) is belong to (us OR them)." You don't need regexp for that.

Most of the rest of the examples of where a regexp might be useful are almost certainly toys, that sound like a cool hack but won't actually be useful.

Note that a counterexample requires not yet another probably-silly hack, but a plausible use case where you have an example of something you were really searching for, that a regexp engine might have been able to solve, and that there was no good way of finding currently. In my experience the only searches that I can't do are the ones for things where there isn't a search term I can use that will uniquely identify what I'm looking for out of a sea of pages related to that term, but not what I'm looking for. One example I recall was looking for how poisonous a philodendron is to a cat; if the info is out there, it's swamped by pages saying simply that it is poisonous, with no indication of how much.

That's an example where a hypothetical search engine with better NLP might have helped me, where I could have asked it for only a page that included "how much" information about the poison level, and not its mere existence.

On the one hand, I'd take this with a grain of salt as I'm just a random Internet yahoo, and you'll always find someone who says "X won't work." You can't let that be a stopper. On the other hand, you might want to mull this over and be sure you are not being overoptimistic about the usefulness of this before committing many resources to it. In particular, I recommend scrutinizing your own usage of real search engines over the next few weeks, and ideally the usage of others, and make sure that you're sure your approach can beat Google in at least some useful domain. Overoptimistic assessment of one's own program is a very real danger of being a programmer and it has scuttled more than one project.

My overall view on the subject.. by Lord+Bitman · 2003-04-27 05:13 · Score: 1

Anything that lets you search should support Regexps
Anything that displays data should allow you to search
Basically, absolutely everything should support regexp search, even if it doesnt make any fucking sense to do so.

Problem: the ways regular expressions work aren't anywhere near standard from program to program. Even a minor syntax change like "In this one, you need to put a slash before parens in order to make the parens special" vs "in this one, you need to put a slash before parens or else it is treated special" completely blows the whole thing

But no need for you to do anything, just tell google to start supporting regexps, get a site up which petitions them or something. At least near-matches, jeeze.. "Boob" should get all results for "Boobs", "Boobies", "Boobzilla", etc.. damnit!

If you disagree with the above statements for any reason whatsoever, you are completely wrong.

--
-- 'The' Lord and Master Bitman On High, Master Of All

Been there, done that, won't work by Alomex · 2003-04-27 05:14 · Score: 4, Insightful

(1) users tend not to type as many regex as you would think

(2) it is too easy to create a query that matches half the words in the index, bogging your search down to a crawl

(3) in all likelihood what you want is a stemmer and something that allows typos, not a full fledged regular expression matcher

(4) the main problem with search engines is that they return too many results, not too few. Regex search capabilities further increase the size of the result set.

(5) let me repeat point (3). Regular expressions are not a natural operation when searching natural language.

Re:Been there, done that, won't work by K-Man · 2003-04-27 08:31 · Score: 3, Interesting

Yes, I agree that pathological regexp's are easy to create, but limits on match length and counts are easy to impose.

At the technical level, one indexing method I'm currently looking at (the FM index) has a couple of advantages. First, it is incremental, extending a match one character at a time, and allows backtracking etc. to probe different legs of a regexp. It's also very quick at counting hits at each step, making realtime pruning of query results very easy.

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

hmmm by hswerdfe · 2003-04-27 05:43 · Score: 1

ok firstly you posted a link to "the other site."...I thought that wasn't allowed...=)

any way ....

I think it would be very cool and very useful, if it could be done without scaling problems. I am not an expert on REs, but I've always been told that they are slower than indexed lookups, and don't scale to massive quantities of info.
and if it can't be done without scaling problems...it could be done for a subset of the net. like find all matching entries of Regex1 within all URLs matching Regex2.
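
The "Regex1 within URLs matching Regex2" idea, sketched in Python over an invented set of pages. The URLs, page bodies and the `scoped_search` name are all hypothetical; the point is only that the URL pattern cuts the corpus down before the content pattern runs.

```python
import re

# Hypothetical crawl results; URLs and page bodies are invented.
PAGES = {
    "http://docs.example.org/api/strings.html":
        "int str_find(const char *s); int str_len(const char *s);",
    "http://docs.example.org/about.html": "We document str_find and friends.",
    "http://blog.example.net/post1.html": "int str_find(void) is my favourite.",
}

def scoped_search(pages, url_regex, content_regex):
    """Apply content_regex only to pages whose URL matches url_regex."""
    url_pat = re.compile(url_regex)
    content_pat = re.compile(content_regex)
    hits = {}
    for url, body in pages.items():
        if not url_pat.search(url):
            continue  # outside the requested subset of the net
        found = content_pat.findall(body)
        if found:
            hits[url] = found
    return hits

# Function names starting with str_, but only under the API docs.
hits = scoped_search(PAGES, r"^http://docs\.example\.org/api/", r"\bstr_\w+(?=\()")
```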

I find they tend to be useful when looking for something you kinda remember what it is but not quite...
URL's - addresses - a parameter or function name you can't quite remember ....
one application I might use it for would be to find examples of code....or of files that match a particular structure. ...
the web is all about files having almost no structure. an advantage of a RE is that I might be able to build a filter for a particular file format first and work with it from there

if I had a RE web engine I could find all files in say the SHEF file format with measurements from a specific station, or group of stations.

but to be honest, if it works and it scales...People will find applications for it, even if they aren't obvious now....

[Aside]
does anybody know why the standard [ctrl]+F function in most web browsers sucks so much? why can't I search for a regex in my open web page...
that as of right now is my biggest gripe about Phoenix: its shitty find box. anybody have a better plugin?
[/Aside]

--
--meh--

It won't work by 0x0d0a · 2003-04-27 06:29 · Score: 2, Informative

You can't scale it. Indexing systems that could be used as a foundation for regexes (CDAWG structures or similar) don't scale to the level of the Web.

If you want to do searching of a small intranet, you might be able to get away with it. You might be able to do globbing, but currently using regexes won't work.

The main regex-related features I suspect people might want are:

* Phrases. Google and almost all other search engines can already do this, with quotes.

* NEAR. foo NEAR bar in the document requests documents where foo occurs "near" bar. This is of somewhat more dubious utility, but there are some searches that it's convenient for.

* Boolean NOT. Google and almost all other search engines can already do this.
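
For what it's worth, NEAR is simple to express once a document is tokenized. A minimal Python sketch follows; the default window of 5 tokens is an arbitrary assumption, not the distance any particular engine uses.

```python
import re

def near(text, a, b, window=5):
    """True if words a and b occur within `window` tokens of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    # Compare every occurrence of a against every occurrence of b.
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

SENTENCE = "the quick brown fox jumps over the lazy dog"
close = near(SENTENCE, "quick", "fox", window=2)
far = near(SENTENCE, "quick", "dog", window=2)
```

An engine that stores token positions in its index could answer this without rescanning the page, which is why NEAR fits ordinary indexing better than general regexes do.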

--
May we never see th

Re:hmmm - Bookmarklets! by ptaff · 2003-04-27 07:29 · Score: 1

You can regex in a page with a bookmarklet, works usually with any javascript-enabled browser.

One of them is here, spawn your favorite search engine and look for bookmarklets, there are plenty.

Bookmarklets and smart bookmarks (not available in IE) can make magic and turn your browser into a very powerful process ;)

Re:hmmm - Bookmarklets! by hswerdfe · 2003-04-27 07:45 · Score: 0

Sweet FA Man.

Thanks, this is all good and is almost exactly what I want.

--
--meh--

I'd buy that for a dollar... by i0wnzj005uck4 · 2003-04-27 08:11 · Score: 1

Well, I can't honestly say I'd pay for such a service, but even being able to do simple regex stuff like "There.*gun and gunshot.*who shot who" would be nice. I find that most of my regexp searches, even in grep, are just looking for parts of a sentence or code block using .* .

However, the above comment on how most people wouldn't be using regex in your engine is a valid one. You'd prolly want to pass off non-regex searches to a more suited engine (ie google), while handling the real searches yourself. Also, the idea of starting small -- like indexing the library of congress, gutenberg, about.com -- may be best. Then you'll have a good idea of the load of searching them alone. Kind of a test run.

The spam thing above is interesting too. I can't admit to totally understanding it (I have to write each regex I do looking in a book; they're just not second nature to me), but what would happen if someone wrote one that takes both normal e-mail addresses and spamfiltered ones (dookie at 3drealms dot org)? Google is pretty good on a lot of things about not censoring content (china's issues aside -- I said pretty good, not perfect), so if you were going to make a search engine like this you'd have to be okay with the level of power you'll be giving to both good citizens and the rest.

--
- Cloud

Re:I'd buy that for a dollar... by K-Man · 2003-04-27 09:15 · Score: 1

I agree that the query mix would probably be 99% keywords, and 1% regexp, but sometimes that 1% makes all the difference in usefulness. I think the keyword stuff would work fine also (each keyword qualifies as a regexp, just a really boring one); it would just be an added bonus to be able to do regexp's or any character sequence.

The spam thing is valid, although it's already hazardous to post an email on the web. Try typing your phone number into Google - that's another surprise that most people aren't aware of.

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Full text indexing by K-Man · 2003-04-27 08:15 · Score: 3, Interesting

The idea is that any character sequence in the source can be found in time only proportional to the pattern length, not the data size.

The penalty is a bit of space for indexing, but methods for compressed indexing have been found which use only about 40% of the source text size to hold both the index and the source text.
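
As a toy illustration of full-text lookup, here is a plain suffix array in Python. Note the assumptions: this is the simplest structure of the family, not a compressed index, and its lookup costs O(m log n) character comparisons for a pattern of length m rather than time proportional to m alone. The construction is deliberately naive.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    """Return sorted start offsets of pattern, using two binary searches over sa."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

TEXT = "abracadabra"
SA = build_suffix_array(TEXT)
offsets = find_all(TEXT, SA, "abra")
```

The number of comparisons grows with the logarithm of the text size, so the pattern length dominates the cost once the index is built.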

IMHO, much of the performance problem has already been solved, so the question is really whether people would use a tool if it were developed.

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Good example by K-Man · 2003-04-27 08:45 · Score: 1

Although you could also search for your own email and find any web page that contains it.

There are a lot of html tag fragments (eg img tags with just part of the src url, href's to a given domain or subsection of a website, etc.) that might be handy to find, at least for technical people.

I share most people's skepticism about regexp's ever becoming mainstream, but it might be a good foundation for value-added services, like finding web pages by color, font, number of images, etc.

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Re:Good example by pyite · 2003-04-27 10:51 · Score: 1

Most of the population is too stupid to type words correctly in their native tongue, so yes, regular expressions are way too complicated for the average idiot to handle. However, it'd be a nice feature of the Google API or something.

--
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

Possibly... by WindBourne · 2003-04-27 09:37 · Score: 2, Informative

an interesting use of this would be on top of the results from say google. Google already seems to give the best results. Simply running an RE engine on top of that would enable a user to get better results.

--
I prefer the "u" in honour as it seems to be missing these days.

Would work on binaries too by K-Man · 2003-04-27 09:50 · Score: 1

It should be fine for SHEF or any file format. Once one stops expecting bytes to form words, many file types become indexable.

URL's are a good example of difficult-to-parse search targets. At one time I was looking at parsing urls into components and searching those, but even then it was too hard to search with just a fragment.

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Google has good ranking by K-Man · 2003-04-27 10:00 · Score: 1

Ranking is a separate issue from selection, or gathering raw hits for a query.

Google doesn't have a mind-bendingly better selection system (it's a lot like any other search engine), but their ranking is, of course, their main advantage.

The issue for a search engine like google would be to cut over from a keyword-based inverted index to something a bit more flexible, while maintaining continuity with the current system.

I think it's possible. We have the technology...(cut to six million dollar man intro).

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Tip: How to exclude a word in Google by yerricde · 2003-04-27 10:05 · Score: 1

if I wanted to search for the use of "Star Wars" in relation to the "Strategic Defense Initiative" and am not interested in the movie "Star Wars", I would very much like to have a search of "Star Wars !movie". I don't think Google can do this very well

Google has an exclude function ("star wars" -movie) but because it isn't artificially intelligent, it doesn't exclude movie merchandise such as action figures, computer games, card games, etc. Better: "star wars" -movie -lucas -lucasfilm -trilogy whose tenth result refers to missile defense. But had you known you were looking for something about missile defense or missile defence, you would have typed in "star wars" missile.

--
Will I retire or break 10K?

NEAR in Google by yerricde · 2003-04-27 10:10 · Score: 1

NEAR. foo NEAR bar in the document requests documents where foo occurs "near" bar. This is of somewhat more dubious utility, but there are some searches that it's convenient for.

Google already does this to an extent, using NEARness of your search terms as one of the terms in the ranking equation.

--
Will I retire or break 10K?

To the "It won't work" folks by K-Man · 2003-04-27 10:36 · Score: 2, Interesting

Since I did that original writeup I've added considerably to what I know about indexing for this sort of thing, and in fact since I submitted this story I've done some work which looks quite encouraging. Rather than post a bunch of replies I'll round up what I can here:

The most promising method for supporting this idea is a full-text index, one which allows any byte sequence in the source to be looked up quickly. That way, a regexp like /ab(le|ility)/ can be matched by finding matches for "ab", then "abl", "able", "abi", etc. An index which allows progressive refinement of the pattern, from "a" to "ab" to "ability", is a big help.

It's also important to know when no longer matches exist, for instance if "ab" has no hits, then "abi" doesn't need to be checked.
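
A small Python sketch of that refinement-and-pruning loop. Two simplifying assumptions: the regexp /ab(le|ility)/ is expanded by hand into its two literal alternatives, and a linear scan stands in for the index lookup, so only the control flow is representative.

```python
def match_alternatives(text, literals):
    """Extend each literal one character at a time, stopping at the first empty prefix.

    Returns (hit count per literal, sorted list of prefixes actually probed).
    """
    cache = {}  # prefix -> hit count, shared across alternatives

    def count(prefix):
        # Stand-in for an index lookup; a real full-text index would not scan.
        if prefix not in cache:
            cache[prefix] = sum(
                1 for i in range(len(text)) if text.startswith(prefix, i)
            )
        return cache[prefix]

    results = {}
    for lit in literals:
        hits = 0
        for end in range(1, len(lit) + 1):
            hits = count(lit[:end])
            if hits == 0:
                break  # no longer extension can match, so prune here
        results[lit] = hits
    return results, sorted(cache)

# Hypothetical sample text.
TEXT = "the able seaman was able to stay stable"
results, probed = match_alternatives(TEXT, ["able", "ability"])
```

On this text "abi" already has no hits, so "abil" and everything longer is never looked up, and the shared prefixes "a" and "ab" are probed only once.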

The big leap which makes this seem possible is Ferragina and Manzini's FM index. This method takes the size of a full-text index from somewhere around 10 times the source, to around 40%, including the text as well as the indexing. Their algorithm is described in relation to fixed-length patterns, but it's a trivial extension to handle regexp-generated sets of patterns as well.
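
For readers who have not seen it, the core counting step (backward search over a Burrows-Wheeler transform) fits in a few lines of Python. This is a sketch of the textbook, uncompressed version, not Ferragina and Manzini's implementation: the occurrence table here is stored in full, which throws away exactly the space savings that make the real thing interesting.

```python
def bwt_from_text(text):
    """Burrows-Wheeler transform via naive sorted rotations; '$' marks the end."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

class FMIndex:
    """Toy FM-style index supporting occurrence counts only."""

    def __init__(self, text):
        assert "$" not in text, "'$' is reserved as the end marker"
        self.bwt = bwt_from_text(text)
        # C[c] = number of characters in the text strictly smaller than c.
        self.C = {}
        total = 0
        for c in sorted(set(self.bwt)):
            self.C[c] = total
            total += self.bwt.count(c)
        # occ[c][i] = occurrences of c in bwt[:i] (uncompressed, toy only).
        self.occ = {c: [0] for c in self.C}
        for ch in self.bwt:
            for c in self.occ:
                self.occ[c].append(self.occ[c][-1] + (1 if ch == c else 0))

    def count(self, pattern):
        """Process the pattern right to left, narrowing a range of sorted rows."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0  # range is empty: no match, stop early
        return hi - lo

fm = FMIndex("abracadabra")
abra_count = fm.count("abra")
```

Each pattern character costs two table lookups regardless of text size, and the early return on an empty range is what allows a set of regexp-generated literals to be pruned cheaply.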

In the past few weeks I've been working on an implementation of a similar algorithm with possible performance improvements.

So, the short answer is "yes, it's possible". There are a few hitches here and there, but in comparison to what I knew a year ago, it's much more workable than I would have guessed.

--
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger

Re:To the "It won't work" folks by dirtydamo · 2003-04-27 14:36 · Score: 2, Interesting

Still, there are some theoretical limitations, e.g.:

This gives a worst-case linear lower bound on the size of an index structure for substring search, which is obviously necessary for "full" regexp power. Of course, I doubt anyone really wants full regexps; the challenge you face is constructing a powerful enough subset that is easy to implement.

Personally, like other posters have mentioned, I am only really interested in stem searches such as stem*.

Why not start from Google by Sepper · 2003-04-27 11:07 · Score: 1

Why not take the Google API and write a regex engine to search the results of a string search....

Or a simple perl script that searches the results given back by the web site?

--
I live in Soviet Canuckistan you insensitive clod!

Re:Why not start from Google by Ciaran_H · 2003-04-29 03:39 · Score: 1

There are a few problems with that...

Firstly the google API will only return 10 results at a time, IIRC, meaning that it wouldn't be possible to meaningfully rank the sites unless you entered a loop to get all the search results from Google - and there could be lots.

Secondly, it means that you need to be able to search Google first before you can pass a regex filter over it - and what string would you use to search Google with?

Even if you could get Google to return likely pages for the regex, you'd still need to retrieve each result's page in real-time and search the page, as Google doesn't give you the full page as part of its results.

It's a nice idea, and it would be great if it was practical. Unfortunately, for the reasons above, it isn't. :(

Soundex or even better Daitch-Mokotoff by Bazouel · 2003-04-27 11:41 · Score: 1

That is what I would like to have !

Daitch-Mokotoff is able to handle many languages compared to the almost English exclusive Soundex, so I would rather use this algorithm.

And I don't think it would hinder performance that much since you can cache results just as you can with normal queries.
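
For reference, a Python sketch of plain American Soundex, the simpler of the two algorithms. Daitch-Mokotoff needs a much longer rule table, so it is not attempted here; this is only the baseline being compared against, and it assumes ASCII names.

```python
def soundex(name):
    """American Soundex code (one letter plus three digits) for an ASCII name."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so equal codes across a vowel both count
    return (result + "000")[:4]

same_code = soundex("Robert") == soundex("Rupert")
```

Because each name reduces to a short fixed-length key, the codes can be indexed and cached like ordinary keywords, which fits the caching argument above.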

--
Intelligence shared is intelligence squared.

intelligent searching by zogger · 2003-04-27 11:43 · Score: 1

--I'm fairly decent on using google now, eliminating keywords, limiting it to domains, using proper keywords, etc, but tell ya WHAT would work better. I'd like the ability to ask a normal question, where every word had meaning, the sentence structure had meaning, all of the above, to the search criteria. Just like you talk, exactly like that. Including prepositions, that one non available feature makes a difference in searching, if they could be included it would be great. Now sometimes I can get lucky, if I really think hard on my data request, make the sentence simple, and use quotes around the whole thing, sometimes I get a bingo right on the first page, but usually I get bupkis with that. Search engines now, as good as they have gotten, still force you into being psychic a lot of times.

It would also be nice to be able to use wild cards better, almost like how web pages get scripted with the "if this...then that" commands.

Another would be to be able to sequence your searches, give a series of complex commands, then hit go, so it parses to a gross result based on your priority of the request by what was requested first, then refines down your steps based on the same sort of odds they use now. A variant on this would assemble several pages of hits that taken as a combined total is your entire result. Say you are researching a complex topic, you know that there are several widely divergent things you need, then you have to assemble them in one place to have all the data you need for your query answer. Now you have to do that one step at a time, manually, open another page or tab, re-enter a completely different set of criteria, mash search, blah blah, tedious, unmanageable a lot of times. Be nicer to have it "right there". And you know, it wouldn't matter if it took some time, actually it would be perfectly OK to have it queued and run through a more intelligent set of algorithms based on how many features and how detailed your search needs to be. It could show up "later" as a retrievable service perhaps. Spend 15 minutes figuring out all the searches you need and entering your complex set of parameters, send it off. An hour later (or whatever it's timed for) it shows up, downloads in the background. Something like that. A human can do this now, that's what paid searching services do, but it's tedious and long and expensive. I could deal with a few bad hits, it's the huge number of bad hits that are the bear to deal with all the time, usually you have to enter simpler terms to just get started, so you can then re-enter an advanced search after you find out what the non-essential terms are that show up in the first pages if you haven't gotten extremely lucky.

Next I want a talking computer to do that, and have my venusian slave girl secretary do the asking for me, but I know that will most likely cost *extra* and require at least another stick of ram or two......

Thunderstone Texis... by PDHoss · 2003-04-27 12:36 · Score: 2, Informative

...already supports this (you most often see it in a free search engine called Webinator). It's the search db behind Dogpile, some (all?) of Ebay, parts of ZDNet, and a whole bunch of other stuff. Not cheap by any stretch but solid.

Check it out: http://www.thunderstone.com/

--
======================================
Writers get in shape by pumping irony.

Yes! by Jahf · 2003-04-27 13:28 · Score: 2, Interesting

Very much interested in this. In fact, I've written letters to Google and Yahoo requesting this but never got much beyond a polite thanks for the suggestion.

Actually, I'd be pretty satisfied if Google supported the advanced boolean search that Altavista has. When Altavista had one of, if not the, best databases I regularly used it. Take a look at:

Altavista Special Search Terms

I find that a combination of wildcards, AND, OR, NEAR, NOT, grouping via parentheses and being able to search specifically for anchors, images, etc meets 99% of my needs. Full regex would be nice, but not that much more useful. Plus, I would imagine regex would be a lot harder on the search server than the simplified advanced syntax.

I -really- wish Google supported matching via parentheses ... they already support automatic ANDing and will understand OR as well, but grouping makes a big difference. That and I wish Google would allow more than 10 terms ... when you start using OR to describe something (like for my motorcycle it would be :

V65 OR V-65 OR "Honda Magna" OR VF1100C

and I've already eaten up 3 terms ...

The problem with starting a new search engine is it won't get used, even if it has amazing features, until it has a HUGE number of pages indexed. You might want to target specific subjects at first. Or, depending on the legality, create a meta search engine ... I considered trying to create a meta-search script using Regex for Google (private use, so hopefully not illegal, but would probably still get banned if they caught on) but found it took too much time for my little machine on a narrowband connection to download each page and re-index it based on that additional regex processing.

The key to success is to index a BUNCH of sites before wide announcement (possibly by using the method mentioned a few days ago of harnessing a distributed processing project to add indexing servers, I'd contribute to something like this) of the project and make sure that you don't limit the effectiveness by limiting the number of terms.

--
It is more productive to voice thoughtful opinions (reply) than to judge (moderate) others.

google api by Anonymous Coward · 2003-04-27 13:33 · Score: 1, Interesting

I created a google app in perl that did a search for a fixed string, then did a regexp search on the resulting websites. Slower and more limiting than if google did it, but I've got a T1 and a 4-way Xeon :)

The results are fairly good. It's on sourceforge if anyone wants to use it. They seem to be down right now, or I'd give the url.

s/? by thejackol · 2003-04-27 17:41 · Score: 1

s/xxx//g Next day on /. "Bug in new regexp-search engine wipes out all pr0n"

So you found the whole Internet ... by jolshefsky · 2003-04-27 23:19 · Score: 1

I'm not sure I understand the point--this search would probably return about 80% of all web pages because they contain an e-mail address like string. I think spammers will still just crawl pages and use that form of regex to collect addresses, not to search for pages that contain those addresses.

The original post is sounding more and more like FUD to me.

--
--- Jason Olshefsky

Karma: Poser (mostly affected by adding this line long after everyone else did)

Metasearch by Anonymous Coward · 2003-04-27 23:39 · Score: 0

I would find it useful to be able to apply a regexp search to the output of an existing search engine. For example, I live in a city called "Reading": this produces loads of false positives on Google when I do a search on "reading cinema" as Google isn't case sensitive. I'd like to be able to do the search on "reading cinema" with a nominated search engine, then search within the first-level links returned by the search engine using a regexp.

Dividing into search and meta-search would also allow me to use a private search engine, such as the one for our intranet.

Slashdot Mirror

51 comments