Next Generation Regexp
prostoalex writes "Jeffrey E. F. Friedl, author of newly published 2nd edition of Mastering Regular Expressions, wrote a feature article for O'Reilly Network on the recent innovations in the regular expression world. You'd think that such area as regular expressions would be fairly stable, but according to the author, 'when I started to work on the second edition of Mastering Regular Expressions and started refocusing on the field, I was rather shocked to find out how much had really changed'. The article's behind-the-scene purpose is apparently to push a new book that O'Reilly published this month, but it has great educational value for anyone involved with practical extracting and reporting."
I particularly like this bit:
Nice to see that things haven't changed muchAmazon has slipped the shipping date twice. I don't know about you, but this book is definitly a "Must Have".
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
Perl6 is going to radically change regular expressions as well. I guess the term "regular expression" is pretty vague/useless these days. You have to identify the language _and_ its revision to get an accurate idea of the regexp feature set you're dealing with. Just throw some variables and control structures into regexp and we'll have a full-blown extremely cryptic language. Maybe we need a RegExp Institute of Excellence with yearly meetings in Sweden or something.
Other than to tell us what is different between the two books. After reading the article I walked away with no general knowledge that was useful in using regular expresions, or what might be coming, or where we came from.
It is a slightly wordy advertisment for why you should upgrade. The fact that it was foisted on us as something else annoys me, as I spent time reading it.
I know, a slashdot reader that actually reads linked stories is such a minority, but come on, quite stuffing articles with advertising. Aren't the ads in the middle of a page enough?
He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 and Synopsis 5.
That is one of the most contentless articles I have seen in a long time.
A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.
The smartarses will now proceed to point out that
a) Perl is actually limited type 2
b) Some change noone knows or cares about was made to some definition of the Chomsky hierarchy in ninteen dumdy-dum.
Foo.
"Think about it: when you are looking for wares or porn, where do you go? Perl? Nope. IRC. Why? Because of the human element ... but for serious data prowling, you need something with a brain and a heart."
A heart for porn?
Buying a Dell computer is equivalent to dropping the soap in a prison shower.
Perl and other languages should leave "good enough" alone when it comes to regular expressions and instead just make it easy to put chunks of grammars into programs.
Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. I don't care what language(s) you're proficient in, or if you've memorized every single design pattern the GoF has ever conceived, of do 4 foot by 6 foot UML diagrams in your head. If you can't do regexps then you're missing a basic skill. I bought Friedl's book a couple of years ago, and although I wound up not using man of the Perl related stuff the rest of the book helped me out immensely.
A programmer without knowledge of regular expressions is like a carpenter without a hammer.
Case in point: Six months ago I was handed a printed copy of our family that was to be published by my late uncle. About 1500 pages of history and geneology. After using a scanner and OCR to get the raw text I used Regular Expressions to:
1) Identify heirarchical relationships that were only denoted by standard oldered list types (1,1a,2,2a,3, I, II, etc).
2) Insert html markup to reproduce proper highlighting for names and indented lists.
3) Generate internal HTML links between individuals, their unique GEDCOM (LDS Geneology)number within the document.
4) Build an index for chapters and an appendix to link from name, sorted bu surname back into the main document.
5) Add special markup for converting the end HTML into indexed and linked PDF using HTMLDoc.
Time to complete the job -2 Weeks. Without the use of Regular expression this task would have been alsmost impossible and all my Uncle's work he did to put the information together for the last two years of his life would have been lost.
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
Who said anything about the Internet? Honestly I didn't read the article but I do have the first version of the book they are talking about and it has nothing to do with the internet rather pattern matching in programming.
That is why research into regexps is doomed to failure. It is a dead end. From a theoretical standpoint, regexps are cute and interesting, but for serious data prowling, you need something with a brain and a heart.
While I agree that for large amounts of data you need something other than a regex, but that certainly doesn't mean that regexs are dead or that we shouldn't try to make them better! I don't need Google's search algorithm to make sure my user's input matchs certain parameters and I would really hate to have to write
if $input contains really_evil_characters() die;
Regex is here to stay
The Anti-Blog
Regular expressions haven't changed since the seventies, at the latest. Now if you want to say that implementations of regular expressions are advancing, fine. Let's be precise in our use of language, or not.
Ummm. Are you a programmer? Sure, you don't need Regexp's to solve every problem (and probably don't need them for MOST problems), but there are many problems that are solved so much more elegently WITH regexp's than without that once you understand them, IF you are a programmer, you wouldn't give them up. They are invaluable tool in a programmers toolkit.
Time to complete the job -2 Weeks /me ducks and runs
That's pretty cool...regexps let you finish jobs two weeks before you start them.
Who modded this guy up?
Regex's aren't useful because you don't use them on the Internet? What kind of argument is that?
"Let's be precise in our use of language, or not."
Very compressed contentlessness.
"I see that you are writing a regular expression"
-- Knowing too much can get you killed, but knowing who knows too much can make you rich.
yeah... just look at perl, a language that is tightly tied to regular expressions. it's only one of the most-used programming languages on the planet.
i don't mean to be a regexp or perl nazi, but i've found regexp's to be nothing but useful for the times when a split-by-whitespace isn't enough and a full-fledged parser is too much.
regexp's won't solve world hunger or cure cancer, but last time i checked there were alot of folks using perl and grep.
I have the first edition of "Mastering Regular Expressions" and it is indeed a very fine useful book.
For a nice way to get started with regular expressions I recommend the wonderful "txt2regex" console program. It provides a simple text based wizard-like interface. You answer questions and the program builds your regular expression for you. See:
http://txt2regex.sourceforge.net/
Stupid job ads, weird spam, occasional insight at
Where are moderator points when you need 'em?
To those who can't read (or write) them, regular
expressions look like line noise. But once you learn to read them you can condense whole paragraphs of spaghetti conditionals into a single, clear (to the initiated), terse line.
For manipulating strings of characters, they are probably the single most important innovation of the last 20 years.
[talk about regexps are not so usefull..., but ...] What has become useful is what Google [google.com] taps into. And that is the human aspect. Data isn't important because it matches a*(b|c)a*. It's important because it is useful to people. Think about it: when you are looking for wares or porn, where do you go? Perl? Nope. IRC. Why? Because of the human element.
I understand your thinking.
But your thinking is wrong.
Think about it (no pun intended).
How much better would google be if one could use regexps in one's search request.
regexp and datamining are orthogonal.
Regexps are interesting, sure.
Not really. I use them all the time and the only time they are interesting is when you're done and they look completely silly.
Every CS student enjoys (or suffers through!) the regexp section of their Intro to Computability (or equivalent) course.
Not really. I got a degree in Computer Engineering from the #2 private engineering school in the country and I was never taught regex. If you know how to program and not just crank out syntax, you can pick up regex on your own pretty fast.
And it is pretty fun thinking about the expressive power of, say (a|b)*a*b*
That is actually a really boring regex. Lots of a's or b's folowed by lots of a's followed by lots of b's. Wow. My brain is fried.
However, we have to face the facts, that regexps, as good as they are from a mathematical standpoint at matching things, just aren't that helpful in sorting through the sea of data that is the Internet.
Wow. You're probably right. I'll bet nothing that searches for things on the internet, such as google.com, uses any regex internally in their code. Now that I'm facing the facts, you're right, regex is worthless when it comes to searching through any amount of data.
The input data just aren't orderly enough for regexps to be of any use.
Yeah, regex is best used for very very simple patterns. Anything more complex than your above example is best suited for some serious hand-parsing in visual basic.
Think about it: when you are looking for wares or porn, where do you go? Perl? Nope.
I don't know WTF you're talking about. I find ALL my porn at www.perlmonks.org
That is why research into regexps is doomed to failure.
Yeah, I should probably throw away all that perl regex code I've written thats made my company lots (and I mean lots) of money in the market. It is doomed. I should writing my pattern matching code in the google.com language.
Thank you for posting about something you apparently know very little about. Good for an afternoon giggle.
...we could find everything we wanted, not only nearly everything we're looking for on the internet. Imagine the power of pattern matching (ie Perl compatible regexps) on Google's database!
:-)
Seriously, how hard do you guys think it would be to implement such a feature? I can't imagine they haven't played with the idea -- they've probably estimated the need for additional resources, and thrown it right away
Maybe to a certain small class of people, "regular expression" means what you want it to mean. To 99.99% of the people who use the phrase, it means what the book describes, and those things have changed considerably.
Many precise mathematical or scientific terms have different meanings to laymen. What is a positive number? I'm sure I learned whether 0 is a positive number way back when, but right now it simply doesn't matter. Context is usually good enough, and when not, > and >= work wonders. Quantum leap as used by mere mortals has the meaning of incredible revolutionary exciting change, but scientifically, it means the smallest possible change.
So foo to you.
Infuriate left and right
You forgot to mention that it also does what you're trying to accomplish as a side-effect of what you actually told it to do.
That's the thing that bothers me the most. The examples people are giving here are symptoms of the real problem. If you want to know if this piece of data you're looking at is a number, use a programming language that supports numbers and see if you can make a number out of it, then use the number.
Want to know if your input contains ``nasty characters?'' Wrong, you don't. You want to know if it's what you are expecting. At best, you want to know if it contains only characters you have authorized in a proper sequence that makes sense. You can sometimes do this with a regex, but not treating your data as strings in the first place gets rid of most of the desires to do textual processing on it. It makes your application development easier, more correct, and easier to read. It also makes it less likely to need to be modified down the road when you realize you overlooked unicode conversions, null characters, or other things that cause security holes.
-- The world is watching America, and America is watching TV.
The original poster says that the "behind-the-scene purposeis apparently to push a new book that O'Reilly published this month". Actually, that's pretty much the main point of the article -- to justify the need for a second edition, and to let people know what they'd get (or, if not interested, what they're passing on).
I wrote the article so that people would have a feel for what's new in the book. Of course, my hope is that people are interested in the new content, but my general feeling is that the worst that can happen is that someone buys the book and finds out that it's not what they expected. Unmet expectations pretty much suck, and I hope the article helps avoid some of that suckage.... and piques some interest, as well.
Jeffrey
One of the important aspects of using regexes is to know their limits and not try to use them outside of those limits.
What a fool believes, he sees, no wise man has the power to reason away.
It's not just a Perl book, but the language independent and Perl dependent parts are a godsend.
I was a full time Perl programmer (with a two hour commute by rail) when Friedl's book came out. I read it cover to cover, and then recommended it strongly to my co-workers.
Friedl shows how to write powerful, readable, efficient regular expressions that can do a lot of the work your program needs to do. It changed how my group wrote Perl (very much for the better). This is more than highly recommended; after the Blue Camel, and even before the Cookbook, this is a definitive book for all those who call themselves "Perl programmers."
(In the first edition of the book, Friedl discovered some problems with regular expressions in early versions of Perl 5. The very next release of Perl -- 5.003, I think -- immediately fixed these problems. When Larry & Co. pay attention to a Perl book, maybe you should, too?)
Stupid job ads, weird spam, occasional insight at
Mark me as a troll or whatever but, "What are regular expressions?"
/:+[^:]/ statements? whats the big deal then?
are they those
I'm really, really new to perl, studying it out of an O'Rielly book. What does this mean to me?
forget it.
Time to bust out the books and get cooking again.. I still need to catch up on *current* regex.. They keep changing some of the functions around in PHP as well also (anyone else had this issue?) which break code when you switch from one box to another. Anyway... This looks interesting..
anime+manga together at last.. in real time.
Regardless, I knight thee sir bunratty.
--Giving to trolls for the benefit of us all
Way back when there was a programming language called "Snobol". It still lives (www.snobol4.com for a good starting point).
Snobol is *THE* string pattern matching language. Nothing else beats it (and I've been playing around with string processing languages for over 20 years).
Yes.. it's syntax is different and the language hasn't changed in years (decades?). But it does the job exceedingly well.
You might also want to take a look at the Icon programming language (www.cs.arizona.edu/icon).
Icon was developed by some of the same folks that developed Snobol. While not quite as powerful as Snobol in terms of expressing patterns, Icon extended some concepts. You can build up your own pattern matching functions.
One of the best quotes I saw in an discussion concerning Icon and regular expressions (the discussion was that Icon lacked a builtin regular expression facility) was
"Putting regular expressions into Icon would be like putting training wheels on a Harley" -- (I really wish I could remember who said that).
Anyway... just something you might want to check into.
Well, maybe they thought about it. But if they only implement a (non-trivial) subset of regexp search I will admire them even more.
Regexp are horrible from a complexity point of view.
According to this link regepx's complexity is of O(M*N), where M unfortunately is in the order of Googles DB, if my short calculation is correct. Note, this may be wrong, but the point stays that regexp searching is quite expensive and kills most of the optimizations you could do if you didn't want to provide them.
This is exaxtly why I miss Alta Vista advanced search. I could frequently drill down with much more precision than Google will allow. The only reason I switched to Google was because they started indexing more pages than AV, and Alta Vista eliminated the all text version of their search page.
The problem with regular expressions is that there are so many constraints. for example:
- \<John.+Doe\>
should match:- JohnClark
- ...text...JaneDoe
But this shouldn't match:As you can see, even with a very simple regular expression like this, the text has to be processed a lot to get the results needed. A simple "John AND Doe" would match all of the results while the regular expression puts more restraints on the search, which takes longer to process. For complex regular expressions, the searching of text becomes too slow for large amounts of data, such as the internet.
...interesting if true.
Whether you love Microsoft or hate it, there's no denying the popularity of Visual Basic. With the regular-expression package in the .NET Framework, Microsoft provides a package that can be used by VB.NET, C#, Visual C++, and any other language that wants to link to it -- even Python and Perl! The consistency is appealing, but even more important is the package itself: it's powerful and fast, and can it can hold its head up high next to Perl or any other regex package out there.
VB's regex syntax is exactly like Perl's. In fact, when I started working with regexes in VB and I couldn't find something in the documentation I would look it up in one of the O'Reilly Perl books. Much to my "shock", I could do everything Perl regexes could do, even the things that weren't in the documentation.
I strongly suspect Microsoft took full advantage of Perl's "artistic license" when they came up with their regex engine.
No, Thursday's out. How about never - is never good for you?
I understand that Perl 6 isn't near being done, and that the "r" in "Perl" doesn't necessarily stand for "regex", depending on who you ask, but Perl will always have the greatest influence over what is called a regex. Or is that going to change with Perl 6?
It means greater than 0. You use "non-negative" to include 0.
I recently wrote a program taking a SCS data stream with embedded DJDE's that was intended to go through an SCS->3211 converter (that is the data contains both line printer commands and 3211 syntax command that get passed through) and converted this data into AFP. You bet I needed power regexes for this.
You use regexes to write parsers and parsers can't treat their strings as just "data".
Well, I don't find it fair that you were modded as a troll. You may be just misinformed.
I can tell you that _any_ decent *nix gives you complete knowledge of what is going on in your machine. Without having to look at source code, without having to go to some central repository of information.
Now, press Ctrl-Alt-Del in your favorite Windows and take a look at the name of the services. Try to enter any of them in the MSDN search. What do you see? Do they tell you what that service does? How is it started? How can you stop it?
Do you still praise MSDN so high when you see that they don't even tell you the basics?
Regular expressions are old hat. I'm much more interested in the advances in irregular expressions, as used in the old Firth and Pasquale languages.
But, of course, everyone knows that a real coder uses irregexps in disassembly language.
A Perl "regular expression" is more powerful than a mathematical "regular expression." Perl's can do backtracking, which a finite automaton can't do.
The Perl "RE" "(a+)b\1" will match aba and aaaabaaa, but not abaa or aaba.
tough competitive force. . . . It's non-traditional, it's free and it's cheap", Steve Ballmer about Linux
Hmm, methinks Steve is using cheap as in "of inferior quality or worth" or "contemptible because of lack of any fine, lofty, or redeeming qualities." Webster's. Personally I think the word is best applied to Windows.
What those who want activist courts fear is rule by the people.
I'd like to take all my existing regular expressions and run them through regular expressions to turn them into new age regular expressions. Can I do this, or will the universe implode?
Skiers and Riders -- http://www.snowjournal.com
Then Perl's use of the term "regular expression" is a misnomer. This post makes the correct analysis of the situation. Those who came up with Perl's "regular expressions" should have thought of (or invented) a more accurate term.
An essential book if you ever use perl, php, e?grep, sed, awk, vi, or a number of other programs.
Do you even lift?
These aren't the 'roids you're looking for.
Regular expressions - making line noise useful since 1956!
Julian
(btw, it's a "Internet RFC standard compliant email address matcher")
You know what I'd like? A regex syntax I could use in shell scripts that would take less time to debug than the equivalent C++ program.
-a
How to rationalize theft.
(* Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. *)
That can be said about anything. IMO, many OOP fans were simply crappy at procedural/relational programming and design (either due to lack of training, or a non p/r mind). The faults they often find with p/r are their own bad thinking about p/r, and not OO's strengths.
I think reg.ex's would be easier to learn and read and remember if they were broken down into user-definable chunks of some kind. It could be more like defining a generational grammer (substitution): you define the symbols rather than live with what Larry Wall or whoever picks. A special set of functions or operators would simplify the defining of the symbol sets.
Further, I would like to see the peices parsed into a table (or some easy-to-navigate structure) so that second passes can be done. In other words, divide up per-character parsing and per-token parsing.
I admit that it may not be as compact as regexp's, but easier to read for those don't need it every day.
Regular expressions come across as a stringy diarreac glob of an irriducable mess of symbols if you don't keep up. It is like forgetting to ride a bicycle if you do not do it every 3 months or so to refresh.
I realize that everybody is different, and what bothers me may not bother others. I just don't personally like the approach resexp's took. I would like to see it broken down into clearer chunks. IOW, the syntax would (clearly) dictate the chunks instead of running the rules in one's head to find the boundaries and context.
I know I will get called a bunch of names for saying this all, but that is my opinion, take it or leave it.
Table-ized A.I.
Another tool is shell scripting. At a past company Symantec Cafe was used for developing a Java application. When I joined, I immediately created shell scripts for myself to do automated builds for a couple reasons:
I showed others how to use them, but only one other developer took the time to get used to it, never having used a shell before. The others complained that they shouldn't have to learn a new tool (shells and scripts) when Cafe sufficed. I explained the advantages, but to no avail.
Well, a few months later we finally hired a real QA and release engineer. Since we were building a J2EE application to run on Linux in testing and Solaris in deployment, we needed automated builds on Unix. There was a huge rush to get everyone up to speed on the new build system using shell scripts.
Hmm, that was a bit long-winded just to make the point that there are many useful tools to developers that don't involve the actual code they write. I've used regexps to create SQL data files and config files as mentioned. You'll learn many things, so keep open and don't stop learning. :)
Freedom to fear. Freedom from thought. Freedom to kill.
I guess the War on Terror really is about freedom!
> Not really. I got a degree in Computer Engineering from the #2 private engineering school in the country and I was never taught regex. If you know how to program and not just crank out syntax, you can pick up regex on your own pretty fast.
To be a little pedantic the original poster probably meant being taught regular expressions in a formal language theory framework, where one talks about properties of computability. The same course would teach things like finite state machines (which in terms of computability are equivalent to classical regular expressions though I think not to perl regexps), context free grammars (pushdown automata) and turing machines, and just general computability (and maybe complexity) theory. All of these things have a great deal to do with how programs work, and the lack of such theory is probably actually one of the drawbacks of doing something like computer engineering over computer science. (at least from my perspective)
For the ultimate in regex'ing ... hardware regex accelerators!!!
I'm so glad to know my PC has an infinite amount of memory to parse XML! Jeez, I guess I didn't need to buy that extra stick of RAM after all.
If you have a brain, none of the things you mention is a "basic skill". They are merely trivial implementation details. The only basic skills for a programmer are a) problem solving skills, and b) comunication / interpersonal skills.
So don't say you know how to write regular expressions and expect anyone to care. It's really no more important than knowing the Windows API, for instance. In either case, any programmer worth his/her salt can learn the required details with minimal effort. Now, if you can implement regular expressions, then I might be mildly impressed.
And it doesn't help you one bit, since you've got no way of verifying whether that mail address actually exists, even if it is RFC-compliant.
Well, they were originally from grep (general regular expression processor), which, AFAIK is equivalent to a finite autamaton. Perl/awk/sed adopted these, and then extended them. It would have been kind of awkward to start calling things which used to be regular expressions something else halfway through the evolution of a tool. But yes, if you want to be pedantic, Perl's "regular expressions" are too powerful to be called by that name.
When I bought 'Mastering Regular Expressions' I figured I was buying a book that wouldn't go out of date. Like all the classic O'Reilly books, it wasn't something trendy, it didn't have a software version number in the title....
Something makes me suspicious that the author just bought a new power boat and 2nd Edition buyers are paying for it.
I think he meant "#2 from the bottom".
The computer engineering program at my school, UMBC, is exactly the same as the computer science program except you are required to take additional courses which would otherwise count as elective credit if you took CS (stuff like more physics and an electrical engineering course).
when people become intelligent enough to use it and Arthur finishes K4. Watch Kuro5hin within the next month for an Introduction to K to appear (I will also submit it here, but I doubt it will get posted).
that link only considers non-deterministic regexps. If you don't need backreferences, you can do deterministic pattern matching, which is inexpensive once the regexp is compiled into a lookup table. I think ACM from a couple months ago had an article on some research into deterministic regular expression matching with limited backreference support. Anyhow, if you read the google white papers and the founders' graduate student research on searching, it's fairly clear they do hierarchial keyword indexing, which is good for fast lookups, but the data isn't well formatted for regular expression processing. You could write a tool to use the google api to do a preliminary search on constant words in your regexp and then have your client do a regexp search on the results. Hmm, I may have to try that for the next google programming challenge :)
(GNU egrep does something similar by doing an fgrep on static terms before doing the full-blown regexp search)
Do you even lift?
These aren't the 'roids you're looking for.
- Text processing - why isn't your text marked up? Converting data into text, passing it along, and then trying to pluck the data back out of the text is brittle and leaves you with a system that can't be upgraded - your components can't be improved to produce a more informative text stream as it will break all the regexpr's of all the components that use that stream etc.
- Parsing - how many times have you encountered a HTML or XML parser written with a regexpr? Unless your job requires you code by the seat of your pants, this is just plain lazy. Parsers written with regular expressions are always incomplete (ie they work on the subset of HTML/XML they were tested on, and if the requirements or layout ever changes they break), and they are very slow compared to a proper parser. Proper robust and well tested parsers are available under most licenses and for most languages.
- Development - Regular expressions appear to be developed with a 'try it and see' methodology - people write the regexpr and test it, thinking if it works then they must have done it right. This is very brittle, I've ecountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed - there's always some guy with a % in their email address or some other oddity the author of the regexpr forgot or didn't know about (and lets not even think about trying to make an RFC compliant email address regexpr, it would have to handle "blarg@wibble"@slashdot.org)
I don't know, there are other issues with regexpressions but I've spend too long on this post already. I'm curious as to other's views on this - I've just come to associate use of regular expressions with flakey or hastily written software.Text straight from the keyboard of a user won't be marked up and seems a good place to be using regular expressions. Due to the popularity of brittle and unupgradable (is that a word?) text processing, the input from other programs might not be marked up either, here regexprs are necessary (ie symptomatic of poor design, but it wasn't your decision).
This applies to much more than just HTML or XML, eg if you're going to write a javadoc clone for your pet language, do it properly, don't do it with regular expressions.
That HTML tag stripper you hacked up, did you remember to handle comments? Just because there weren't any comments in the HTML it was tested on doesn't mean it'll never encounter them in the real world (wouldn't be an issue if an off the shelf parser is used).
Maybe he is an impostor who is trying to get Jeff fired... I mean, would YOU want employees that are willing to spend part of their free time improving the competitors product?
I mean, its perfectly legal, but the point is: How would this look in the Evil Human Resources Dept? Are they going to think about promoting this guy in the near future?
No sig for the moment.
This is incredibly fascinating. I honestly don't know if I'd rather read this article or study up on Boron.
This is a pretty lousy explanation of how things work, although I do agree that regular expressions have limits and you should be familiar with what they are.
I might not get every single bit of the technical vocabulary right on this one, but do try to follow along anyhow (and please only refute the REALLY glaring errors).
Basically, with regular expressions, you get what Chomsky (famed linguist and political extremist...er, nut =) ) referred to as a type-3 grammar, or roughly something that can be solved with a deterministic finite-state automata (DFA. Ok, you might argue that you get an NFA (nondeterministic finite-state automata), but using the subset construction, it's so easily converted to a DFA, we'll just pretend we're working with a DFA.)
Basically, a DFA works like this: Think of a table. You start out at one row (the starting state) and based on the input you get, you move to another row/state. One or more of these states is specially marked as an accepting state, so if you run out of input characters on one of those states, the string is accepted and everyone is happy. If you run out of input on any state not marked as such, the string is rejected. (DFA's are often expressed as graphs as well, but from a programmatic perspective, it's really easy to just use a table, or for the pedantic, a list of nodes (containing the current state, and where to go on any possible input...a list of lists)).
Maybe we can go more simple than that: You're sitting on a nerdly board game. You draw a card that says "B. Go to the square labeled as 'R'", so you go to R and draw another card. You keep picking cards and following the directions on them until you get "$: The game is over. If you're on a square that is labeled with an underlined letter, you win. If the letter is not underlined, you lose."
So what does this mean? In terms of compiler/language theory, we can use a regex to recognize tokens (or individual words), but they aren't very powerful when it comes to syntax (Our lexer would be happy with "sentence This a is.", but by our grammar, it doesn't make a lot of sense. We, as people, could guess the meaning, but computers are still really bad at guessing anything (especially your weight). A parser would be necessary to figure out if things make any sense by the rules of the grammar, which would refuse "sentence This a is." but accept "This is a sentence.") If you're setting up the rules of our example nerdly board game, you could set up a number of states that could find any word in your language. (If the first state is "E" and the next state is "a" and the next state is "t", followed by "$" (commonly used for end of input, but in your language you might specify the end of the word being a space or some bit of punctuation instead), you'll have successfully parsed "Eat", which by your rules is considered to be a valid accepting state. In the same sense, if you pick "R" followed by "Z", you might move off to some error state you've specified, where no matter what input follows, you'll always loop back around to that same state, because you know for certain there's no word you want to accept that starts with "RZ".)
So to answer the question "can regexp validate XML?", the answer is yes, in the sense that it can be used to scan for valid XML components (words), and no, in that it can't tell well-formed XML from poorly-formed XML tags (sentences). A regex alone isn't quite powerful enough to understand that ">>>>XML" and "<XML>" aren't both perfectly acceptable.
Sort of.
Could you write enough rules that some really large set of regex could do it? Maybe, but it's a mathematical proof that's way out of my league, but I'll warn you now: you'll be writing so many cases for every possible permutation that you'll probably go batty trying. Part of what all this language theory got us was an understanding that some tools are good at one task, but lousy at others.
If you're interested in this further, the Dragon book (search for it on google, you'll find it as "dragon book" faster than its real title, which I've forgotten) is considered the canonical source for this sort of thing, although it can be horribly dry and hard to read. There are some other compiler theory books out there, and some aren't quite as dull (though arguably less informative. I wasn't able to prove my nerdliness by reading more than a handful of pages of the dragon book, though I found it to be a great reference for filling in the gaps of the other books (which were more prone to shameless hand-holding))
comp.compilers can be a good source as well, though sometimes a bit intimidating. Read through it, see if you can find references to the stuff you don't really understand, and just try to absorb what's there.
-transiit
Which is why O'Reilly is the first place I look for a book. Ther ratio of well/badly written books is better there than anywhere else. The only books I will order online. All others, I want to page through them in a bookstore first.
Best Slashdot Co
thanks.
Sure, if you're only making webpages.
Insert offensive troll-style sig here. Please mod or respond appropriately.
I'm sorry, but
<John.+Doe>
Should not match "JohnDoe", and should match "John Doe". you need one or more characters between John and Doe in that regexp.
John AND Doe doesn't do shit for you in search engines either. I like the NEAR clause when I am searching for information because I often have to find things like "scanPORT specification" and I end up getting pages talking about a module with scanPORT and the specifications for the module, instead of for scanPORT. Having a NEAR clause or even a <scanPORT.{,30}specification> would help.
- And it is pretty fun thinking about the expressive power of, say (a|b)*a*b*
That is actually a really boring regex. Lots of a's or b's folowed by lots of a's followed by lots of b's. Wow. My brain is fried.Actually that regexp matches any text at all. * is 0 or more matches, not one or more. Personally I think the really interesting regexps use lookaheads but that's just me.
but for serious data prowling, you need something with a brain and a heart.
Infact skip the heart, you just need something with a brain and common sense.
Ah, the "everyday" programming of the "everyday programmer". I would like to see a poll of "everyday programmers" that asks "How often do you use regular expressions?" I would bet a small bundle that most "everyday programmers", which in the real world are mostly business application developers, I think, rarely use regular expressions. Doesn't help much when getting customer info. into or out of a db. It sure doesn't seem to help much in today's "Search engines" that only seem to give me someone else's business card, rather than real info on the subject I've asked about. (Rant pause...)
Your pet peeve is probably your worst personality trait.
You're correct in saying that regexps alone can't validate XML (or any hierarchical structure, come to that). This is an instance of the bracket-matching problem: given a string composed of opening and closing brackets that can nest, determine whether the string is properly balanced or not. For instance, ()() and (()()) are balanced, while (() and (())) are not.
The reason that a regexp can't do this is that it can't keep track of which opening brackets haven't been closed. A regexp has no memory of what it's already seen. All it knows is what state it's in now, and what token is coming next. OK, some programming languages implement regexps in such a way as to provide some sort of memory of what's been seen, but these usually feel like kludges.
If you're prepared to put up with an arbitrary limit on how deeply you can nest brackets, then you can solve the bracket-matching problem with an automaton that has N states, numbered 1 to N. If the automaton is in the state numbered x, that means that it's seen x opening brackets that haven't been closed yet. The instructions for each state would be "if you see an opening bracket, go to state x+1, if you see a closing bracket, go to state x-1, and if you see the end of the string, it isn't balanced." Exceptions would be that in state 1, if you see the end of the string, it's balanced, and if you see a closing bracket, it isn't balanced. In state N, if you see an opening bracket, the brackets are nested too deeply.
Of course, no theoretical computer scientist would ever accept arbitrary limits on how deeply a structure could be nested, which is why you would use a context-free (aka type 2) grammar to solve problems like this one.
Just another wannabe fantasy novelist...
So you miss the chance to build a good user interface. If you verify the email address with a regex, you could present a decent error message to the user that has just made a typo. Maybe he typed "username@hotmail.co", so before sending him an email, you can right on ask for the correct address. If the user leaves your site, you won't be able to contact him anymore.
I go further, even checking for common errors and suggesting the correct address. Did the user entered "username@hotmail.com.br" or maybe "www.user@isp.com", they I warn that the address is probably incorrect and suggest "username@hotmail.com" and "user@isp.com" even before accepting the input.
An user email is a valuable asset. The first step building a long relationship. Don't miss the chance to get it right on the first time.
yeah, after rereading my comment a day later, I realize I did slip up in part of my explanation: If your first state is "E", the next character to input is "a", you would go to state "Ea", and with "t", go to "Eat", which is listed as an accepting state.
Our good friend (or foe, depending on what you're trying to prove) the pumping lemma does give us an idea that N 'A's followed by N 'B's is impossible in this context (or A^n followed by B^n isn't regular).
You've got the easy hack of "Take all possible tokens. Create an additional set of states for each where there's an arbitrary number of parentheses, brackets, etc. in front of them and the same number behind them" (which would be really big, unless you've got a really small list of tokens or cut off the number of parentheses, etc. to a really small number, or both.)
A^nB^n doesn't work. A^mB^n does work, provided you don't try to be too specific about what M or N are, just that they're some nonnegative number.
On another note, it's good news to me that I did get most of the meat of it right. Glad to hear I'm not getting the second-rate education I'd feared. =)
-transiit
Backreferences (in Perl 4 and some other regex libraries, but not in sed or awk), can express things like
{ xxx | x in \Sigma }
which is not a type 2 language:
% cat input
ab
abc
abcabc
abcabcabc
abcabcabcabc
abcabcabcabcabc
% perl -lne '/^(.+)\1\1$/ and print' input
abcabcabc
In Perl 5, assertions provide another form of context sensitive matching.
Meanwhile, I think you're right that Perl 5 regexes cannot express all of type 2,
unless you cheat by using (?{ code }).
Yes, the number of states would rapidly become unmanagable if you tried to hack together a finite state automaton to recognise a language such as A^nB^n for limited n. Actually, the problem isn't so much the number of states as the number of transitions between them, which is roughly the number of states multiplied by the number of symbols in your alphabet. This is sometimes known as the state-symbol product.
I suppose you could use some sort of regexp compiler to take the grunt work out of it. Ah... any programming language that has regexps has one of those built in anyway. Well, being able to write something like /A^nB^n/(n<=3) would be an advance on /^|AB|AABB|AAABBB$/, I suppose.
On the subject of education, I did a one-term course on theoretical computer science in general, and another specifically on formal languages. I can safely say they've been of no direct, practical use, but they provided a good foundation to what I studied in other courses and what I've learned since graduation.
Perhaps it's not important for everyone in computing to know what a DFA is, or understand the pumping lemma or the Church-Turing hypothesis, but it's necessary that some of us do. Otherwise, who will write the compilers for the next generation of programming languages? Who in the team that builds one of those compilers will tell the PHB that (the general case of) the halting problem is unsolvable, so that the marketing department really shouldn't be claiming that it will be able to detect infinite loops in a user's program before running them?
Just another wannabe fantasy novelist...