Mastering Regular Expressions

← Back to Stories (view on slashdot.org)

Mastering Regular Expressions

Posted by timothy on Tuesday June 24, 2003 @04:15AM from the bondage-and-domination dept.

gianluca writes "Having always been a heedful guy, I always duly did my homework, going through the lengthy manual pages of a number of regular expressions (regex) crunching tools. You name it: be it PERL, awk, emacs, sed or even one of the .NET framework languages -- any such program provides support for the same regex expressions (or at least, so they seem to the occasional observer). After some years of regex practice with these tools, I had the pretentious conviction that I knew my way through the intricacies of patterns, grouping, greediness, and the like. When I first stepped into Mastering Regular Expressions, looking at the nearly 500 pages which build up Friedl's book, I wondered what could someone ever have to say about regexes to fill so many pages." Gianluca ended up finding plenty of worthwhile content; read below for his review. Mastering Regular Expressions, 2nd edition author Jeffrey E. Friedl pages 460 publisher O'Reilly rating 9.5 reviewer Gianluca Insolvibile ISBN 0596002890 summary An in-depth guide to lead the apprentice to mastering regular expressions' wizardry

My first suspicion, I admit, was that I was facing one of the countless "man page reprints" that you find these days. It was only after reading the book that I eventually understood: before then, I had had no idea of what regexes were really about.

What it's about The book is logically divided into three parts: the first one (Chapters 1, 2 and 3) introduces the reader to the basic concepts of regexes, building a common ground upon which the subsequent chapters will be based. The introduction is clear and straightforward, and lets the readers quickly grasp the key points in the regex business. This part is more or less a good summary, presenting information that can be found also in existing manual pages (albeit presented in a distilled form, which lets you perceive that the author has very clear ideas about the matter). If you already know something about regexes, you could skip this part entirely -- even if reading it turns out to be a nice occasion to brush up and overhaul your knowledge.

The second part (Chapters 4, 5 and 6), is the one that struck me most for the depth of provided information and the richness of though. Rather than throwing at the reader usage dictates on one or another regex flavour, the author explains with a wealth of details the inward mechanisms which make regexes run and how you can exploit such knowledge to write better expressions.

Chapter 4 presents the different families of regex processing engines (namely, DFA, traditional and POSIX NFA), whose internal behavior differs so greatly that writing a regex in the appropriate way can make a substantial difference in both efficacy and efficiency. If you thought you knew it all about greedy and lazy regex operators, possessive quantifiers, backreferences and lookaround, you'd better think again: I was pleasantly surprised to discover how ignorant I was (to be honest, I had never heard of lookaround operators before!).

Chapter 5 slows down a little bit to let the reader absorb the massive previous chapter. Some simple (but still tricky) examples are presented, showing how to apply the techniques explained up to this point. A couple of examples are perhaps too contrived (ever needed to match aligned groups of 5 digits in an unspaced stream of characters?), but it is instructive anyway to follow the reasoning behind the construction of a complex regex.

Chapter 6 focuses on efficiency, considering how backtracking and matching can drive your regex engine to exponential complexities. Optimization techniques are then presented, first by explaining the automatic optimizations performed by the most common regex engines and then by giving a practical list of hints that you can follow to be sure that your expression will run as fast as possible. Again, I was quite surprised to find out how small changes in a regex can make such a big difference to the engine (and give rise to noticeable performance penalties if ignored).

What I absolutely liked most was that the author explains exactly why a certain optimization works, based on the information given in Chapter 4 (and provided that you have been able to assimilate it in the first pass). Finally, a paragraph entitled "Unrolling the loop" really put me in a good mood, reminding me of the past times of "old school" asm programming.

The third part of the book devotes three chapters to PERL, Java and .NET, respectively. Each chapter goes through the syntax and features of regexes for each language: while the information provided on Java and (VB).NET is quite commonplace, in the case of PERL the author deals with aspects rarely covered elsewhere, like dynamic regexes, embedded-code constructs, regex-literal overloading and specific optimization techniques.

What's to like In one word: insight. The author is definitely knowledgeable of regular expressions and the whole book is filled with thoughtful suggestions and hints. Still, a friendly and straightforward writing style makes reading pleasant and seldom boring (well, you wanted details, didn't you?) while you learn internal regex mechanics rarely available elsewhere.

A further nice point is the broad view offered to the reader, starting from regexes in general and focusing on specific flavours only in the final part of the book. The second edition also offers up-to-date information, covering the .NET framework and the latest versions of PERL (5.8) and Java (1.4).

What's to consider Despite the book's reassuring conversational tone, dealing with such a specific topic with so many in-depth details might sometimes become boring, especially if you do not have a strong interest in getting the most out of regular expressions or in knowing how they internally work. If you are just an occasional regex user and dwell in manual pages, you can probably live without this book. Also, it is a pity that specific sections on Tcl, emacs and awk have disappeared in the second edition (maybe they were not as current as the .NET framework ?) and that pcre (a C regex library) is barely mentioned. The summary Regular expressions are tied so strongly to the *nix culture that everyone who has been exposed to that culture has come to use them in a more or less conscious way. Still, most of the documentation around lags on basic features and presents only the most common regex operators. Mastering Regular Expressions is the book to read if you want to go further and get serious about regexes: even if extreme optimization might not be a big concern today, understanding how regex engines work under the hood greatly helps also in creating everyday small expressions. Table of Contents Preface
Chapter 1. Introduction to Regular Expressions
Chapter 2. Extended Introductory Examples
Chapter 3. Overview of Regular Expression Features and Flavors
Chapter 4. The Mechanics of Expression Processing
Chapter 5. Practical regex techniques
Chapter 6. Crafting a Regular Expression
Chapter 7. Perl
Chapter 8. Java
Chapter 9. .NET

You can purchase the Mastering Regular Expressions, 2nd edition from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

13 of 252 comments (clear)

Min score:

Reason:

Sort:

Perl, Java, .NET.. oh my! by Gortbusters.org · 2003-06-24 04:19 · Score: 3, Interesting

This sounds like a nifty tool for those who have to switch programming environments quite often. I always find myself going back to the books when I either have to write a regex myself or decypher someone elses crazy looking expression.

--
--------
Free your mind.
Don't go overboard by apsmith · 2003-06-24 04:20 · Score: 3, Interesting

I read the first edition of this book - it was great, and completely changed the way I handled (and understood) perl regular expressions. It's tempting, after reading this book, to try to apply regex's to everything! Friedl had an example of a huge, horrible (but efficient) regex to parse mail headers in the first edition - my advice on that is, don't try that at home! Interspersing procedural logic with the regex's tends to make much cleaner and more readable code...

--
Energy: time to change the picture.
Different than 1st Edition? by khef · 2003-06-24 04:22 · Score: 3, Interesting

Can anyone that's read this describe what's changed from the first edition? Is it worth shelling out the cash if you already have the first one?
Re:My problem with regular expressions... by Gabe+Garza · 2003-06-24 05:06 · Score: 5, Interesting
Amen!
I think a lot of the people who use RE's a lot would be well-served by brushing up on their recursive-descent parser writing skills. For only a little more time then it takes to write a regular expression, you can (if you know how) write a simple recursive-descent parser that:
- Is more readable (and thus maintainable)
- Is more efficient
- Has the potential to have much better error handling (e.g., a descriptive message instead of just "RE doesn't match! Ack!")
- Is much more scalable: recursive descent parsers can easily scale up to parsing an entire language (witness g++, which uses one to parse C++)
- Is likely to be a great deal more correct, because it forces you to actually define a language, instead of just iteratively building up an RE
Re:Why is it that people think regexps are hard? by Anonymous Coward · 2003-06-24 05:43 · Score: 1, Interesting

It sounds like you knew exactly as much about regular expressions as the reviewer did prior to reading this book.
The difference is that the reviewer was smart enough to suspect there was more to it, and he was right, and you are not.
"regular expressions" is actually a misnomer given the power of modern regex tools. They have all those fancy things like context that you like so much, having mastered what I am guessing is a whole semester of compilers and theory of computation classes.
Re:All i have to say is: by Bedrock · 2003-06-24 06:20 · Score: 2, Interesting

Another fun one is the REX shallow XML parser algorithm that's been around for some time. Check out http://www.cs.sfu.ca/~cameron/REX.html and scroll to appendix A for a Perl implementation. I recently had to reverse-engineer this approach and write a stack-based parser to run in an environment where Perl's :?$foo construct was broken. Much fun...
Parsers and regex by Beltway+Prophet · 2003-06-24 07:03 · Score: 2, Interesting

And recursion is lots of fun, but I use REs to recognize and extract tokens and boundaries, because it's so easy to write and change simple REs.

There is a middle way between overly complex REs which mere mortals cannot read nor safely modify, and overly complex parsers that never take advantage anything more functional than getc().
I've read the first edition and... by RevMike · 2003-06-24 07:14 · Score: 4, Interesting

I have to agree that this is a book that should be on everyone's shelf.
The very fact that both vi and emacs support regular expressions must mean they are a best-in-breed tool, because if there was a way for those two communities to disagree, they would have done it.
I love the fact that I can use the same expressions with grep, sed, vim, Perl, and Java. that being said, however, the critics are who warn that regex can be over used are correct: regex's are difficult to debug and to maintain, so don't go overboard.
Re:You actually liked this book? by melonman · 2003-06-24 07:15 · Score: 3, Interesting

I loved the first edition, probably for the reasons you didn't. I'd read several short overviews of regexes, including Larry Wall's one in the Camel book, and, while they got me doing simple stuff, they left me with lots of unanswered questions, and the more I experimented the more my "why doesn't that work?" list grew. The Friedl book is totally thorough, and, I thought, aggessively pedagogical, if you want to learn about how a regex engine works rather than pick up stuff in a cookbook fashion.

That said, I do wonder about the guy. The colophon was astounding: he wrote half the book using regexes on a computer on the other side of the world, using a 37.5 bit/hour connection by the sound of it, and then he proceeded to write his own typesetting system so he could produce a phoenetically alphabetical index in English, Japanese and probably some other languages that I missed. I think he ought to get out more...

--
Virtually serving coffee
contrived examples? by anonymous+loser · 2003-06-24 07:25 · Score: 4, Interesting

(ever needed to match aligned groups of 5 digits in an unspaced stream of characters?)

Yes, actually. Older FORTRAN codes (that have been slowly added to/modified over time) especially exhibit this kind of behavior thanks to formats that allow you to specify columns for output. The numbers actually run into each other on the line, and the only way to read the file is to know which column the data you want is in. I would never discount any regular expression example as contrived. Somewhere, someone has developed a program that uses that formatting in an input or output file, and someone else might need to be able to speak it's language in an automated fashion.
Re:They can be hard by Anonymous Coward · 2003-06-24 08:01 · Score: 1, Interesting

I found it difficult to write a regular expression to define a c-style comment: /* comment */ Well, not impossible, just more difficult that I thought it would be. I posted my thought process about how I constructed a regular expression to pick out a c-style comment on my website. It's the kind of thing I like to ask interview candidates.

Nice you ask your interview candidates. Hopefully one of them recognizes that your webpage's regexps fail in the presense of escape characters

/* comment *\ /
Re: C++ Regular Expression library by wbniv · 2003-06-24 14:40 · Score: 2, Interesting

i've been looking at the boost c++ regex library http://www.boost.org/libs/regex/ and i'm going to give it a try. as i'm doing more c++ programming these days ( i've been lucky to have been doing perl for the last couple of years :) ), i've been looking for quality, cross-platform, license-compatible c++ classes; boost seems quite good (and it's peer-reviewed, too)

i also just found this benchmark http://research.microsoft.com/projects/greta/regex _perf.html comparing boost vs. microsoft's greta http://research.microsoft.com/projects/greta/ which gives you "all the power of Perl 5 regular expressions in your C++ applications. These easy-to-use classes let you perform regular expression pattern matches on strings in C++." (from the website)
Re:My problem with regular expressions... by cgibbard · 2003-06-24 15:19 · Score: 2, Interesting

There are some really nice parser combinator libraries that are beginning to show up for constructing parsers very easily.

For example, check out Parsec
a monadic parser combinator library for Haskell. Have a look at its documentation to get a feel for what I'm talking about.

The basic idea is to construct parsers for complicated things by combining simpler parsers in various ways.

Parsec also has mechanisms that allow one to provide information so that good looking error messages can be constructed for the user should the parse fail, and also allows for one to tune parsers for efficiency reasonably well, while maintaining a fair bit of flexibility (by default, parsers are LL1 for efficiency, but they can be extended so as to provide potentially infinite lookahead if necessary).