Slashdot Mirror


Mastering Regular Expressions

gianluca writes "Having always been a heedful guy, I always duly did my homework, going through the lengthy manual pages of a number of regular expression (regex) crunching tools. You name it: be it PERL, awk, emacs, sed or even one of the .NET framework languages -- any such program provides support for the same regexes (or at least, so it seems to the occasional observer). After some years of regex practice with these tools, I had the pretentious conviction that I knew my way through the intricacies of patterns, grouping, greediness, and the like. When I first stepped into Mastering Regular Expressions, looking at the nearly 500 pages that make up Friedl's book, I wondered what someone could ever have to say about regexes to fill so many pages." Gianluca ended up finding plenty of worthwhile content; read below for his review.

Mastering Regular Expressions, 2nd edition
author: Jeffrey E. Friedl
pages: 460
publisher: O'Reilly
rating: 9.5
reviewer: Gianluca Insolvibile
ISBN: 0596002890
summary: An in-depth guide to lead the apprentice to mastering regular expressions' wizardry

My first suspicion, I admit, was that I was facing one of the countless "man page reprints" that you find these days. It was only after reading the book that I eventually understood: before then, I had had no idea of what regexes were really about.

What it's about

The book is logically divided into three parts: the first one (Chapters 1, 2 and 3) introduces the reader to the basic concepts of regexes, building a common ground upon which the subsequent chapters will be based. The introduction is clear and straightforward, and lets readers quickly grasp the key points in the regex business. This part is more or less a good summary, presenting information that can also be found in existing manual pages (albeit in a distilled form, which shows that the author has very clear ideas about the matter). If you already know something about regexes, you could skip this part entirely, though reading it is a nice opportunity to brush up on and overhaul your knowledge.

The second part (Chapters 4, 5 and 6) is the one that struck me most for the depth of the information provided and the richness of thought. Rather than throwing usage dictates for one regex flavour or another at the reader, the author explains in great detail the inner mechanisms that make regexes run and how you can exploit such knowledge to write better expressions.

Chapter 4 presents the different families of regex processing engines (namely, DFA, traditional and POSIX NFA), whose internal behavior differs so greatly that writing a regex in the appropriate way can make a substantial difference in both efficacy and efficiency. If you thought you knew it all about greedy and lazy regex operators, possessive quantifiers, backreferences and lookaround, you'd better think again: I was pleasantly surprised to discover how ignorant I was (to be honest, I had never heard of lookaround operators before!).
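For readers who, like the reviewer, have not met these operators before, here is a minimal sketch of greedy versus lazy quantifiers and of lookbehind, written in Python (whose re module is a backtracking engine). The sample strings and variable names are made up for illustration and are not taken from the book.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, then backtracks to the LAST ">".
greedy = re.search(r"<.+>", html).group()

# Lazy: .+? grabs as little as it can, stopping at the FIRST ">".
lazy = re.search(r"<.+?>", html).group()

prices = "cost $45, qty 3, discount $5"

# Lookbehind: match digits only when preceded by "$"; the "$" is not consumed.
dollars = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: digits NOT preceded by "$" or by another digit.
plain = re.findall(r"(?<![$\d])\d+", prices)

print(greedy)
print(lazy)
print(dollars, plain)
```

The greedy pattern returns the whole line, while the lazy one stops at the first tag; the two lookbehind patterns pick out the dollar amounts and the bare quantity respectively.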

Chapter 5 slows down a little bit to let the reader absorb the massive previous chapter. Some simple (but still tricky) examples are presented, showing how to apply the techniques explained up to this point. A couple of examples are perhaps too contrived (ever needed to match aligned groups of 5 digits in an unspaced stream of characters?), but it is instructive anyway to follow the reasoning behind the construction of a complex regex.
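To give a feel for why the "aligned groups of 5 digits" problem is trickier than it sounds, here is a small Python sketch. The digit stream is a hypothetical sample, and the chunk-then-filter approach is just one simple way to keep matches aligned; it is not presented as the book's own solution.

```python
import re

# Hypothetical sample: ten 5-digit groups run together with no separators.
groups = ["03824", "53144", "94116", "15213", "44182",
          "95035", "44272", "75201", "02174", "43235"]
stream = "".join(groups)

# Naive search: also finds "44..." runs that straddle two groups.
naive = re.findall(r"44\d{3}", stream)

# Aligned search: consume the stream five digits at a time, then keep
# only the groups that start with "44".
aligned = [g for g in re.findall(r"\d{5}", stream) if g.startswith("44")]

print(naive)    # includes misaligned hits such as "44941"
print(aligned)  # only the groups that really start with "44"
```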

Chapter 6 focuses on efficiency, considering how backtracking and matching can drive your regex engine to exponential complexities. Optimization techniques are then presented, first by explaining the automatic optimizations performed by the most common regex engines and then by giving a practical list of hints that you can follow to be sure that your expression will run as fast as possible. Again, I was quite surprised to find out how small changes in a regex can make such a big difference to the engine (and give rise to noticeable performance penalties if ignored).
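A rough way to see this kind of blow-up for yourself is to time a pattern with nested quantifiers against a flat equivalent on a string that cannot match. The sketch below uses Python; the helper name timed_match and the input length are arbitrary choices for this illustration, and the actual timings will depend on your machine.

```python
import re
import time

def timed_match(pattern, text):
    """Return (matched, seconds) for a single re.match call."""
    start = time.perf_counter()
    matched = re.match(pattern, text) is not None
    return matched, time.perf_counter() - start

# 18 "a"s followed by a character that makes the overall match fail.
text = "a" * 18 + "!"

# Nested quantifiers: the engine tries every way of splitting the "a"s
# between the inner and outer "+" before giving up.
slow = timed_match(r"(a+)+$", text)

# Same set of matching strings, but only one way to match them.
fast = timed_match(r"a+$", text)

print("nested:", slow)
print("flat:  ", fast)
```

Both patterns fail on this input, but on a backtracking engine each extra "a" should roughly double the time taken by the nested version, so keep the input short when experimenting.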

What I absolutely liked most was that the author explains exactly why a certain optimization works, based on the information given in Chapter 4 (and provided that you have been able to assimilate it in the first pass). Finally, a paragraph entitled "Unrolling the loop" really put me in a good mood, reminding me of the past times of "old school" asm programming.
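As a sketch of what loop unrolling looks like in practice, the Python snippet below rewrites an alternation inside a star into the shape normal* (special normal*)*, using a double-quoted string with backslash escapes as the test case. The sample text and variable names are assumptions made for this illustration.

```python
import re

# Alternation form: one trip through the alternation per character.
alternation = re.compile(r'"(?:\\.|[^"\\])*"')

# Unrolled form: normal* (special normal*)*
#   normal  = [^"\\]  (anything but a quote or a backslash)
#   special = \\.     (a backslash followed by any character)
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

text = r'say "hello \"world\"" and "bye"'

found_alt = alternation.findall(text)
found_unrolled = unrolled.findall(text)

print(found_alt)
print(found_unrolled)
```

Both patterns find the same two quoted strings; the unrolled one simply lets the engine run through stretches of ordinary characters without re-entering the alternation each time.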

The third part of the book devotes three chapters to PERL, Java and .NET, respectively. Each chapter goes through the syntax and features of regexes for each language: while the information provided on Java and (VB).NET is quite commonplace, in the case of PERL the author deals with aspects rarely covered elsewhere, like dynamic regexes, embedded-code constructs, regex-literal overloading and specific optimization techniques.

What's to like

In one word: insight. The author is definitely knowledgeable about regular expressions and the whole book is filled with thoughtful suggestions and hints. Still, a friendly and straightforward writing style makes reading pleasant and seldom boring (well, you wanted details, didn't you?) while you learn about internal regex mechanics that are rarely explained elsewhere.

A further nice point is the broad view offered to the reader, starting from regexes in general and focusing on specific flavours only in the final part of the book. The second edition also offers up-to-date information, covering the .NET framework and the latest versions of PERL (5.8) and Java (1.4).

What's to consider

Despite the book's reassuring conversational tone, dealing with such a specific topic in so much depth might sometimes become boring, especially if you do not have a strong interest in getting the most out of regular expressions or in knowing how they work internally. If you are just an occasional regex user and dwell in manual pages, you can probably live without this book. Also, it is a pity that specific sections on Tcl, emacs and awk have disappeared in the second edition (maybe they were not as current as the .NET framework?) and that pcre (a C regex library) is barely mentioned.

The summary

Regular expressions are tied so strongly to the *nix culture that everyone who has been exposed to that culture has come to use them in a more or less conscious way. Still, most of the available documentation dwells on basic features and presents only the most common regex operators. Mastering Regular Expressions is the book to read if you want to go further and get serious about regexes: even if extreme optimization might not be a big concern today, understanding how regex engines work under the hood also helps greatly in creating small everyday expressions.

Table of Contents

Preface
Chapter 1. Introduction to Regular Expressions
Chapter 2. Extended Introductory Examples
Chapter 3. Overview of Regular Expression Features and Flavors
Chapter 4. The Mechanics of Expression Processing
Chapter 5. Practical Regex Techniques
Chapter 6. Crafting a Regular Expression
Chapter 7. Perl
Chapter 8. Java
Chapter 9. .NET

You can purchase Mastering Regular Expressions, 2nd edition from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

20 of 252 comments

  1. i mastered regular expressions by Anonymous Coward · · Score: 5, Funny

    when figuring out the lameness filter

  2. I was going to read this by L.+VeGas · · Score: 4, Funny

    but instead I *

    1. Re:I was going to read this by nick_urbanik · · Score: 3, Funny
      but instead I *

      ...read spaces to the end of the line, or the next non-space character :-)

  3. Re:Line endings with sed by Anonymous Coward · · Score: 1, Funny

    looks like it.

  4. Obligatory crap regexp joke by BabyDave · · Score: 5, Funny
    Regular expressions are tied so strongly to the *nix culture
    Shouldn't that be .*nix instead?
  5. Re:What's new in this edition? by Anonymous Coward · · Score: 1, Funny

    Yes, they are. The first matches only the string 'Hi there', the second will match 'How are yo' or 'How are you'.

  6. I just can't fathom this by Anonymous Coward · · Score: 4, Funny

    Now, I thought I was reading a simple article about a programming book review. And here I come across this thread of epic mirth. Somehow you have single-handedly crafted a finely-tuned piece fun-joy from what was a rather mundane topic. I just have to page my boss back to the office to see this! Gather round the water cooler old salts and let me spin a comedic yarn I saw this day on Slashdot. Using an asterix to finish a sentence we would have all seen as being finished in a different manner? Well sir, someone set you up the bomb. You have taken that bomb, added the asterix into the mix and exploded laugh-shrapnel into Slashdot proper. I couldn't even scroll down without getting struck in the eye with a piece of your fun-bomb. Mods, mod this man's excursion into the comedy arena as +5 StopItHurts. Here we sit, emotionally spent and basking in the aftermath of your comedic genius. Thank you kind sir, thank you.

  7. regular expressions? by Anonymous Coward · · Score: 1, Funny

    I'd be happy if the editors could master spelling and grammar

  8. Soviet Russia Regex by TheFlyingGoat · · Score: 4, Funny

    s/\A(.*?)\s+(.*)\Z/In soviet Russia, $2 $1s you!/i;

    --
    You have enemies? Good. That means you've stood up for something, sometime in your life. --Winston Churchill
  9. Re:Regular Expressions by pi_rules · · Score: 2, Funny
    Its caldera's c++ portable regex lib.


    Don't! It's probably got a Unix kernel in it. Beware the lawyers.
  10. All i have to say is: by jdew · · Score: 5, Funny

    Thats a big regex
    stupid filter wouldn't let me paste the regex here XD

  11. Why is it that people think regexps are hard? by SkewlD00d · · Score: 1, Funny

    All you have are zero-or-more "+", one-or-more "*", conditional "? or sometimes "[ ]", scan-sets "[a-zA-Z]", grouping "()" or "{}", non-CFG count range {}2,3, sentintel chars ^ $ etc., place-holders for replacement, dont match "~" or "!", match any single char ".", and maybe a few more odds-ends. It's these bozos that think "regexp" sounds cool, but doesn't want to learn what they are. In general, these generalized extended regular expressions are easily implemenatable w/ efficient DFA and NFA->DFA conversion (i hate that algorithm!!!). If you need a 500 page book on regexps, you might want to have a look at a good compiler book (red dragon, etc.) first. Full non-CFG languages are so much more powerful than any regexp could ever dream of being, and more importantly they can have state.

    --
    The biggest trick the devil pulled was letting lawyers become politicians so they can write the laws.
    1. Re:Why is it that people think regexps are hard? by muonzoo · · Score: 3, Funny
      SkewlD00d writes:

      Why is it that people think regexps are hard

      All you have are zero-or-more "+", one-or-more "*", conditional "? or sometimes ...

      ...these bozos that think "regexp" sounds cool...

      Just like the bozo who just finished a Formal Computation course, yet mixed up the meanings of "+" and "*" ? ;-)


      From man grep:

      A regular expression may be followed by one of several
      repetition operators:
      ? The preceding item is optional and matched at most
      once.
      * The preceding item will be matched zero or more
      times.

      I hear they're serving humble pie at the school cafeteria today. ;-)

  12. And he's Qualified to review this book???? by CSG_SurferDude · · Score: 3, Funny

    (to be honest, I had never heard of lookaround operators before!).

    Gezzzz, This guy hasn't even heard of lookaround operators before? What a clueless fool! He should be driven from /. after being tarred and feathered!

    Everyone knows that a lookaround operator is that guy that goes into the bank first to make sure that there aren't any armed guards or policemen/women getting their paychecks deposited.

    /me runs and hides now! ;-)

  13. Re:Another karma whore post by Anonymous Coward · · Score: 1, Funny

    Yea, the post should at least bust on Microsoft, make some kind of esoteric unfunny comment about CowboyNeal, or praise CmdrTaco to get +4 Interesting. Some moderators are smoking crack today, and they're not sharing.

  14. Re:+4 Informative? He doesn't even have to own... by carlos_benj · · Score: 2, Funny

    I suppose on /. that would be considered a regular expression....

    --

    As a matter of fact, I am a lawyer. But I play an actor on TV.

  15. My Version... by BinaryCodedDecimal · · Score: 5, Funny

    Mastering Regular Expressions:

    Repeat after me:

    "I'm so hungry, I could eat a horse."

    "It's been raining cats and dogs."

    "I'll sleep with you when Hell freezes over."

    And my personal favourite:

    "Oh look, Hell just froze over!"

  16. what did one regular expression say to the other? by jdew · · Score: 3, Funny

    what did one regular expression say to the other?
    .*

  17. Re:Don't go overboard by tshak · · Score: 4, Funny

    Friedl had an example of a huge, horrible (but efficient) regex to parse mail headers in the first edition

    And I'm pissed that it's NOT in the second edition (at least it couldn't easily be found). I was trying to impress this chick at B&N the other day by showing her how I understood that longass expression and low-and-behold, the back page where it's SUPPOSED to be is filled with a 3 line regex - not very impressive after you've made a huge deal about a full-page regex. Fortunately it all worked out since I had the original at home, and I was like "well, you'll just have to come over to MY place to check out the big regex". ;-)

    --

    There is no longer anything that can be done with computers that is nontrivial and clearly legal. -- Paul Phillips
  18. Re:Don't go overboard by kmellis · · Score: 2, Funny
    And I'm pissed that it's NOT in the second edition (at least it couldn't easily be found). I was trying to impress this chick at B&N the other day by showing her how I understood that longass expression and low-and-behold, the back page where it's SUPPOSED to be is filled with a 3 line regex - not very impressive after you've made a huge deal about a full-page regex. Fortunately it all worked out since I had the original at home, and I was like "well, you'll just have to come over to MY place to check out the big regex". ;-)
    When I read this book, I found myself in amazement at the enormous powers of regexes--you can do almost anything with them!

    However, it never occurred to me, oddly, to use regexes as a tool of seduction. I guess I just don't understand the ladies.