Slashdot Mirror


Mastering Regular Expressions

gianluca writes "Having always been a heedful guy, I always duly did my homework, going through the lengthy manual pages of a number of regular expression (regex) crunching tools. You name it: be it PERL, awk, emacs, sed or even one of the .NET framework languages -- any such program provides support for the same regex expressions (or at least, so they seem to the occasional observer). After some years of regex practice with these tools, I had the pretentious conviction that I knew my way through the intricacies of patterns, grouping, greediness, and the like. When I first stepped into Mastering Regular Expressions, looking at the nearly 500 pages which make up Friedl's book, I wondered what someone could ever have to say about regexes to fill so many pages." Gianluca ended up finding plenty of worthwhile content; read below for his review.

Mastering Regular Expressions, 2nd edition
author: Jeffrey E. Friedl
pages: 460
publisher: O'Reilly
rating: 9.5
reviewer: Gianluca Insolvibile
ISBN: 0596002890
summary: An in-depth guide to lead the apprentice to mastering regular expressions' wizardry

My first suspicion, I admit, was that I was facing one of the countless "man page reprints" that you find these days. It was only after reading the book that I eventually understood: before then, I had had no idea of what regexes were really about.

What it's about

The book is logically divided into three parts: the first one (Chapters 1, 2 and 3) introduces the reader to the basic concepts of regexes, building a common ground upon which the subsequent chapters will be based. The introduction is clear and straightforward, and lets the readers quickly grasp the key points in the regex business. This part is more or less a good summary, presenting information that can be found also in existing manual pages (albeit presented in a distilled form, which lets you perceive that the author has very clear ideas about the matter). If you already know something about regexes, you could skip this part entirely -- even if reading it turns out to be a nice occasion to brush up and overhaul your knowledge.

The second part (Chapters 4, 5 and 6) is the one that struck me most for the depth of the information provided and the richness of thought. Rather than throwing usage dictates for one regex flavour or another at the reader, the author explains in great detail the inner mechanisms that make regexes run and how you can exploit such knowledge to write better expressions.

Chapter 4 presents the different families of regex processing engines (namely, DFA, traditional and POSIX NFA), whose internal behavior differs so greatly that writing a regex in the appropriate way can make a substantial difference in both efficacy and efficiency. If you thought you knew it all about greedy and lazy regex operators, possessive quantifiers, backreferences and lookaround, you'd better think again: I was pleasantly surprised to discover how ignorant I was (to be honest, I had never heard of lookaround operators before!).
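
For readers who, like the reviewer, have not met lookaround before, here is a minimal sketch in Python's `re` module (chosen only for illustration; the book itself covers Perl, Java and .NET). The function name `add_commas` and the pattern name `DOUBLED_WORD` are made up for this example. A lookaround asserts something about the text at a position without consuming it, while a backreference must repeat what an earlier group captured.

```python
import re


def add_commas(digits: str) -> str:
    """Insert thousands separators using only zero-width assertions."""
    # (?<=\d)         lookbehind: there is a digit just before this position
    # (?=(?:\d{3})+$) lookahead: a multiple of three digits remains to the end
    return re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", digits)


# Backreference: \1 must match the same text that group 1 captured,
# so this finds a word immediately repeated after whitespace.
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

print(add_commas("1234567"))  # 1,234,567
print(DOUBLED_WORD.search("this is the the problem").group(1))  # the
```

Neither assertion in `add_commas` consumes a character, which is why the substitution only inserts commas and never replaces digits.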

Chapter 5 slows down a little bit to let the reader absorb the massive previous chapter. Some simple (but still tricky) examples are presented, showing how to apply the techniques explained up to this point. A couple of examples are perhaps too contrived (ever needed to match aligned groups of 5 digits in an unspaced stream of characters?), but it is instructive anyway to follow the reasoning behind the construction of a complex regex.
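
The "aligned groups of 5 digits" problem can be sketched as follows, again in Python and with an invented sample stream (the data and variable names are assumptions, not taken from the book). The goal is to find the first 5-digit code starting with 44 while respecting the 5-digit boundaries of an unspaced stream.

```python
import re

# Four 5-digit codes run together: 12344 41234 44567 99999
stream = "12344412344456799999"

# A naive search ignores alignment and grabs digits that straddle two codes.
naive = re.search(r"44\d{3}", stream)

# Anchoring at the start (re.match) and lazily skipping whole 5-digit codes
# keeps the match aligned to code boundaries.
aligned = re.match(r"(?:\d{5})*?(44\d{3})", stream)

print(naive.group())     # 44412
print(aligned.group(1))  # 44567
```

The lazy quantifier `*?` makes the engine try zero skipped codes first, then one, then two, so the first aligned candidate wins.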

Chapter 6 focuses on efficiency, considering how backtracking and matching can drive your regex engine to exponential complexities. Optimization techniques are then presented, first by explaining the automatic optimizations performed by the most common regex engines and then by giving a practical list of hints that you can follow to be sure that your expression will run as fast as possible. Again, I was quite surprised to find out how small changes in a regex can make such a big difference to the engine (and give rise to noticeable performance penalties if ignored).
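
A small, hedged illustration of the backtracking problem, using Python's backtracking `re` engine: the two patterns below accept the same strings, but the nested quantifier gives the engine exponentially many ways to split a run of `a` characters before it can report failure. The helper name `time_failed_match` and the test string are assumptions made for this sketch, and actual timings depend on the machine and Python version.

```python
import re
import time


def time_failed_match(pattern: str, text: str) -> float:
    """Return the seconds taken for a match attempt that is expected to fail."""
    start = time.perf_counter()
    result = re.match(pattern, text)
    elapsed = time.perf_counter() - start
    if result is not None:
        raise ValueError("expected the match to fail")
    return elapsed


text = "a" * 20 + "c"  # no "b" anywhere, so both patterns must fail

slow = time_failed_match(r"(a+)+b", text)  # nested quantifiers: many splits to try
fast = time_failed_match(r"a+b", text)     # same language, one quantifier

print(f"nested: {slow:.4f}s  flat: {fast:.6f}s")
```

Each additional `a` in the input roughly doubles the work for the nested version, which is the kind of small change with a large effect that the review describes.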

What I absolutely liked most was that the author explains exactly why a certain optimization works, based on the information given in Chapter 4 (and provided that you have been able to assimilate it in the first pass). Finally, a paragraph entitled "Unrolling the loop" really put me in a good mood, reminding me of the past times of "old school" asm programming.
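
As a rough sketch of what "unrolling the loop" usually means for regexes: an alternation inside a star, such as (special|normal)*, is rewritten as normal*(special normal*)*, so the common case is consumed in long runs rather than one alternation step at a time. The example below applies that shape to a double-quoted string with backslash escapes; the pattern names and the sample text are invented for illustration.

```python
import re

# Alternation inside a loop: every character goes through the (?:...|...) choice.
ALTERNATION = re.compile(r'"(?:\\.|[^"\\])*"')

# Unrolled form: normal* (special normal*)*
#   normal  = [^"\\]   any character that is neither a quote nor a backslash
#   special = \\.      a backslash followed by any character
UNROLLED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

text = r'say "a \"quoted\" word" now'

print(ALTERNATION.search(text).group())
print(UNROLLED.search(text).group())
```

Both patterns match the same strings; the unrolled one simply gives a backtracking engine fewer decision points to revisit.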

The third part of the book devotes three chapters to PERL, Java and .NET, respectively. Each chapter goes through the syntax and features of regexes for each language: while the information provided on Java and (VB).NET is quite commonplace, in the case of PERL the author deals with aspects rarely covered elsewhere, like dynamic regexes, embedded-code constructs, regex-literal overloading and specific optimization techniques.

What's to like

In one word: insight. The author is definitely knowledgeable of regular expressions and the whole book is filled with thoughtful suggestions and hints. Still, a friendly and straightforward writing style makes reading pleasant and seldom boring (well, you wanted details, didn't you?) while you learn internal regex mechanics rarely available elsewhere.

A further nice point is the broad view offered to the reader, starting from regexes in general and focusing on specific flavours only in the final part of the book. The second edition also offers up-to-date information, covering the .NET framework and the latest versions of PERL (5.8) and Java (1.4).

What's to consider

Despite the book's reassuring conversational tone, dealing with such a specific topic with so many in-depth details might sometimes become boring, especially if you do not have a strong interest in getting the most out of regular expressions or in knowing how they internally work. If you are just an occasional regex user and dwell in manual pages, you can probably live without this book. Also, it is a pity that specific sections on Tcl, emacs and awk have disappeared in the second edition (maybe they were not as current as the .NET framework?) and that pcre (a C regex library) is barely mentioned.

The summary

Regular expressions are tied so strongly to the *nix culture that everyone who has been exposed to that culture has come to use them in a more or less conscious way. Still, most of the documentation around lingers on basic features and presents only the most common regex operators. Mastering Regular Expressions is the book to read if you want to go further and get serious about regexes: even if extreme optimization might not be a big concern today, understanding how regex engines work under the hood also helps greatly in creating small everyday expressions.

Table of Contents

Preface
Chapter 1. Introduction to Regular Expressions
Chapter 2. Extended Introductory Examples
Chapter 3. Overview of Regular Expression Features and Flavors
Chapter 4. The Mechanics of Expression Processing
Chapter 5. Practical Regex Techniques
Chapter 6. Crafting an Efficient Expression
Chapter 7. Perl
Chapter 8. Java
Chapter 9. .NET

You can purchase Mastering Regular Expressions, 2nd edition from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

33 of 252 comments

  1. Slashbot book review by rkz · · Score: 1, Informative

    This one is a great addition to the bookshelf. You all know how to do certain things with regular expressions, but this book clarifies nicely why you are actually doing them. Also, it introduces nice advanced concepts which occasional regex users might not have come across before.

  2. Re:Regular Expressions by qorkfiend · · Score: 3, Informative
  3. Re:Regular Expressions by rkz · · Score: 4, Informative

    try this

    It's Caldera's C++ portable regex lib.

  4. Cheap prices on Half.com by cybermint · · Score: 5, Informative

    I just purchased an almost new copy on Half.com for under $15 including shipping. There are still a few left at prices far lower than amazon.com or bn.com. Here is the half/ebay link.

    1. Re:Cheap prices on Half.com by cybermint · · Score: 2, Informative

      DOH! I didn't notice. I wish slashdot would let you edit posts.

      At $15 compared to $30, I'm not going to cancel my order even if it is just the 1st edition. The only parts I'll miss are the extra info on new Perl 5.8 features, and maybe the Unicode stuff. Guess I'll be reading perldoc.com for that.

  5. Perl, not "PERL" by carl67lp · · Score: 5, Informative

    It's always surprised me when I see intelligent people write "PERL" when they refer to Larry Wall's programming language.

    From the Perl FAQ, General Questions About Perl:

    What's the difference between "perl" and "Perl"?
    One bit. Oh, you weren't talking ASCII? :-) Larry now uses ``Perl'' to signify the language proper and ``perl'' the implementation of it, i.e. the current interpreter. Hence Tom's quip that ``Nothing but perl can parse Perl.'' You may or may not choose to follow this usage. For example, parallelism means ``awk and perl'' and ``Python and Perl'' look ok, while ``awk and Perl'' and ``Python and perl'' do not. But never write ``PERL'', because perl isn't really an acronym, apocryphal folklore and post-facto expansions notwithstanding.

    You can read the entire FAQ if you like.

    1. Re:Perl, not "PERL" by br0ck · · Score: 5, Informative

      From an interesting interview with Larry Wall - 1999..

      Marjorie: Well, that certainly answered the question fully. I must admit I didn't expect you to go back as far as the beginning of the Universe. :-) How'd you come up with that name?

      Larry: I wanted a short name with positive connotations. (I would never name a language ``Scheme'' or ``Python'', for instance.) I actually looked at every three- and four-letter word in the dictionary and rejected them all. I briefly toyed with the idea of naming it after my wife, Gloria, but that promised to be confusing on the domestic front. Eventually I came up with the name ``pearl'', with the gloss Practical Extraction and Report Language. The ``a'' was still in the name when I made that one up. But I heard rumors of some obscure graphics language named ``pearl'', so I shortened it to ``perl''. (The ``a'' had already disappeared by the time I gave Perl its alternate gloss, Pathologically Eclectic Rubbish Lister.)

      Another interesting tidbit is that the name ``perl'' wasn't capitalized at first. UNIX was still very much a lower-case-only OS at the time. In fact, I think you could call it an anti-upper-case OS. It's a bit like the folks who start posting on the Net and affect not to capitalize anything. Eventually, most of them come back to the point where they realize occasional capitalization is useful for efficient communication. In Perl's case, we realized about the time of Perl 4 that it was useful to distinguish between ``perl'' the program and ``Perl'' the language. If you find a first edition of the Camel Book, you'll see that the title was Programming perl, with a small ``p''. Nowadays, the title is Programming Perl.

  6. netLibrary by dboyles · · Score: 4, Informative

    I first started reading this book via netLibrary through my school's library. Just the first two chapters are enough to explain regular expressions to the point where one can use them effectively in programs. The remaining chapters expand on this information and discuss language specifics. I bought a paper copy to have on my shelf, and I constantly find myself referencing it.

    To those at universities, see if your school offers netLibrary-based books. It's easy to read and it's free.

    --
    -- "Complacency is a far more dangerous attitude than outrage." -Naomi Littlebear
  7. Re:Different than 1st Edition? by sharlskdy · · Score: 5, Informative

    You can read about the differences by clicking here, which is an article by the author outlining the differences.

  8. that's the first edition by SweetAndSourJesus · · Score: 4, Informative

    Which isn't a big deal, I guess.

    Mastering Regular Expressions is now in its second edition. Mr. Friedl has posted a nice writeup about what's different in the second edition.

    --
    the strongest word is still the word "free"
  9. They can be hard by DeadSea · · Score: 4, Informative

    I know from my own experience that writing a regular expression to describe something is not always as easy as it would seem at first glance. I found it difficult to write a regular expression to define a C-style comment: /* comment */ Well, not impossible, just more difficult than I thought it would be. I posted my thought process about how I constructed a regular expression to pick out a C-style comment on my website. It's the kind of thing I like to ask interview candidates.
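
For readers following along, one widely used way to write such a pattern is sketched below in Python. It is not necessarily the expression DeadSea arrived at (the linked write-up is not reproduced here), and the name `C_COMMENT` and the sample source line are assumptions for this example.

```python
import re

# /\*                  opening delimiter
# [^*]*\*+             anything up to a run of one or more stars
# (?:[^/*][^*]*\*+)*   if that run was not followed by "/", keep going
# /                    closing slash right after a star
C_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

source = "int x; /* hello */ y /* a * b **/ z"
print(C_COMMENT.findall(source))
# ['/* hello */', '/* a * b **/']
```

As the replies note, a pattern like this knows nothing about context, so it will also match comment-like text inside a string literal unless it is combined with other lexer rules.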

    1. Re:They can be hard by Otter · · Score: 3, Informative

      It's probably worth mentioning: KDE comes with a GUI regexp constructor. Googling for alternatives shows a similar Windows app.

    2. Re:They can be hard by DeadSea · · Score: 2, Informative
      You make an excellent point. The regular expression I came up with would not do the right thing in that situation when finding comments in your text editor.

      Parsers are, however, based on regular expressions. I originally wrote this regular expression when I was writing a lexer (using JFlex) for Java. The examples that I saw used a state machine and I wanted to do it with a regex. When combined with regular expressions to find string literals (and all the regular expressions for other junk), it does the right thing.

      I should put your example on the page somewhere. :-)

    3. Re:They can be hard by FroMan · · Score: 3, Informative

      Nope, it wouldn't. Give it a try. I don't have access to a Unix box here right now. But at least the little Java app I put together works correctly.

      Assuming you wanted to capture "/* hello */" out of "/* hello */ hello */"

      You see what you are missing is the '?' modifier that will cause the "(.*\r?\n)*" to not be greedy. Same with the ".*".

      I think you are just missing some of the functionality of regexes. You might want to pick up this book. ;-)

      --
      Norris/Palin 2012
      Fact: We deserve leaders who can kick your ass and field dress your carcass.
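
FroMan's point about the '?' modifier can be checked with a short Python sketch. The patterns here are simplified stand-ins, not the exact expression under discussion in the thread, and the variable names are made up for the example.

```python
import re

text = "/* hello */ hello */"

# Greedy: .* runs to the end of the line, then backtracks to the LAST "*/".
greedy = re.search(r"/\*.*\*/", text).group()

# Lazy: .*? grows one character at a time and stops at the FIRST "*/".
lazy = re.search(r"/\*.*?\*/", text).group()

print(greedy)  # /* hello */ hello */
print(lazy)    # /* hello */
```

In Python, passing `re.DOTALL` lets the dot cross newlines, so the lazy form would also handle comments that span several lines.
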
    4. Re:They can be hard by Mr. Droopy Drawers · · Score: 2, Informative

      Technically, all regexes are lexers, not parsers. Parsers must be able to be recursive.

      --

      To Copy from One is Plagiarism; To Copy from Many is Research.

    5. Re:They can be hard by maniac1860 · · Score: 2, Informative

      I'm afraid you're wrong. Parsers are stack based. Try matching parentheses with a lexer (or RE).
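
To make the stack argument concrete, here is a minimal Python sketch of bracket matching (the function name `is_balanced` is invented for this example). The nesting depth is unbounded, so the checker needs memory that grows with the input, which a classical regular expression does not have.

```python
# Map each closing bracket to the opening bracket it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}


def is_balanced(text: str) -> bool:
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)  # remember the opener until its closer arrives
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # leftover openers mean something was never closed


print(is_balanced("f(a[0], {b: (c)})"))  # True
print(is_balanced("(]"))                 # False
```

Some regex flavours offer recursive extensions that can handle this, but those go beyond regular expressions in the formal sense.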

  10. Regex Learning Tool by johndiii · · Score: 4, Informative

    Regex Coach is a great free tool for learning about regular expressions and constructing them interactively. Both Linux and Windows versions are available.

    --
    Floating face-down in a river of regret...and thoughts of you...
  11. Online resource by dema · · Score: 4, Informative

    I'd be interested to check that book out as I use regular expressions a lot in PHP. But for those of you looking for a resource online, check out RegExLib. I use it often when I'm having trouble putting an expression together and have found it extremely helpful.

  12. REGEX for Brazilians by maizena · · Score: 2, Informative

    Regex rules, but I wouldn't know anything if it weren't for this book in Portuguese: http://guia-er.sourceforge.net/. The printed version is always with me wherever I go.

  13. C++ Regular Expressions by TheOldBear · · Score: 5, Informative

    The Boost C++ libraries have a regular expression package. Take a look at http://www.boost.org/libs/regex/index.htm

    --
    Caution: Do not stare into laser with remaining eye.
  14. Newbie review by Telastyn · · Score: 2, Informative

    I also have this book [actually right next to me]. I'd put off learning Perl [and indirectly regexes] for some time, because... well, I was a Windows admin by trade. Now that I do other [actual] work, the time came to pick up some other tools.

    Even though I had hardly dealt with regexes at all, the book was very easy to get into. The first few chapters go through the basic matching structures, along with the requisite history. All of the points are made with understandable real-life examples, with diagrams and [a small amount of] actual code. The later chapters go through individual languages, covering which features are there, what the nuances are, and a few of the gotchas. I must admit that I probably learned more useful things about Perl from this book than from any other source. There is also a large section [which I did not read, and cannot comment on] which actually details the nuts and guts of regexes.

    All in all, it's easily the best instructional [as opposed to reference] text I've ever purchased.

  15. errata by Anonymous Coward · · Score: 4, Informative

    The reviewer forgot to mention the wonderful errata list of the book! Can be found here.

  16. Interpreting parser by Frans Faase · · Score: 3, Informative

    If you want to have something more powerful than regexps, and still have it as an interpreter, you might have a look at an interpreting parser that I wrote: IParse.

  17. Or without a book... by Iscariot_ · · Score: 2, Informative

    For those who don't want to buy a book, here's a nice page with pre-built regexps for doing all sorts of things: RegexLib.

  18. re-builder for Emacs by David Ishee · · Score: 3, Informative

    The re-builder mode is great for debugging regexps in Emacs. This is the latest version as far as I can tell: re-builder 1.2

    --
    Your password has expired, please login to change it.
  19. You actually liked this book? by Forgery · · Score: 2, Informative

    I have a previous version of Friedl's book and found it needlessly confusing. The author's examples often leave much to be desired. I have no doubt that all of the information about regex is somewhere in the book, but it takes an extraordinary amount of work on the reader's part to extract it.

  20. There are no ".NET Framework" languages by ClubStew · · Score: 2, Informative

    ...or even one of the .NET framework languages

    There are no ".NET framework" languages. There are languages that target the Common Language Runtime, or the CLR. The .NET Framework is merely a class library like the JDK/JRE. If he doesn't even know that, why should I trust his book review?

  21. Re:Perl, Java, .NET.. oh my! by sketerpot · · Score: 2, Informative
    The big part of regular expressions is learning how to read and write them well. After that, just find some documentation for your language of choice.
  22. The best place for buying technical books is... by Draxinusom · · Score: 2, Informative

    www.bookpool.com

    Mastering Regular Expressions, 2nd Edition
    Our Price: $24.50

    Bookpool is consistently the cheapest place to buy technical books. And no, I am not affiliated with them in any way.

  23. Funny you should say that... by devphil · · Score: 2, Informative


    ...about switching programming environments. Right now there's some discussion about problems in regex engines which follow you around as you switch environments, due to problems in the engines.

    Current versions of glibc (apparently) made some inefficient design choices in their regex engine. When other tools such as sed switched to using glibc's version, their performance dropped quite a bit, leading to a couple of bug reports.

    The interesting thing is, one of the messages in the bug report mentions this book. It had been a few years since I covered DFAs and NFAs in college, so I got a copy yesterday. Came back home to find this review on /.

    --
    You cannot apply a technological solution to a sociological problem. (Edwards' Law)
  24. Post is from a troll template (see below) by Chad E Dirks · · Score: 2, Informative

    "Today I got roughly 4 first posts but then slashdot wouldn't let me post anymore. So thats enough trolling for one day." - rkz

    To be honest, that this exact same post template has been moderated highly again and again in recent book reviews is becoming more humorous than anything. Unfortunately, and this is addressed to certain moderators, I believe it would be correct to say the laughing is 'at you' and your misfortune rather than 'with you'.

    If you would like to confirm that you are being 'taken in', click on the link represented in the parent post by the user name to be taken to a list of this user's recent posts.

    There, view the user's posts made in recent book review comment threads to see this exact template used in multiple "Book Review" comment threads.

    Here are several past instances of this template being used by this user:
    Mac OS X Unleashed (2nd Edition)
    Dynamic HTML: The Definitive Reference (2nd Edition)
    Linux Network Administrator's Guide, 2nd Edition

    In the future, please feel free to cite this post as a reference to inform the opinion of future Book Review comment moderators.

  25. Sample Chapters by darkpurpleblob · · Score: 2, Informative

    Sample chapters from the book, Java and .NET, are available in PDF format from the book page on O'Reilly's site.

  26. Re:tend to be the de-facto standard - dream on! by GGardner · · Score: 2, Informative
    Perl-style regexps tend to be used on things that post-date perl.

    True, but things get tricky quickly -- plain-old Unix awk predates perl. But GNU-awk (gawk) does not, so it has some perl-style regexp features, like \w, which are missing from Unix awk.