Slashdot Mirror


Regular Expression Recipes

r3lody writes "If you spend time writing applications that have to do pattern matches and/or replacements, you know about some of the intricacies of regular expressions. For many people they can be an arcane hodgepodge of odd characters that somehow manage to do wonderful things, but they don't have enough time (or interest) to really understand how to code them. Nathan A. Good has written Regular Expression Recipes: A Problem-Solution Approach for those people. In its relatively slim 289 pages, he offers 100 regular expressions in a cookbook format, tailored to solve problems in one of six broad categories (Words and Text, URLs and Paths, CSV and Tab-Delimited Files, Formatting and Validating, HTML and XML, and Coding and Using Commands)." Read on for the rest of Lodato's review.

Regular Expression Recipes: A Problem-Solution Approach
author: Nathan A. Good
pages: 289
publisher: Apress
rating: 8/10
reviewer: Raymond Lodato (rlodato AT yahoo DOT com)
ISBN: 159059441X
summary: A cookbook of useful regular expressions for Perl, Python and more.

Regular expressions are not restricted to just the Perl or shell environments, so Nathan offers variations for Python, PHP, and VIM as well. In most cases the translation is relatively straightforward, but in a few cases a different environment may have (or lack) additional facilities, prompting a different expression to do the same task.

Before you even read chapter 1, Nathan provides a quick summary course on regular expressions, with detail given to each of the five environments you might utilize. He has written the syntax overview in a highly readable format, making it easy to understand the gobbledygook of the most bizarre concoctions you might encounter.

The first chapter (Words and Text) starts simply enough. He gives examples of how to find single words, multiple words, and repeated words, along with examples of how to replace various detected strings with others. In each case he gives an example of its use for each platform, followed by a bit-by-bit breakdown of how it works. Not every environment is given on every example, and in many cases the "How It Works" section refers to the first one, as most REs are identical between the platforms.

The next chapter (URLs and Paths) offers various methods of doing commonly needed parsing. Pulling out file names, query strings, and directories, as well as reconstructing them in useful fashions is covered in the 15 offerings given here. Validating, converting, and extracting fields of CSV and tab-delimited files are handled in chapter 3, while chapter 4 is concerned with validating field formats, as well as re-formatting text for the fields. Chapter 5 handles similar tasks for HTML and XML documents. The final chapter covers expressions that facilitate the management of program code, log files, and the output of selected commands.

First, I must admit that there are a number of useful solutions provided, especially for someone who is concerned with application and web development. However, I did feel a little cheated by the fact that several solutions covered essentially the same task, with only minor variations. It almost seemed as though the author was trying to pad out the solution count to the magic number 100. A simple example: three solutions in chapter one cover (a) replacing smart quotes with straight quotes, (b) replacing copyright symbols with the (c) trigraph, and (c) replacing trademark symbols with the (tm) sequence. In each case, the expression was simply "s/\xhh/ rep /g;". Did we really need three separate solutions for that? I don't think so.
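
To illustrate why those three substitutions are really one task, here is a minimal Python sketch that handles all of them in a single pass. It is not taken from the book: the function name asciify and the replacement table are hypothetical, and it assumes Unicode text (code points such as U+201C) rather than the single-byte \xhh values the book's expressions imply.

```python
import re

# Assumed mapping for illustration: Unicode code points for the symbols
# mentioned in the review, each paired with a plain-ASCII stand-in.
REPLACEMENTS = {
    "\u201c": '"',     # left double smart quote
    "\u201d": '"',     # right double smart quote
    "\u00a9": "(c)",   # copyright sign
    "\u2122": "(tm)",  # trademark sign
}

# One character class covering every key, so a single pass handles them all.
PATTERN = re.compile("[" + "".join(REPLACEMENTS) + "]")


def asciify(text: str) -> str:
    """Replace each listed symbol with its ASCII stand-in."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)


print(asciify("\u201cRecipes\u201d \u00a9 2005 Acme\u2122"))
```

Adding another symbol means adding one entry to the table, which is the sense in which the three recipes differ only in their data.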

Another quibble revolves around some of the coding of the expressions. Nathan has made liberal use of non-capturing groups (that is, (?: expr )) to ensure only the items that needed replacement were captured. While a worthy idea, in some cases the expression could have been simplified for understanding. Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a, making it match too much. Either [[:alpha:]] or [A-Za-z] should have been used.
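
The [A-z] problem is easy to check. The following Python sketch (variable names are illustrative) confirms that the six ASCII characters between Z and a are accepted by [A-z] and rejected by [A-Za-z]. Python's re module does not support the POSIX [[:alpha:]] class, so only the explicit range is shown.

```python
import re

# The six ASCII characters that sit between 'Z' (90) and 'a' (97).
between = "[\\]^_`"

loose = re.compile(r"^[A-z]+$")      # the range used in the book
strict = re.compile(r"^[A-Za-z]+$")  # the range the review recommends

for ch in between:
    assert loose.match(ch)        # wrongly accepted as a "letter"
    assert not strict.match(ch)   # correctly rejected

# An identifier with an underscore passes the loose check only.
print(bool(loose.match("foo_bar")), bool(strict.match("foo_bar")))
```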

Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.


11 of 258 comments

  1. Unacceptable mistakes by gniv · · Score: 5, Interesting
    In a number of expressions, Nathan uses [A-z] to capture all letters.

    How can this be a good book when it makes such mistakes? If this book is for beginners (as it seems) the editing process should have been much better.

    1. Re:Unacceptable mistakes by slim · · Score: 3, Interesting

      BTW. regular expressions present a complete Turing machine.

      Actually no: regular expressions are a great example of a language which is not Turing complete, but is useful nonetheless.

      The classic limitation of regexes is that you can't use them to parse arbitrarily nested brackets -- because there is no concept of a stack. A Turing machine would be able to do this.

      (Researching this post [yes! researching!] I found a couple of mailing list posts from various people suggesting that Perl regexes are Turing complete. If this is true [which I have not established], it's because Perl extends the concept of REs in various ways)
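
The nested-bracket limitation is usually worked around with an explicit stack in ordinary code. A minimal Python sketch follows; the function name brackets_balanced is hypothetical, and it ignores complications such as brackets inside quoted strings.

```python
def brackets_balanced(text: str) -> bool:
    """Return True if (), [] and {} in text are properly nested."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []  # the memory a classical regular expression lacks
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack


print(brackets_balanced("(a[b]{c})"), brackets_balanced("(a[b)]"))
```

The stack can grow without bound as nesting deepens, which is exactly what a finite automaton cannot track.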

  2. Re:Regular expressions in a cookbook? by interiot · · Score: 4, Interesting
    Yup, regular expressions are not capable of a full range of computing... they're pretty close (they're the lowest of four in the Chomsky hierarchy), but still have a few limitations that can't be resolved without wrapping some extra code around them.

    It still boggles my mind that people knew this in 1956 though.

  3. Re:Regular expressions in a cookbook? by pcraven · · Score: 2, Interesting

    This is a cool article on catastrophic backtracking. I remember the first time that got me. It would occasionally cause severe issues on a production server we had. I swung and missed with my reg ex on that one.
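
For readers who have not met catastrophic backtracking, here is a small Python sketch of the textbook case. The patterns are assumptions chosen for illustration, not the expression from the production incident described above. Both accept the same strings, but the nested quantifier gives a backtracking engine many ways to split a run of a's, and it tries them all before giving up on a near-miss input.

```python
import re

nested = re.compile(r"^(a+)+$")  # nested quantifiers: many ways to split a run of a's
flat = re.compile(r"^a+$")       # same set of strings, only one way to match

# On short inputs the two patterns agree.
for s in ["a", "aaaa", "", "aab", "b"]:
    assert bool(nested.match(s)) == bool(flat.match(s))

# A long run of a's followed by a character that forces failure.
probe = "a" * 5000 + "b"

# The flat pattern rejects this quickly. nested.match(probe) is deliberately
# NOT run here: in a backtracking engine the work can grow roughly
# exponentially with the length of the run.
print(flat.match(probe))
```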

  4. Re:Email RegEx by Sir_Real · · Score: 2, Interesting

    I'm still looking for a good email regex

    Well, you asked for it.

    Actually, I asked for it last week, in #linux on freenode. Scary huh?

  5. Re:F*ck this book and all others like it: by yahyamf · · Score: 1, Interesting

    .Net regular expressions can parse from right to left as well. Very useful sometimes

  6. HTML, XML, CSV, but why? by AGTiny · · Score: 3, Interesting

    Of course everyone should know how to build a regex, but why take time discussing how to parse common formats such as HTML, XML, CSV, and so on? Every language likely has a good standard module/library/package that does it all for you, hopefully in the most efficient way, and gives you an easy API. I write Perl, and have used XML::*, HTML::*, DBD::CSV, Text::CSV, the list goes on. No need to write a single regex there. Another good set of modules is Regexp::Common, giving you correct regexes for parsing semi-hard things like IP addresses, MAC addresses, phone numbers, etc.
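
The same point holds outside Perl. As a sketch, here is Python's standard csv module next to a naive split on commas; the sample line is made up for illustration.

```python
import csv
import io

# Made-up sample: a quoted field containing a comma, and one with escaped quotes.
line = 'widget,"Smith, John","said ""hi"""\n'

# Naive approach: split on every comma, ignoring quoting rules.
naive = line.strip().split(",")

# Library approach: csv.reader applies the quoting rules for us.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```

The naive split yields four fragments with stray quote characters, while the reader returns the three intended fields.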

  7. Re:Regular expressions in a cookbook? by prockcore · · Score: 2, Interesting

    I've been doing regex for a long time (over 10 years), and the best rule I can give newbies to follow is "match less, not more"

    Write your regex's so that they generalize as little as possible.

    For example, matching an xml tag use /]+>/ instead of //

    If you're using ".*?" in a regex, you might want to look at rewriting it.. it's almost never needed and almost always causes problems.

  8. Re:Regular expressions in a cookbook? by prockcore · · Score: 2, Interesting

    (damn, I should really preview sometimes)

    The examples I gave are: /<[^>]+>/ instead of /<.*?>/
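
A short Python sketch of the difference. The HTML snippet and the class="x" search are assumptions for illustration. On its own, <.*?> finds the same tags as <[^>]+>, but once more pattern follows it, the lazy dot is free to run past the end of the tag.

```python
import re

html = '<b id="1"><i class="x">text</i></b>'

# Standalone, both forms pick out the same four tags.
print(re.findall(r"<[^>]+>", html))
print(re.findall(r"<.*?>", html))

# Looking for a <b> tag that carries class="x" (there is none in this input).
lazy = re.search(r'<b.*?class="x">', html)
tight = re.search(r'<b[^>]*class="x">', html)

# The lazy dot crosses the first '>' and borrows the attribute from <i>.
print(lazy.group(0))
# The negated class cannot leave the tag, so it correctly finds nothing.
print(tight)
```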

  9. check out regex coach if you want to learn by Anonymous Coward · · Score: 2, Interesting

    I found this tool while doing my undergrad. Having this tool and playing with it showed me how to understand and how to successfully write regexes. 5 minutes of playing with it and you'll be enlightened.

    http://www.weitz.de/regex-coach/

  10. Re:Try them out by c_ollier · · Score: 3, Interesting

    The Regulator is a nice Open Source tool, but Windows only. It integrates expressions from RegExLib.com, and has syntax highlighting & brace matching.