Regular Expression Recipes


Posted by timothy on Tuesday March 22, 2005 @07:45AM from the prune-talkin' dept.

r3lody writes "If you spend time writing applications that have to do pattern matches and/or replacements, you know about some of the intricacies of regular expressions. For many people they can be an arcane hodgepodge of odd characters that somehow manage to do wonderful things, but they don't have enough time (or interest) to really understand how to code them. Nathan A. Good has written Regular Expression Recipes: A Problem-Solution Approach for those people. In its relatively slim 289 pages, he offers 100 regular expressions in a cookbook format, tailored to solve problems in one of six broad categories (Words and Text, URLs and Paths, CSV and Tab-Delimited Files, Formatting and Validating, HTML and XML, and Coding and Using Commands)." Read on for the rest of Lodato's review.

Regular Expression Recipes: A Problem-Solution Approach
author: Nathan A. Good
pages: 289
publisher: Apress
rating: 8/10
reviewer: Raymond Lodato (rlodato AT yahoo DOT com)
ISBN: 159059441X
summary: A cookbook of useful regular expressions for Perl, Python and more.

Regular expressions are not restricted to just the Perl or shell environments, so Nathan offers variations for Python, PHP, and Vim as well. In most cases the translation is relatively straightforward, but in a few cases a different environment may have (or lack) additional facilities, prompting a different expression to do the same task.

Before you even read chapter 1, Nathan provides a quick summary course on regular expressions, with detail given to each of the five environments you might use. He has written the syntax overview in a highly readable format, making it easy to understand the gobbledygook of the most bizarre concoctions you might encounter.

The first chapter (Words and Text) starts simply enough. He gives examples of how to find single words, multiple words, and repeated words, along with examples of how to replace various detected strings with others. In each case he gives an example of its use for each platform, followed by a bit-by-bit breakdown of how it works. Not every environment is covered in every example, and in many cases the "How It Works" section refers back to the first one, as most REs are identical between the platforms.
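To give a flavor of the repeated-word task, here is a minimal Python sketch. The pattern and the function names are my own assumptions for illustration, not the book's recipe; it relies on a backreference (\1) to require that the second word equal the first.

```python
import re

# Hypothetical pattern (not taken from the book): a word, some whitespace,
# then the same word again. \1 is a backreference to the first group.
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def find_repeated_words(text):
    """Return each word that appears twice in a row."""
    return [m.group(1) for m in REPEATED_WORD.finditer(text)]


def collapse_repeated_words(text):
    """Replace each doubled word with a single copy."""
    return REPEATED_WORD.sub(r"\1", text)


if __name__ == "__main__":
    sample = "This is is a test of the the pattern"
    print(find_repeated_words(sample))
    print(collapse_repeated_words(sample))
```

The \b anchors keep the pattern from firing on word fragments, such as the "is" inside "This".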

The next chapter (URLs and Paths) offers various methods of doing commonly needed parsing. Pulling out file names, query strings, and directories, as well as reconstructing them in useful fashions is covered in the 15 offerings given here. Validating, converting, and extracting fields of CSV and tab-delimited files are handled in chapter 3, while chapter 4 is concerned with validating field formats, as well as re-formatting text for the fields. Chapter 5 handles similar tasks for HTML and XML documents. The final chapter covers expressions that facilitate the management of program code, log files, and the output of selected commands.
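As a rough idea of what pulling a URL apart with a regular expression looks like, here is a Python sketch. The pattern, the group names, and the example.com URL are assumptions of mine, not the book's recipes, and the pattern deliberately handles only simple URLs.

```python
import re

# Hypothetical pattern for simple URLs only; group names are illustrative.
URL_PARTS = re.compile(
    r"^(?P<scheme>[a-z]+)://"          # scheme, e.g. http
    r"(?P<host>[^/?#]+)"               # host name
    r"(?P<directory>/(?:[^/?#]*/)*)?"  # directory part, ending in a slash
    r"(?P<filename>[^/?#]*)"           # file name (may be empty)
    r"(?:\?(?P<query>[^#]*))?"         # query string without the '?'
)


def split_url(url):
    """Return a dict of URL pieces, or None if the URL does not match."""
    match = URL_PARTS.match(url)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(split_url("http://example.com/docs/guide/index.html?lang=en&page=2"))
```

For anything beyond simple cases, Python's standard urllib.parse module is likely the safer choice.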

First, I must admit that there are a number of useful solutions provided, especially for someone who is concerned with application and web development. However, I did feel a little cheated by the fact that several recipes covered essentially the same task, with only minor variations. It almost seemed as though the author was trying to pad out the solution count to the magic number 100. A simple example: three solutions in chapter one cover (a) replacing smart quotes with straight quotes, (b) replacing copyright symbols with the "(c)" trigraph, and (c) replacing trademark symbols with the "(tm)" sequence. In each case, the expression was simply "s/\xhh/ rep /g;". Did we really need three separate recipes for that? I don't think so.
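To illustrate why these three tasks feel like one, here is a Python sketch that handles all of them with a single table-driven substitution. The table, the names, and the choice of Unicode code points are my assumptions; the book's recipes use a \xhh escape for whatever byte encoding it targets.

```python
import re

# Hypothetical replacement table; the code points are an assumption about
# which characters the three recipes were aimed at.
REPLACEMENTS = {
    "\u201c": '"',     # left double smart quote
    "\u201d": '"',     # right double smart quote
    "\u00a9": "(c)",   # copyright sign
    "\u2122": "(tm)",  # trademark sign
}

# One character class covering every key in the table.
SPECIAL = re.compile("[" + "".join(REPLACEMENTS) + "]")


def to_plain_ascii(text):
    """Replace each special character with its plain-text stand-in."""
    return SPECIAL.sub(lambda m: REPLACEMENTS[m.group(0)], text)


if __name__ == "__main__":
    print(to_plain_ascii("\u201cRecipes\u201d \u00a9 2005 Apress\u2122"))
```

Adding a fourth symbol means adding one line to the table, not writing a fourth expression.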

Another quibble revolves around some of the coding of the expressions. Nathan has made liberal use of non-capturing groups (that is, (?: expr )) to ensure only the items that needed replacement were captured. While a worthy idea, in some cases the expression could have been simplified for understanding. Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a in ASCII, so the class matches too much. Either [[:alpha:]] or [A-Za-z] should have been used.
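Both points are easy to check. The Python sketch below uses names and sample strings of my own choosing; it shows that [A-z] accepts the six stray ASCII characters and that a (?: ... ) group does not add to the captured groups. Python's re module has no POSIX [[:alpha:]] class, so only [A-Za-z] is shown.

```python
import re

TOO_WIDE = re.compile(r"[A-z]")         # spans ASCII 65..122, strays included
LETTERS_ONLY = re.compile(r"[A-Za-z]")  # two separate ranges, letters only

# The six ASCII characters that sit between 'Z' and 'a'.
BETWEEN_Z_AND_A = "[\\]^_`"


def matched_chars(pattern, chars):
    """Return the characters in chars that the pattern accepts."""
    return [c for c in chars if pattern.fullmatch(c)]


# Non-capturing group: (?:Mr|Ms) groups the alternatives without capturing,
# so only the surname ends up in the match's groups.
TITLE_AND_NAME = re.compile(r"(?:Mr|Ms)\. (\w+)")

if __name__ == "__main__":
    print(matched_chars(TOO_WIDE, BETWEEN_Z_AND_A))
    print(matched_chars(LETTERS_ONLY, BETWEEN_Z_AND_A))
    print(TITLE_AND_NAME.match("Mr. Good").groups())
```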

Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.


9 of 258 comments

Another one? by cmstremi · 2005-03-22 07:50 · Score: 2, Insightful

Isn't there already enough coverage of regexes? With all the existing books and the nearly endless availability of free information and sites (including many using the 'recipe' format) online, who will want this book?
Re:Unacceptable mistakes by tehshen · 2005-03-22 08:10 · Score: 2, Insightful

[A-z] accepts all characters from A to z, including [ \ ] ^ _ and `. You want [A-Za-z], or \w if digits and underscore are acceptable too (the latter for 'not punctuation').

--
Guy asked me for a quarter for a cup of coffee. So I bit him.
Re:Regular expressions in a cookbook? by interiot · 2005-03-22 08:17 · Score: 3, Insightful

You mean all the sections of the perl regexp manual that say "WARNING: This extended regular expression feature is considered highly experimental, and may be changed or deleted without notice" and then go on to say things that make my head truly ache?
I personally treat this like I do Perl5 threads... as something to be afraid of, and hopeful that things will be much improved in Perl 6.
Re:Regexes are overused by Black+Perl · 2005-03-22 08:29 · Score: 3, Insightful

Yes, exactly. Any good book on Regexes should have a chapter on when NOT to use them.

I see many people trying to use regexes to do parsing, when they should be using a specialized parser.

--
bp
Try them out by DavidNWelton · 2005-03-22 08:52 · Score: 5, Insightful

Sometimes, with complex regexps, it's handy to be able to build them incrementally. I know it's just one of many, but I wrote a little tool that's handy for this. It's called regexpviewer, and it's available here:

http://www.dedasys.com/freesoftware/applications.html

Perhaps other people can recommend other tools they've found useful for learning/building regular expressions.

--
http://www.welton.it/davidw/
About 279 pages too long by natoochtoniket · 2005-03-22 08:57 · Score: 4, Insightful

I have a huge, 1000+ page Betty Crocker cookbook which I hardly ever use. It gives detailed recipes for particular dishes, but nothing that helps me to just throw a dinner together. And nothing that helps me to create anything new.
My very favorite recipe book is a tiny little thing of about 40 pages. For each kind of meat and each kind of vegetable, it lists what spices and sauces go well with it, how long and how hot to cook it, and how to tell when it is done. There is a little section on how to make about a dozen different sauces. That's it.
A programming language has syntax and semantics. For regular expressions, Chomsky gave both fully in his original paper on the subject. The added conveniences that some utilities provide are all listed in their respective man pages. The entire subject, if it were collected together, should be about 10 pages. With some explanation of language theory, grammars, and such, the whole might be worth a chapter. Get out an undergraduate compiler-theory book (such as Aho/Sethi/Ullman). They have less than a chapter on regular expressions, and they cover the topic fairly well.
But, I suppose, there is a difference between a cookbook that is made for cooks to use as a reference, and a cookbook that is made for non-cooks to follow by rote. Learn how to cook. You will be surprised how seldom you actually refer to the 1000+ page cookbooks.
Re:Regexes are overused by Alan+Shutko · 2005-03-22 11:15 · Score: 2, Insightful

"To validate a simple email address, Jeffrey Friedl in his Mastering Regular Expressions book for O'Reilly writes an *11-page* regex."

That's not quite fair. That regex validates any RFC822 address, and the syntax allowed isn't simple. Validating things that are currently used is fairly easy, but there's a lot of historical baggage in RFC822 addressing.
Re:Regexes are overused by 2short · 2005-03-22 15:30 · Score: 3, Insightful

"an *11-page* regex."

That's insane. My feelings on regexes were set early in my career. I discovered them, and like many started using them everywhere. Then in a code review, my boss pointed to one particularly complex one and said, "See, there's why you shouldn't try to do such complex things with regular expressions, this one has a bug." "Where?" says me. "Let's leave that as an exercise for the student. Come ask me if you can't figure it out in an hour or so." Well, I certainly wasn't going to admit defeat, even though it took me several hours to find the rather subtle problem. So I went back and demanded to know how he had spotted it so fast. And he said, "I didn't. It was a regex 3 lines long. It had to have a bug."
Re:Typical... by Monkelectric · 2005-03-22 17:20 · Score: 3, Insightful

I had a brief skirmish with the tech book publishing industry (and believe me, that's the right word). The real problem is they pay authors BY THE PAGE, so their incentive is to write flowery, lengthy language which conveys as little information in as much space as possible. This in turn justifies high book prices and higher author royalties.

--
Religion is a gateway psychosis. -- Dave Foley