Regular Expression Recipes
Regular expressions are not restricted to just the Perl or shell environments, so Nathan offers variations for Python, PHP, and VIM as well. In most cases the translation is relatively straight-forward, but in a few cases a different environment may have (or lack) additional facilities, prompting a different expression to do the same task.
Before you even read chapter 1, Nathan provides a quick summary course on regular expressions, with detail given to each of the five environments you might utilize. He has written the syntax overview in a highly-readable format, making it easy to understand the gobbledy-gook of the most bizarre concoctions you might encounter.
The first chapter (Words and Text) starts simply enough. He gives examples of how to find single words, multiple words, and repeated words, along with examples of how to replace various detected strings with others. In each case he gives an example of its use for each platform, followed by a bit-by-bit breakdown of how it works. Not every environment is given on every example, and in many cases the "How It Works" section refers to the first one, as most REs are identical between the platforms.
The next chapter (URLs and Paths) offers various methods of doing commonly needed parsing. Pulling out file names, query strings, and directories, as well as reconstructing them in useful fashions is covered in the 15 offerings given here. Validating, converting, and extracting fields of CSV and tab-delimited files are handled in chapter 3, while chapter 4 is concerned with validating field formats, as well as re-formatting text for the fields. Chapter 5 handles similar tasks for HTML and XML documents. The final chapter covers expressions that facilitate the management of program code, log files, and the output of selected commands.
First, I must admit that there are a number of useful solutions provided, especially for someone who is concerned with application and web development. However, I did feel a little cheated by the fact that several chapters covered essentially the same task, with only minor variations. It almost seemed as though the author was trying to pad out the solution count to the magic number 100. A simple example: three solutions in chapter one cover (a) replacing smart quotes with straight quotes, (b) replacing copyright symbols with the (c) tri-graph, and (c) replacing trademark symbols with the (tm) sequence. In each case, the expression was simply "s/\xhh/ rep /g;". Did we really need three separate chapters for that? I don't think so.
Another quibble revolves around some of the coding of the expressions. Nathan has made liberal use of the non-capturing groups (that is, (: expr )) to insure only the items that needed replacement were captured. While a worthy idea, in some cases the expression may have been simplified for understanding. Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a, making it match too much. Either [[:alpha:]] or [A-Za-z] should have been used.
Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.
You can purchase Regular Expression Recipes: A Problem-Solution Approach from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
I was performing a strange custom regular expression on the book review, and discovered that it outputted the following:
"Regex coders are in league with the devil"
Who woulda thunk it!
liqbase
I really liked this book, but
1. the binding broke
2. the index has a lot of typos.
...is the best regular recipe.
Isn't there already enough coverage for Regex's? With all the existing books and the nearly endless availability of free information and sites (including many using the 'recipie' format) online, who will want this book.
Sounds like good eating.
Regular expressions are great, but once you know them and you think you can conquer the world, I find they occasionally let you down. The text editor I was using had a rudementary regular expression search that did not support non-greedy matching. I found that writing a regular expression that finds C style /* comments */ to be quite tricky with only greeding matching. I wrote it up as an article where I build the expression piece by piece showing common things you might try that won't work.
If you want more of a challenge, try writing a regular expression that find any <script></script> tags along with anything in between using only greedy matching. You will find that the length of your regular expression goes up exponentially with the length of your ending condition.
--
Calculator for Converting Currency
I'm not sure I understand what your quibble is - do you dislike the fact that he uses non-capturing groups, or the fact that he disposes of them at certain points?
Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a, making it match too much. Either [[:alpha:]] or [A-Za-z] should have been used.
This seems like a relatively novice mistake, and I'm surprised it would show up in a book on regular expressions.
Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.
It's nice that he covers five environments for regular expressions. I'm sure everyone has heard of Mastering Regular Expressions, published by O'Reilly. The Perl Cookbook also does a good job at solving common problems with Regular expressions.
This is just my opinion, but I think what the world needs is a book on Regular Expression Design Patterns.
How can this be a good book when it makes such mistakes? If this book is for beginners (as it seems) the editing process should have been much better.
I can relate. I have cookbooks for food that have all these recipes that are nothing but flour, butter, eggs, and sugar. Do we need all these recipes for pancakes, cupcakes, cookies, crepes, waffles, popovers, bread, quick bread, bread sticks? Won't people figure out eventually to put a little less sugar in waffles with savory ingredients?
Japanese cookbooks are even worse. Soy sauce, sake, mirin...boooooooring!
...use 'Mastering Regular Expressions . It's a good book on the topic as well.
While I can't vouch for the quality of the reviewed book,if you want something definitive on regular expressions, Mastering Regular Expressions, Second Edition by Jeffrey E. F. Friedl is an absolute must for your professional library. Jeffrey breaks down and then builds back up what regular expressions are and how they work, and offers an entire matrix breakout of the slightly different implementations among the most common utilities (grep, sed, awk, perl...). Not to shill for amazon, but if you select the reviewed book, the "buy this book too, and you get this great price" deal actually includes the Mastering Regular Expressions, Second Edition. . Get 'em both, you won't be sorry.
Some people, when confronted with a problem, think ``I know, I'll use regular expressions.'' Now they have two problems.
Jamie Zawinski
I'm feeling a bit verklempt!
Talk amongst yourselves!
Alright, I'll give you a tawpic:
"Regular Expressions are neither regular nor expressions."
Discuss.
Um, last time I checked, Reg. Exp's are not turing complete. Take the expression O^n 1^n, which can be made by Turing machines. If you can make that for me using a Regular Expression, you deserve a Turing Award. Regular expressions are DFA/NFA complete, not turing complete... not even close!
Regular expressions are probably the first Turing-complete language to be encapsulated in another Turing-complete language (C).
Don't you just love to sound like a StarTrek character, with all that fancy terminology?
Go look up your complexity book - if you have one - regexes are not even close to Turing-complete.
Anyone have any good recipies for [cookies]+ ?
I'm still looking for a good email regex
Well, you asked for it.
Actually, I asked for it last week, in #linux on freenode. Scary huh?
Anyone who drops in regularly on a Perl discussion forum (like perlmonks.org) knows that programmers tend to over-use regular expressions.
Regexes are actually a pretty poor way to extract information from comma-delimited or tab-delimited files, for example. By the time you're done dealing with escaped commas, escaped tabs, quoting characters (which many CSV and TDT exporters use in addition to commas and tabs), escaped quote characters, escaped newlines, and escaped escape chars, you end up with a super-complicated regex.
HTML is even more complicated. You have HTML comments and nested tags on top of everything else.
To validate a simple email address, Jeffrey Friedl in his Mastering Regular Expressions book for O'Reilly writes an *11-page* regex.
Most of the time the correct answer is not "here is a regex recipe" but rather "here is a simple library to do the job property with a parser", like Text::CSV or HTML::Parser in perl.
Regex Coach
This program assists you building regular expressions. I've never used it (real men code regexp at once and it works). But some friends recommend it.
I have a suggestion. Write a few regular expressions to get your brain refreshed on them, then go read this excellent article on how regular expressions work. At the very least, it will clear some confusing things up. Most likely you'll find that having a better understanding of the underlying concepts will make it easier for you to work with regular expressions day to day.
Also, it helps if you are familiar with finite state machines. I learned about them in a couple classes while getting my CS degree, but they're not that hard and most people should be able to grasp them without any kind of formal CS training.
They may be kind of hard to get used to, but not has hard as writing, debugging and maintaining a dozen or more lines of custom string parsing code for each case where you would use one.
A good starting point is to understand finite automata and regular languages first. See http://en.wikipedia.org/wiki/Automata_theory/ for a good first reference on automata. If you can grok automata, regular expressions will click with you.
perl -e "eval pack(q{H*},join q{},qw{70 72696e74207061636b28717b482a7d2c717b343 637323635363534323533343430617d293b})"
In an average month, I use regular expressions as implemented in Microsoft Visual C++ 6.0, BBEdit Lite, TextWrangler, Apple MPW, and REALBasic. Every single one of them has _significant_ differences in syntax and semantics.
My understanding is that even the UNIX world sports several different flavors of regular expression in grep, egrep, fgrep, etc.
The biggest barrier to _my_ use of regular expressions is that every time I switch from one regular expression context to another, it takes me a good half hour to refresh my memory of what does and doesn't work in each environment.
"How to Do Nothing," kids activities, back in print!
Of course everyone should know how to build a regex, but why take time discussing how to parse common formats such as HTML, XML, CSV, and so on? Every language likely has a good standard module/library/package that does it all for you, hopefully in the most efficient way, and gives you an easy API. I write Perl, and have used XML::*, HTML::*, DBD::CSV, Text::CSV, the list goes on. No need to write a single regex there. Another good set of modules is Regexp::Common, giving you correct regexes for parsing semi-hard things like IP addresses, MAC addresses, phone numbers, etc.
This is free... And interactive...
http://www.regexlib.com/
Sometimes, with complex regexp's, it's handy to be able to build them incrementally. I know it's just one of many, but I wrote a little tool that's handy for this. It's called regexpviewer, and it's available here:
h tml
http://www.dedasys.com/freesoftware/applications.
Perhaps other people can recommend other tools they've found useful for learning/building regular expressions.
http://www.welton.it/davidw/
My very favorite recipe book is a tiny little thing of about 40 pages. For each kind of meat and each kind of vegetable, it lists what spices and sauces go well with it, how long and how hot to cook it, and how to tell when it is done. There is a little section on how to make about a dozen differnet sauces. That's it.
A programming language has syntax and semantics. For regular expressions, Chomsky gave both fully in his original paper on the subject. The added conveniences that some utilities provide are all listed in their respective man pages. The entire subject, if it were collected together, should be about 10 pages. With some explanation of language theory, grammars, and such, the whole might be worth a chapter. Get out an undergraduate compiler-theory book (such as Aho/Sethi/Ullman). They have less than a chapter on regular expressions, and they cover the topic fairly well.
But, I suppose, there is a difference between a cookbook that is made for cooks to use as a reference, and a cookbook that is made for non-cooks to follow by rote. Learn how to cook. You will be surprised how seldom you actually refer to the 1000+ page cookbooks.
If you really want to understand regexes, get Jeffrey E. E. Friedl's "Mastering Regular Expressions" from O'Reilly. It's much deeper than the casaul reader will ever need, but if you get through it you will certainly know how regexes work from both a user perspective and from a regex engine perspective.
"The legitimate powers of government extend only to such acts as are injurious to others." Thomas Jefferson.
duh! Repeat after me: HTML is not a regular language. There is no regular expression that can match it. The problem arises when people try to use regular expressions without understanding what they are. But, as the saying goes, when the only tool you have is a hammer, everything looks like a nail...
___
If you think big enough, you'll never have to do it.
I found this tool while doing my undergrad. Having this tool and playing with it showed me how to understand and how to sucessfully write regexs. 5 minutes of playing with it and you be enlightened.
http://www.weitz.de/regex-coach/
In addition to a good book, or even INSTEAD of a good book, download and use THE REGEX COACH
http://www.weitz.de/regex-coach/
It is a very very nice interactive pgm that lets you debug REGEXES on the fly visually, by feeding them sample text.
Of course, if you use the one true text editor, all you need to know about regular expressions is:
:)
There is no need to use a SlashDot sig for SEO...
I had a brief skirmish with the tech book publishing industrty (and believe me thats the right word). The real problem is they pay authors BY THE PAGE so their incentive is to write flowery, lengthy language which conveys as little information in as much space as possible. This in turn justifies high book prices and higher author royalties.
Religion is a gateway psychosis. -- Dave Foley