Regular Expression Recipes
Regular expressions are not restricted to just the Perl or shell environments, so Nathan offers variations for Python, PHP, and VIM as well. In most cases the translation is relatively straightforward, but in a few cases a different environment may have (or lack) additional facilities, prompting a different expression to do the same task.
Before you even read chapter 1, Nathan provides a quick summary course on regular expressions, with detail given to each of the five environments you might utilize. He has written the syntax overview in a highly readable format, making it easy to understand the gobbledygook of the most bizarre concoctions you might encounter.
The first chapter (Words and Text) starts simply enough. He gives examples of how to find single words, multiple words, and repeated words, along with examples of how to replace various detected strings with others. In each case he gives an example of its use for each platform, followed by a bit-by-bit breakdown of how it works. Not every environment is given on every example, and in many cases the "How It Works" section refers to the first one, as most REs are identical between the platforms.
The next chapter (URLs and Paths) offers various methods of doing commonly needed parsing. Pulling out file names, query strings, and directories, as well as reconstructing them in useful fashions is covered in the 15 offerings given here. Validating, converting, and extracting fields of CSV and tab-delimited files are handled in chapter 3, while chapter 4 is concerned with validating field formats, as well as re-formatting text for the fields. Chapter 5 handles similar tasks for HTML and XML documents. The final chapter covers expressions that facilitate the management of program code, log files, and the output of selected commands.
First, I must admit that there are a number of useful solutions provided, especially for someone who is concerned with application and web development. However, I did feel a little cheated by the fact that several recipes covered essentially the same task, with only minor variations. It almost seemed as though the author was trying to pad out the solution count to the magic number 100. A simple example: three solutions in chapter one cover (a) replacing smart quotes with straight quotes, (b) replacing copyright symbols with the (c) trigraph, and (c) replacing trademark symbols with the (tm) sequence. In each case, the expression was simply "s/\xhh/ rep /g;". Did we really need three separate recipes for that? I don't think so.
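To illustrate why those three recipes feel like one, here is a minimal Python sketch of a single table-driven substitution. It assumes the text has already been decoded to Unicode (the book's \xhh escapes address byte values instead), and the REPLACEMENTS table and the asciify name are made up for this illustration, not taken from the book.

```python
import re

# Hypothetical replacement table: each key is one character to replace.
REPLACEMENTS = {
    "\u2018": "'",    # left single smart quote
    "\u2019": "'",    # right single smart quote
    "\u201c": '"',    # left double smart quote
    "\u201d": '"',    # right double smart quote
    "\u00a9": "(c)",  # copyright sign
    "\u2122": "(tm)", # trademark sign
}

# One character class covering every key in the table.
PATTERN = re.compile("[" + "".join(REPLACEMENTS) + "]")


def asciify(text):
    """Replace each listed character with its plain-ASCII stand-in."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)


print(asciify("\u201cRegex\u201d \u00a9 2005 Acme\u2122"))
```

Adding a fourth symbol means adding one line to the table rather than writing a new expression.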
Another quibble revolves around some of the coding of the expressions. Nathan has made liberal use of non-capturing groups (that is, (?: expr )) to ensure only the items that needed replacement were captured. While a worthy idea, in some cases the expression could have been simplified for understanding. Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a, making it match too much. Either [[:alpha:]] or [A-Za-z] should have been used.
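Both points are easy to check. The following Python sketch shows the six ASCII characters that [A-z] lets through, and what a non-capturing group does to the captured results. It uses [A-Za-z] for the corrected class because Python's re module does not support POSIX bracket classes such as [[:alpha:]]; the sample strings are invented for the demonstration.

```python
import re

# The characters that sit between 'Z' and 'a' in ASCII.
between = "".join(chr(c) for c in range(ord("Z") + 1, ord("a")))

too_wide = re.compile(r"[A-z]")
letters_only = re.compile(r"[A-Za-z]")

# Characters from that gap that [A-z] wrongly accepts.
leaked = [ch for ch in between if too_wide.fullmatch(ch)]
# Characters from that gap that [A-Za-z] correctly rejects.
rejected = [ch for ch in between if not letters_only.fullmatch(ch)]

print("in the gap:", between)
print("accepted by [A-z]:", leaked)
print("rejected by [A-Za-z]:", rejected)

# (?:...) groups the alternatives without capturing them,
# so only the surname ends up in the result.
title_and_name = re.search(r"(?:Mr|Ms)\. (\w+)", "Ms. Lovelace")
print(title_and_name.groups())
```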
Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.
You can purchase Regular Expression Recipes: A Problem-Solution Approach from bn.com.
I was performing a strange custom regular expression on the book review, and discovered that it outputted the following:
"Regex coders are in league with the devil"
Who woulda thunk it!
liqbase
Regular expressions are nice and all, but I still can't get used to them. A good manual should be kept handy at all times.
Visit Lafayette Linux Users Group at http://lug.lafayette.edu. Suggestions are welcome.
I really liked this book, but
1. the binding broke
2. the index has a lot of typos.
...is the best regular recipe.
Isn't there already enough coverage of regexes? With all the existing books and the nearly endless availability of free information and sites online (including many using the 'recipe' format), who will want this book?
Sounds like good eating.
Regular expressions are great, but once you know them and think you can conquer the world, I find they occasionally let you down. The text editor I was using had a rudimentary regular expression search that did not support non-greedy matching. I found that writing a regular expression that finds C-style /* comments */ is quite tricky with only greedy matching. I wrote it up as an article where I build the expression piece by piece, showing common things you might try that won't work.
If you want more of a challenge, try writing a regular expression that finds any <script></script> tags along with anything in between using only greedy matching. You will find that the length of your regular expression goes up exponentially with the length of your ending condition.
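For the C comment case, one well-known pattern that gets by with greedy quantifiers only can be sketched in Python as follows. This is not necessarily the expression from the article mentioned above, and the sample source string is invented. The (?:...) group is used only so that findall returns whole matches; an engine without it could use a plain group.

```python
import re

# Greedy-only C comment matcher:
#   /\*                   the opening delimiter
#   [^*]*\*+              text up to and including a run of stars
#   (?:[^/*][^*]*\*+)*    if that run was not followed by '/', keep going
#   /                     the closing slash
C_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

source = "int a; /* one */ int b; /* two ** stars **/ int c; /***/"
comments = C_COMMENT.findall(source)
for comment in comments:
    print(comment)
```

The trick is that no sub-pattern is ever allowed to consume the closing */, so greediness cannot run past the end of the comment.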
--
Calculator for Converting Currency
I'm still looking for a good email regex, one that checks all forms of email addresses, including all the TLDs, and all the other various complicated forms email addresses can take.
Unless of course you count machine language interactions with higher-level languages they implement, but I'm not. :)
Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.
I'm not sure I understand what your quibble is - do you dislike the fact that he uses non-capturing groups, or the fact that he disposes of them at certain points?
Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a, making it match too much. Either [[:alpha:]] or [A-Za-z] should have been used.
This seems like a relatively novice mistake, and I'm surprised it would show up in a book on regular expressions.
Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.
It's nice that he covers five environments for regular expressions. I'm sure everyone has heard of Mastering Regular Expressions, published by O'Reilly. The Perl Cookbook also does a good job at solving common problems with Regular expressions.
This is just my opinion, but I think what the world needs is a book on Regular Expression Design Patterns.
How can this be a good book when it makes such mistakes? If this book is for beginners (as it seems) the editing process should have been much better.
I can relate. I have cookbooks for food that have all these recipes that are nothing but flour, butter, eggs, and sugar. Do we need all these recipes for pancakes, cupcakes, cookies, crepes, waffles, popovers, bread, quick bread, bread sticks? Won't people figure out eventually to put a little less sugar in waffles with savory ingredients?
Japanese cookbooks are even worse. Soy sauce, sake, mirin...boooooooring!
...use 'Mastering Regular Expressions'. It's a good book on the topic as well.
While I can't vouch for the quality of the reviewed book, if you want something definitive on regular expressions, Mastering Regular Expressions, Second Edition by Jeffrey E. F. Friedl is an absolute must for your professional library. Jeffrey breaks down and then builds back up what regular expressions are and how they work, and offers an entire matrix breakout of the slightly different implementations among the most common utilities (grep, sed, awk, perl...). Not to shill for Amazon, but if you select the reviewed book, the "buy this book too, and you get this great price" deal actually includes Mastering Regular Expressions, Second Edition. Get 'em both, you won't be sorry.
Some people, when confronted with a problem, think ``I know, I'll use regular expressions.'' Now they have two problems.
Jamie Zawinski
Why can't a book review for an available book include the COVER PRICE? /. editors should reject these reviews if they omit the cover price.
from http://datamystic.com/
it has easy patterns:
http://datamystic.com/easypatterns.html
I used easy patterns in a project and the language is like an extra layer on top of regex making it simpler. Maybe the proprietary nature of easy patterns isn't great but there are some free tools that do conversions into Perl patterns.
Every now and then, (like once or twice a year) I can benefit from using regular expressions. It isn't worth my while to spend a lot of time learning the really arcane stuff that I need to know to use them. It's usually easier to find another way around the problem.
On the other hand, if someone produced a tool that can take any idiot (me for instance) through a step by step process that doesn't require a lot of prior knowledge and gets the job done; then I'd get really excited. For sure, I won't be reading this book; the effort will never repay itself.
I'm feeling a bit verklempt!
Talk amongst yourselves!
Alright, I'll give you a tawpic:
"Regular Expressions are neither regular nor expressions."
Discuss.
Anyone have any good recipes for [cookies]+ ?
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski alt.religion.emacs 1997/08/12
Anyone who drops in regularly on a Perl discussion forum (like perlmonks.org) knows that programmers tend to over-use regular expressions.
Regexes are actually a pretty poor way to extract information from comma-delimited or tab-delimited files, for example. By the time you're done dealing with escaped commas, escaped tabs, quoting characters (which many CSV and TDT exporters use in addition to commas and tabs), escaped quote characters, escaped newlines, and escaped escape chars, you end up with a super-complicated regex.
HTML is even more complicated. You have HTML comments and nested tags on top of everything else.
To validate a simple email address, Jeffrey Friedl in his Mastering Regular Expressions book for O'Reilly writes an *11-page* regex.
Most of the time the correct answer is not "here is a regex recipe" but rather "here is a simple library to do the job properly with a parser", like Text::CSV or HTML::Parser in Perl.
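The same point holds outside Perl. As a rough Python analogue of the Text::CSV suggestion, this sketch compares a naive split on commas with the standard library's csv module on one invented line containing a quoted comma and doubled quote characters.

```python
import csv
import io

# One record with a quoted comma and an escaped (doubled) quote.
line = 'Widget,"Smith, John","He said ""hi""",42\n'

# Naive approach: split on every comma, quoted or not.
naive = line.rstrip("\n").split(",")

# Parser approach: csv.reader handles the quoting rules itself.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields five pieces with stray quote characters, while the parser returns the four intended fields.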
All you need is regexlib.com and a copy of Regulator (I believe that's the free-as-in-beer one), which will break a regex out into English steps like "capture (", "capture 3 or more 0's", and so on. .NET has a regex facility that's slicker than greased pigeon shit, so I've been making heavy use of it lately.
I don't need no instructions to know how to rock!!!!
way to ask a question that would certainly cause at least posts to be moderated as 'Redundant'!
You can't handle the truth.
Regex Coach
This program assists you building regular expressions. I've never used it (real men code regexp at once and it works). But some friends recommend it.
It's \. not /.
=P
I can see a lot of mod points wasted here to mark all these comments (but the first) redundant.
The problem is, you load the page, read, and by the time you reply there are already others that replied the same thing.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (source)
The vi-style regular-expression substitution technique might help: :-)
"If you spend time working writing applications that have to do pattern matches and/or replacements, you know about some of the intricacies of \(regular expressions\). For \(many people\) \1 can be an arcane hodgepodge of odd characters that somehow manage to do wonderful things, but \2 don't have enough time (or interest) to really understand how to code \1."
In an average month, I use regular expressions as implemented in Microsoft Visual C++ 6.0, BBEdit Lite, TextWrangler, Apple MPW, and REALBasic. Every single one of them has _significant_ differences in syntax and semantics.
My understanding is that even the UNIX world sports several different flavors of regular expression in grep, egrep, fgrep, etc.
The biggest barrier to _my_ use of regular expressions is that every time I switch from one regular expression context to another, it takes me a good half hour to refresh my memory of what does and doesn't work in each environment.
"How to Do Nothing," kids activities, back in print!
Of course everyone should know how to build a regex, but why take time discussing how to parse common formats such as HTML, XML, CSV, and so on? Every language likely has a good standard module/library/package that does it all for you, hopefully in the most efficient way, and gives you an easy API. I write Perl, and have used XML::*, HTML::*, DBD::CSV, Text::CSV, the list goes on. No need to write a single regex there. Another good set of modules is Regexp::Common, giving you correct regexes for parsing semi-hard things like IP addresses, MAC addresses, phone numbers, etc.
> .Net regular expressions can parse from right to left as well.
> Very useful sometimes
Yeah, especially for parsing Hebrew text. HTH.HAND.
Cut that out, or I will ship you to Norilsk in a box.
I was hoping for an innovatively written cookbook for geeks (shell scripts to describe how to make a white sauce, that kinda thing). That would have made a fantastic gag gift.
Stasis is death. Embrace change.
I don't think I'm alone in saying (having spent plenty of time on freenode #sed) that of the many regexes I have had to formulate, only about 5% are really reusable. Most of the time it's "get some info from file X into file Y" or "make odd file X pretty". So I could buy this book, but then I would have 95 more examples of regexes to toss out.
This is free... And interactive...
http://www.regexlib.com/
Sometimes, with complex regexp's, it's handy to be able to build them incrementally. I know it's just one of many, but I wrote a little tool that's handy for this. It's called regexpviewer, and it's available here:
http://www.dedasys.com/freesoftware/applications.html
Perhaps other people can recommend other tools they've found useful for learning/building regular expressions.
http://www.welton.it/davidw/
My very favorite recipe book is a tiny little thing of about 40 pages. For each kind of meat and each kind of vegetable, it lists what spices and sauces go well with it, how long and how hot to cook it, and how to tell when it is done. There is a little section on how to make about a dozen different sauces. That's it.
A programming language has syntax and semantics. For regular expressions, Chomsky gave both fully in his original paper on the subject. The added conveniences that some utilities provide are all listed in their respective man pages. The entire subject, if it were collected together, should be about 10 pages. With some explanation of language theory, grammars, and such, the whole might be worth a chapter. Get out an undergraduate compiler-theory book (such as Aho/Sethi/Ullman). They have less than a chapter on regular expressions, and they cover the topic fairly well.
But, I suppose, there is a difference between a cookbook that is made for cooks to use as a reference, and a cookbook that is made for non-cooks to follow by rote. Learn how to cook. You will be surprised how seldom you actually refer to the 1000+ page cookbooks.
duh! Repeat after me: HTML is not a regular language. There is no regular expression that can match it. The problem arises when people try to use regular expressions without understanding what they are. But, as the saying goes, when the only tool you have is a hammer, everything looks like a nail...
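A small Python sketch of what that means in practice: a non-greedy regex stops at the first closing tag and so cuts a nested element short, while a parser that keeps a counter (the state a regular expression lacks) tracks nesting correctly. The DepthTracker class, the max_div_depth helper, and the sample markup are all invented for this illustration; html.parser is Python's standard library module.

```python
import re
from html.parser import HTMLParser

html = "<div><div><div>x</div></div></div><div>y</div>"

# The regex stops at the first </div>, splitting the nested element.
regex_guess = re.search(r"<div>.*?</div>", html).group(0)
print(regex_guess)


class DepthTracker(HTMLParser):
    """Counts how deeply <div> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


def max_div_depth(markup):
    """Return the deepest <div> nesting level found in markup."""
    tracker = DepthTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.max_depth


print(max_div_depth(html))
```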
___
If you think big enough, you'll never have to do it.
Whenever I need to use some regex, I google for a regex reference and try to figure out how to do what I want to do. Then the next time I need to use regex, I have to do it again. I literally cannot hold regex in my head for more than a day or so.
Make me a friend and I'll mod you up
This seems to be typical for tech books: way overpriced (although this one seems more reasonable), incredibly crappy binding, and less than aggressive proofreading.
"Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
I found this tool while doing my undergrad. Having this tool and playing with it showed me how to understand and how to successfully write regexes. 5 minutes of playing with it and you'll be enlightened.
http://www.weitz.de/regex-coach/
Let me correct your sentence
>I'm too stupid to
should be
"I'm too stupid to write a proper book review."
In addition to a good book, or even INSTEAD of a good book, download and use THE REGEX COACH
http://www.weitz.de/regex-coach/
It is a very very nice interactive pgm that lets you debug REGEXES on the fly visually, by feeding them sample text.
Your Syntax Highlighting library for Java rocks, thanks a million.
Your hybrid is not saving the environment. Its purpose is to make you feel good about buying something.
This free tool is great for helping you to write regular expressions: http://www.weitz.de/regex-coach/
I would like to share my regex religion with the other programmers where I work, but can't get our training department psyched-up enough to find someone to teach a class.
The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
REs? YUMMY!
Life is like REs, you never know what you get... Great Fun!
Yes, I'm drunk. Uh, nevermind...
Rather than relying only on regular expressions, it would be beneficial to use regexps along with sed/(g)awk/perl. If the incantation that you use using regexps is obscure to you, how will the next guy who will support your stuff feel? Break up your uber regexp into a simpler combination of regexp(grep)/sed/awk combination.
With that, I almost always use anchoring via ^ or $.
No discussion of regex books is complete without mentioning the best one out there: O'Reilly's "Mastering Regular Expressions".
11*43+456^2
If you need a good regex debugger, you should consider kodos. http://kodos.sourceforge.net
Snobol is the granddaddy of pattern matching languages. Yes -- it's an old language, and it has an unorthodox syntax -- but regular expressions are also old and have an unorthodox syntax.
There are several different implementations out there, the easiest to deal with is csnobol. http://www.snobol4.org/ has a bunch of information on the language.
The other language, Icon, was developed from research done on Snobol. Icon provides a more modern syntax and flow control. While not as powerful in pure pattern matching as Snobol, the whole language can be used when string scanning. http://www.cs.arizona.edu/icon has a bunch of information on the language. There is an object-oriented version of Icon that is being developed, Unicon. http://www.unicon.org/ has information.
What a bunch of crap. You don't know squat about PERL do you?
Here's some advice, buddy: before you go spouting off about PERL, why don't you go read that book with the llama on the cover. Then you can come back here and tell us all you know about PERL.
This is a handy little tool for generating regex:
http://txt2regex.sourceforge.net/
Assuming, of course that your x?html is valid - at least in Perl anyway:
cLive ;-)
-- Trinity in high heels carrying a whip: The donimatrix - there is no spoonerism
I see what you were saying now :)
OK, mine works if you never use < in your JS.
:)
cLive ;-)
-- Trinity in high heels carrying a whip: The donimatrix - there is no spoonerism
If I were sitting down to hack away at something that cannot be done (sanely, correctly, speedily, etc) without regex, I would rather have a "Cookbook" with more than enough information than adding to my work and stress load by having to scrape up what I needed (or worse, maybe *not*) on the Interweb.
AmigaBASIC was the same way. I knew enough basic BASIC to pull off some things, but without a (thick) handy manual, it's just not as much fun.
What is it?
GENERAL PUBLIC SIGNATURE (GPS) Any replies (derivatives) of this post must also use the GPS
Of course, if you use the one true text editor, all you need to know about regular expressions is:
:)
There is no need to use a SlashDot sig for SEO...
Have you noticed how regex creeps into nerd talk, like a slightly nerdier version of phone text talk. All these nerds hanging around like the fon[zs]e at the [dj]uke box with their pocket protectors while sticking square brackets around their (letters|words) because they're more used to speaking regex like that to a screen than using the word "or" to another human being?
There are quite a few regular expression tools available, with different capabilities and purposes. For the novice who doesn't want to learn more or doesn't have time, the best is probably txt2regex, which walks you through the construction of the regexp and generates output for 20 different programs and languages. It is one of the few tools that I know of that isn't specialized for a particular language or program. My own tool, Redet, provides an interface to 29 regular expression implementations. It is aimed at people who know something about regular expressions or are willing to spend some time learning but helps out by providing palettes showing the notation for each program and a history system, so that you can first construct the pieces of a complex regexp, then assemble them. It also has features aimed at providing a search environment that may be useful for people who need no help constructing their regular expressions.
regex-coach uses PERL-style regular expressions. Its particular virtue is that it can single-step through the match and show the parse tree, so it is useful if you want to understand the matching process in detail. Similar in that it helps to understand the implementation of regular expressions is re_graph, which given a regular expression draws the corresponding finite state automaton.
A couple of nice tools aimed at Python users are Kiki and Kodos.
These and some other tools and libraries are listed on this page.
On the topic of regex's and off-topic of book reviews...
This should be an easy solution but...anyone see a regex that will always grab just the domain portion from the following:
anonymous.coward.org
coward.org
anonymous.slashdot.offtopic.posting.coward.org
$domain = $1; #should be coward.org for all above
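One possible answer, sketched in Python rather than Perl: anchor at the end of the string and capture the last two dot-separated labels. The same pattern should carry over to Perl unchanged, with the result landing in $1. This assumes the registered domain is always exactly the last two labels, which does not hold for names under suffixes such as .co.uk.

```python
import re

# Capture the final two labels: something, a dot, something, end of string.
DOMAIN = re.compile(r"([^.]+\.[^.]+)$")

hosts = [
    "anonymous.coward.org",
    "coward.org",
    "anonymous.slashdot.offtopic.posting.coward.org",
]

domains = [DOMAIN.search(host).group(1) for host in hosts]
for host, domain in zip(hosts, domains):
    print(host, "->", domain)
```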
For Emacs users: M-x re-builder. It lets you test your Emacs regexps interactively.
I'm still scratching my head trying to figure out how to exclude words from a regular expression that matches also....
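If the goal is to match a family of words while leaving out a few specific ones, a negative lookahead is the usual tool in engines that support it (Perl, Python, PHP among them). A minimal Python sketch, assuming that reading of the question; the word list is invented:

```python
import re

# Match any word ending in "at", except the exact words "cat" and "bat".
# (?!...) is a negative lookahead: it checks without consuming text.
PATTERN = re.compile(r"\b(?!(?:cat|bat)\b)\w+at\b")

text = "cat hat bat mat flat"
kept = PATTERN.findall(text)
print(kept)
```

The \b inside the lookahead matters: without it, a longer word that merely starts with an excluded word would be rejected as well.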
1) They aren't close to Turing complete; CFGs are much closer, but still don't do it. 2) Enhanced reg exps? This is an implementation of a program that seems to function like a regular expression parser. "Enhanced reg. exps" are not enhanced reg. exps; they are a way of writing code similar to a regular expression that must be handled WITH A STACK FRAME OR TWO. Note that this isn't, in any sense of the word, a regular expression. This is analogous to trying to explain to a user the difference between advanced user features and the underlying programming constructs. Enhanced reg. exps are simply a nice user interface and a misuse of the term. Questions? Ask. -Dan