Slashdot Mirror


Regular Expression Pocket Reference

Michael J. Ross writes "When software developers need to manipulate text programmatically, such as finding all substrings within some text that match a particular pattern, the most concise and flexible solution is to use "regular expressions," which are strings of characters and symbols that can look anything but regular. Nonetheless, they can be invaluable for locating text that matches a pattern (the "expression"), and optionally replacing the matched text with new text. Regular expressions have proven so popular that they have been incorporated into most if not all major programming languages and editors, and even at least one Web server. But each one implements regular expressions in its own way, which is reason enough for programmers to appreciate the latest edition of Regular Expression Pocket Reference, by Tony Stubblebine." Read below for the rest of Michael's review.

Regular Expression Pocket Reference, Second Edition
author: Tony Stubblebine
pages: 126
publisher: O'Reilly Media
rating: 9/10
reviewer: Michael J. Ross
ISBN: 0596514271
summary: A pithy guide to regular expressions in many languages.

The second edition of the book was published by O'Reilly Media on 18 July 2007, under the ISBNs 0596514271 and 978-0596514273. On the book's Web page, the publisher makes available the book's table of contents and index, as well as links for providing feedback and any errata. As of this writing, there are no unconfirmed errata (those submitted by readers but not yet checked by the author to see whether they are valid), and no confirmed ones, either. In fact, in my review of the first edition, published in 2004, it was noted that there were no unconfirmed errata, despite the book being out for some time prior to that review. The most likely explanation is that the author, in addition to any technical reviewers, did a thorough job of checking all of the regular expressions in the book, along with the sample code that makes use of them.
These efforts have paid off with the apparent absence of any errors in this new edition — something unseen in any other technical book with which I am familiar.

Before discussing this particular book, it may be of value to briefly discuss the essential concept of regular expressions, for the benefit of any readers who are not familiar with them. As noted earlier, a regular expression (frequently termed a "regex") is a string of characters intended for matching substrings in a block of text. A regex pattern can match literally, such as the pattern "book" matching both "book" and "bookshelf." A pattern can also use special characters and character combinations — often termed metasymbols and metasequences — such as \w to indicate a single word character (A-Z, a-z, 0-9, or '_'). Thus, the regex "b\w\wk" would match "book," but not "brook."
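
For readers who would like to try the \w example themselves, here is a minimal sketch in Python rather than Perl. The function name first_match is made up for illustration, and the re.ASCII flag is an assumption added so that \w is limited to the characters listed above; without it, Python's \w also accepts non-ASCII letters and digits.

```python
import re

# b, then exactly two word characters, then k.
# re.ASCII restricts \w to A-Z, a-z, 0-9 and '_' (an assumption made
# here to mirror the description in the review).
PATTERN = re.compile(r"b\w\wk", re.ASCII)


def first_match(text):
    """Return the first substring matching the pattern, or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


print(first_match("book"))       # book
print(first_match("bookshelf"))  # book
print(first_match("brook"))      # None: three characters sit between b and k
```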

Here is a simple example to show the use of regexes in code, written in Perl: The statement "$text =~ m/book/;" would find the first instance of the string "book" inside the scalar variable $text, which presumably contains some text. To substitute all instances of the string with the word "publication," you could use the statement "$text =~ s/book/publication/g;" ('g' for a global search) or use "$text =~ s/bo{2}k/publication/g;". In this simplistic example, the second statement makes use of a quantifier, {2}, indicating exactly two occurrences of the preceding letter.
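
The same two operations can be sketched in Python, as a rough analogue rather than an exact translation of the Perl statements. The sample sentence and the variable names are invented for illustration.

```python
import re

# Hypothetical sample text; any string would do.
text = "A book about books, kept on a bookshelf."

# Rough analogue of:  $text =~ m/book/;
first = re.search(r"book", text)
first_pos = first.start() if first else -1

# Rough analogue of:  $text =~ s/book/publication/g;
# re.sub replaces every occurrence by default, like Perl's /g.
literal = re.sub(r"book", "publication", text)

# Same substitution, written with the {2} quantifier.
quantified = re.sub(r"bo{2}k", "publication", text)

print(first_pos)
print(literal)
print(literal == quantified)
```

Both substitutions should produce the same string, since "bo{2}k" is simply another way of spelling "book".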

These examples employ only one metacharacter (\w) and one quantifier ({2}). The total number of metacharacters, metasymbols, quantifiers, character classes, and assertions (to say nothing of capturing, clustering, and alternation) that are available, in most regex-enabled languages, is tremendous. However, the same cannot be said for the readability of all but the simplest regular expressions — especially lengthy ones not improved by whitespace and comments. As a consequence, when using regexes in their code, many programmers find themselves repeatedly consulting reference materials that do not focus on regular expressions. These resources comprise convoluted Perl books, incomplete tutorials on the Internet, and confusing discussions in technical newsgroups. For too many years, there was no published book providing the details of regexes for the various languages that utilize them, in addition to a clear explanation of how to use regexes wisely.
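
To illustrate the point about whitespace and comments, the sketch below writes one pattern twice in Python: once compactly, and once using the re.VERBOSE flag, which is similar in spirit to Perl's /x modifier. The date format (YYYY-MM-DD) and the pattern itself are illustrative assumptions, not taken from the book.

```python
import re

# Compact form: correct, but hard to scan.
COMPACT = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

# The same pattern with whitespace and comments. Under re.VERBOSE,
# unescaped whitespace is ignored and '#' starts a comment.
VERBOSE = re.compile(
    r"""
    ^
    (\d{4})                    # year: four digits
    -
    (0[1-9] | 1[0-2])          # month: 01 to 12
    -
    (0[1-9] | [12]\d | 3[01])  # day: 01 to 31 (no per-month check)
    $
    """,
    re.VERBOSE,
)

for candidate in ("2007-07-18", "2007-13-01", "2007-02-31"):
    same = bool(COMPACT.match(candidate)) == bool(VERBOSE.match(candidate))
    print(candidate, bool(VERBOSE.match(candidate)), same)
```

Note that the pattern accepts "2007-02-31": it checks the shape of a date, not whether the day exists in that month.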

Fortunately, O'Reilly Media offers two titles in hopes of meeting that need: Mastering Regular Expressions, by Jeffrey Friedl, and Regular Expression Pocket Reference, by Tony Stubblebine. In several respects, the books are related, particularly in that Stubblebine bases his slender monograph upon Friedl's larger and more extensive title, justifiably characterized by Stubblebine as "the definitive work on the subject." In addition, Stubblebine's book follows the structure of Friedl's book, and contains page references to the same. The major difference is that Regular Expression Pocket Reference is, just as the title indicates, for reference purposes only, and not intended as a tutorial.

At first glance, it is clear that Stubblebine's book packs a great deal of information into its modest 126 pages. That may partly be a result of the terseness of most, if not all, of the regular expression syntax; a metasymbol of more than two characters would be considered long-winded! Yet the high information density is likely also due to the manner in which Stubblebine has distilled the operators and rules, as well as the meaning and usage thereof, down to the bare bones. But this does not imply that the book is bereft of examples. Most of the sections contain at least one, and sometimes several, code fragments that illustrate the regex elements under discussion.

The book begins with a brief introduction to regexes and pattern matching, followed by an even briefer cookbook section, with Perl-style regexes for a dozen commonly-needed tasks, e.g., validating dates. The bulk of the book's material is divided into 11 sections, each one devoted to the usage of regexes within a particular language, application, or library: Perl 5.8, Java, .NET and C#, PHP, Python, Ruby, JavaScript, PCRE, the Apache Web server, the vi programmer's editor, and shell tools.

Each of these sections begins with a brief overview of how regexes fit into the overall language covered in that section. Following this is a subsection listing all of the supported metacharacters, with a summary of their meanings, in tabular format. In most cases, this is followed by a subsection showing the usage of those metacharacters — either in the form of operators or pattern-matching functions, depending upon how regular expressions are used within that language. Next is a subsection providing several examples, which is often the first material that most programmers turn to when trying to quickly figure out how to use one aspect of a language. Each section concludes with a short listing of other resources related to regexes for that particular language.

There are no glaring problems in this book, and I can only assume that all of the regular expressions themselves have been tested by the author and by previous readers. However, there is a minor weakness that should be pointed out, and could be corrected in the next edition. In most of the sections' examples, Stubblebine wisely formats the code so that every left brace ("{") is on the same line as the beginning of the statement that uses that brace, and each closing brace ("}") is lined up directly underneath the first character of the statement. This format saves space and makes it easier to match up the statement with its corresponding close brace. However, in the .NET / C# and PCRE library sections, the open braces consume their own lines, and also are indented inconsistently, as are the close braces, which makes the code less readable, as well as less consistent among the sections.

Some readers may fault the book's sparse index. Admittedly, an inadequate index in any sizable programming book can make it difficult if not impossible to find what one is looking for. As a result, one ends up flipping through the book's pages hoping to luckily spot the desired topic. This is the rather unpleasant method to which a reader must resort when a technical book has no index, or one that is inadequate, which is far too often the case. Stubblebine's index offers only several dozen entries for all the letters of the alphabet, and only two symbols. Some readers might demand that all of the metacharacters and metasequences be listed in the index, so they can be found even faster than otherwise. But given the large number of metacharacters and metasequences, as well as method names, module functions, and everything else relevant, creating an exhaustive index would almost double the size of the book, and be largely redundant with the language-specific sections. Within each language, there is typically a limited enough number of pages that scanning through them to find a particular topic would not be onerous. On the other hand, some of the index's inclusions and omissions are odd. For instance, two symbols are listed, and yet no others; why bother with those two? Also, a few key concepts are missing, such as grouping and capturing.

Yet aside from these minor blemishes, Regular Expression Pocket Reference is a concise, well-written, and information-rich resource that should be kept on hand by any busy software developer.

Michael J. Ross is a Web developer, writer, and freelance editor.

26 of 144 comments

  1. Re:General introductions to regex? by Swizec · · Score: 2, Informative

    You can always try php.net. I find that it's a fairly good introductory tutorial into regular expressions going through all the basics and such. It might be a tad specific, but the general science behind them is there and should allow you to quickly learn them in any language.

  2. Stand back, everyone by Armakuni · · Score: 5, Funny

    ...I have a pocket reference to regular expressions.

    --
    That's not Picasso, that's Kandinsky!
    1. Re:Stand back, everyone by superwiz · · Score: 4, Funny

      and I want you to watch as I fuck your sister, you mealy-mouthed faggot. Guess neither of us are going to get what we want, are we? You make me fucking sick, you all do. Fuck off before I slap you in the mouth. Ask your doctor to decrease the dose.
      --
      Any guest worker system is indistinguishable from indentured servitude.
  3. Re:General introductions to regex? by wol · · Score: 5, Informative
    --
    If you think deeply enough, you will have no single direction for your outrage.
  4. useful regular expression by stokessd · · Score: 4, Funny

    Here's the regular expression that I found most useful in childhood:

    "Hello, I'm a smart geeky person, please to not beat me up and take my lunch money. I can help you with your math homework"

    Sheldon

    1. Re:useful regular expression by Bogtha · · Score: 2, Funny

      I can help you with your math homework

      Now you have math problems.

      --
      Bogtha Bogtha Bogtha
    2. Re:useful regular expression by mav[LAG] · · Score: 3, Informative

      Pure genius and probably the first time I've laughed out loud at something to do with regexes. Hats off to you sir.

      For those of you who don't know the reference:

      Some people, when confronted with a problem, think "I know, I'll use regular expressions."
      Now they have two problems.
      --Jamie Zawinski, in comp.lang.emacs

      --
      --- Hot Shot City is particularly good.
  5. Correction for summary by Jerry Coffin · · Score: 4, Funny

    However, there is a minor weakness that should be pointed out, and could be corrected in the next edition. In most of the sections' examples, Stubblebine wisely formats the code so that every left brace ("{") is on the same line as the beginning of the statement that uses that brace, and each closing brace ("}") is lined up directly underneath the first character of the statement. This format saves space and makes it easier to match up the statement with its corresponding close brace. However, in the .NET / C# and PCRE library sections, the open braces consume their own lines, and also are indented inconsistently, as are the close braces, which makes the code less readable, as well as less consistent among the sections.


    A minor correction:
    However, there is a minor weakness that should be pointed out, and could be corrected in the next edition. Specifically, the book includes a section on .NET/C# and PCRE. By the time the next edition is needed, Microsoft will undoubtedly have moved on to new languages running in a new environment, as well as "enhanced" regular expressions "to provide better security and a syntax that is more approachable by beginners."
    --
    The universe is a figment of its own imagination.
  6. ObJWZ by Minwee · · Score: 4, Funny

    Because you just can't discuss regular expressions without bringing up this quote:

    Some people, when confronted with a problem, think "I know, I'll use regular expressions."
    Now they have two problems.

    -- Jamie Zawinski, 1997, in alt.religion.emacs

  7. already built in by Fujisawa Sensei · · Score: 2, Informative

    There's already a built in regular expression tutorial:

    man perlretut
    --
    If someone is passing you on the right, you are an asshole for driving in the wrong lane.
  8. Re:General introductions to regex? by athakur999 · · Score: 5, Informative

    A regex visualizer is pretty useful too for understanding how regex works. I used this one a few years ago and it does a good job:

    http://laurent.riesterer.free.fr/regexp/

    It will color code your regex pattern and the associated matches in the string to be searched so you know what is matching what.

    --
    "People that quote themselves in their signatures bother me" - athakur999
  9. Re:General introductions to regex? by MightyYar · · Score: 2, Funny

    You can't grep dead trees!
    Dang, you're right:
    [mini-me:/] luser% grep dead trees
    grep: trees: No such file or directory

    --
    W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
  10. Re:General introductions to regex? by Anonymous Coward · · Score: 2, Informative

    Urgh, no. I just had a look at the site, and any site with gems like this right on the front page should definitely be avoided:

    you could use the regular expression \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b to search for an email address. Any email address, to be exact. A very similar regular expression (replace the first \b with ^ and the last one with $) can be used by a programmer to check if the user entered a properly formatted email address.

    Checking email addresses for well-formedness (not the same as validity, anyway) is possible with regexes, but the above example is definitely wrong, and anyone who wants to do so would do better to use a Perl module or something similar in their language of choice instead of trying to reinvent the wheel and - inevitably - getting it wrong.

    So the advice that site is giving there is flawed on several levels, and for me, that's enough to take anything and everything on there with a big grain of salt. I'd advise others to stay away and turn to more reliable resources.
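
    The comment does not say exactly what is wrong with the quoted pattern, so the following Python sketch is only one possible illustration. It uses the anchored variant described in the quote and assumes case-insensitive matching (the quoted character classes list uppercase letters only). The function name looks_like_email and the sample addresses are made up for the example.

```python
import re

# Anchored variant of the quoted pattern: ^ and $ replace the two \b.
# re.IGNORECASE is an assumption; the quote lists only A-Z.
QUOTED = re.compile(
    r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$",
    re.IGNORECASE,
)


def looks_like_email(address):
    """Return True if the quoted pattern accepts the address."""
    return QUOTED.match(address) is not None


samples = [
    "user@example.com",        # accepted, as one would hope
    "user@example.museum",     # rejected, although .museum is a real TLD
    "john..doe@example..com",  # accepted, despite the empty domain label
]
for address in samples:
    print(address, looks_like_email(address))
```

    The {2,4} limit turns away addresses under longer top-level domains, while the loose character classes let through strings with consecutive dots in the domain.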

  11. Re:General introductions to regex? by jlowery · · Score: 2, Funny

    The most useful link I've seen on /. in a long, long time.

    --
    If you post it, they will read.
  12. Problems by Peaker · · Score: 2, Interesting

    I'll start with an Obligatory quote.

    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. --Jamie Zawinski, in comp.lang.emacs

    I'll close with a somewhat depressing fact: Regular expression and string processing can be done quickly and efficiently (and was done that way back decades ago, with grep and awk), but is actually done in a horribly inefficient way in all modern/popular programming language regexp engines.

    1. Re:Problems by Abcd1234 · · Score: 4, Interesting

      First off, Mr. Zawinski is recorded as being rather prejudiced against Perl, so I'd take any comments he's made regarding regex's with a massive grain of salt. In fact, I'd probably just ignore him altogether. Besides, his comments are focused almost entirely on the *mis*uses of regexes, not their appropriate application.

      As for your second complaint... uhh, who cares? Premature optimization is the devil. So if regex's allow you to cleanly implement a simple solution to a problem (and regexes *are* very well suited to certain tasks, even if they do tend to be misused, particularly in languages such as Perl where they're very tightly integrated), it would be foolish to move to another technique based solely on performance concerns without first profiling the code.

      'course, the real irony, on the performance front, is that Mr. Zawinski himself said "The heavy use of regexps in Emacs is due almost entirely to performance issues: because of implementation details, Emacs code that uses regexps will almost always run faster than code that uses more traditional control structures." So maybe they aren't so evil or slow after all?

  13. And for the Mac: RegExhibit by repetty · · Score: 4, Informative

    Another post links to a site for a regex visualizer utility for Windows and Linux.

    Here's one for the Mac:

    http://homepage.mac.com/roger_jolly/software/index.html#regexhibit

  14. I have to get one of these by HangingChad · · Score: 2, Insightful

    I'd rather stick knitting needles in my eyes than debug a regular expression.

    The only cure for that is getting a good reference and having a go at some tutorials until you get good enough to slay the beast. Then you'll be everyone's buddy at the office, because a lot of people feel the same way.

    Or you could just stick knitting needles in your eyes and slash your face with a razor and then everyone will leave you alone.

    --
    That's our life, the big wheel of shit. - The Fat Man, Blue Tango Salvage
    1. Re:I have to get one of these by tonystubblebine · · Score: 2, Interesting

      I wrote the pocket reference and also this best practices article for writing regular expressions. It's the habits I developed to avoid poking my own eyes out: http://www.onlamp.com/pub/a/onlamp/2003/08/21/regexp.html

  15. Re:General introductions to regex? by jandrese · · Score: 4, Insightful

    That and getting into that kind of depth is usually a good way to find the bugs in your regular expression library. It's also an easy way to write code that will drive maintainers crazy.

    Unless you're a hard core mathhead, that's probably not a good place to start with regexes IMHO. That's just going to scare people off from a highly useful tool. One generally does not need to rigorously prove that his regexes are going to work to use them. One does not have to use every feature of a language to make good use of it.

    --

    I read the internet for the articles.
  16. Don't fear - Just download txt2regex by Shux · · Score: 4, Informative

    Regular expressions are easier than you think and once you get comfortable with them you will be wishing you had done so sooner. In my opinion the difficult part of learning them is just getting used to the strange mess of dots, pluses, brackets, backslashes, etc. and what they mean in different contexts. Unfortunately it is hard to walk away from an article or howto on regexes and actually remember the meaning of all the symbols. Regular expressions are deliberately terse and that makes them hard to read and understand by humans.

    Therefore I think the best way to learn regular expressions is by example. I highly recommend this small interactive program which will walk you through building regular expressions for a few different languages. When you think you need a regex for a program, just fire it up and answer the questions.

    http://txt2regex.sourceforge.net/

    After a while you won't need txt2regex for simple stuff because you will have hopefully just absorbed the syntax. Once you have mastered the basic regexes which txt2regex can generate you will be able to dive into more advanced topics like capturing groups.

  17. Perl, regexps by Peaker · · Score: 2, Interesting

    If you read the link I posted, you will see that they are indeed evil and slow - and not for any good reason. The implementation of good regular expression engines is not difficult and known in CS theory for many decades.

    "Premature optimization" is a nice slogan - but the regexp performance problems are real, and I have encountered them before (I was extremely surprised to see that the regexp matching is scaling far worse than O(N) as it was clear to me that matching that regexp should be at worst O(N)).

    The reason it is depressing is because they got it right in the 60's, and are getting it wrong now. Stalling progress is sad. Deteriorating is depressing.

    As for elisp regexps being faster than other elisp methods - it's not very indicative, as the regexp engine is implemented in C. If you compare, however, the pathological regexps (see my original link) in elisp, compared to a naive elisp char-by-char iteration of strings, you'll see that the elisp code performs better.

    About your link, it doesn't seem that he is prejudiced against Perl, it seems that he hates Perl and that implies no prejudice. Many of us dislike or even hate Perl because we find it less suitable for all tasks than other tools that we use, and because we find that it is an extremely ugly hack that strongly encourages write-once read-never code.
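
    The scaling complaint in this comment can be probed with a small experiment. The Python sketch below uses the nested-quantifier pattern (a+)+b, a commonly cited worst case for backtracking engines, chosen here as an assumption; it is not necessarily the pattern from the link mentioned above. Timings depend on the machine and the engine, and some engines short-circuit this particular case, so no numbers are claimed.

```python
import re
import time

# Nested quantifiers: on a string of only 'a's a backtracking engine
# may try a very large number of ways to split the run before failing.
PATHOLOGICAL = re.compile(r"(a+)+b")


def time_failed_match(n):
    """Match against 'a' * n (which can never succeed) and time it."""
    subject = "a" * n
    start = time.perf_counter()
    result = PATHOLOGICAL.match(subject)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Sizes are kept small on purpose; larger values can take very long.
for n in (10, 14, 18):
    result, elapsed = time_failed_match(n)
    print(n, result, f"{elapsed:.4f}s")
```

    If the elapsed time grows much faster than n does, the engine is backtracking rather than running in time proportional to the input.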

    1. Re:Perl, regexps by Abcd1234 · · Score: 2, Interesting

      but the regexp performance problems are real, and I have encountered them before

      That's all well and good, but unless you're parsing extremely large volumes of text, the issues are probably unimportant. Which is, of course, why profiling is so important. Throwing out a perfectly valid solution simply because it is, in theory (or even in practice) slow, is ridiculous if you have other performance problems elsewhere, or if the code is running at a speed that is sufficient for the problem at hand.

      Put another way, if regexes solve the problem in a simple and easy manner, use them. And if, in running the code, you discover it's too slow to meet your requirements, and profiling indicates the regex is a problem, then switch to something else. But dismissing regexes out of hand is silly.

      because we find that it is an extremely ugly hack that strongly encourages write-once read-never code.

      Good for you. I'm not sure why you pointed this out, as I don't really care, but that's lovely. Regardless, Zawinski clearly dislikes Perl, and those quotes make it clear that this dislike has translated to regexes as well, despite their being clearly superior to other solutions for certain problem domains. It looks like you may have done the same...

  18. What is the concept of a regular expression? by Ed Avis · · Score: 2, Informative

    it may be of value to briefly discuss the essential concept of regular expressions,
    Before you say this, make sure you know what that concept is.

    A regular expression can be thought of as a program which generates a set of strings - or recognizes a set of strings, which is the same thing. Regular expressions correspond to finite state automatons, so just as a FSA cannot recognize the set of all palindromes, neither can a regular expression. Also languages like perl have extended the capabilities of their regular expression string matchers to include things like backreferences, which cannot be done in a true regular expression, so we tend to use the word 'regexp' nowadays.

    Or perhaps I'm just playing the grumpy computer scientist here.
    --
    -- Ed Avis ed@membled.com
    1. Re:What is the concept of a regular expression? by evilWurst · · Score: 2, Informative

      Rewording Ed for you: you can think of a "true" regular expression as just a shorthand for describing a state machine. Feed a state machine a string and it can only either accept or reject. Backreferences are an addition to the modern programming implementation of regular expressions, but aren't part of the language theory sense of regular expressions. You can do things with backreferences that *cannot* be done with a deterministic finite state automaton. Interestingly, that wiki link has a quote from Larry Wall also saying that Perl regexes aren't real regular expressions :)
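
      As a small illustration of the backreference point, the Python sketch below matches strings made of some block of a's and b's immediately repeated (such as "abab"). The set of all such doubled strings is a standard example of a language that no finite state automaton recognizes. The names DOUBLED and is_doubled are made up for the example.

```python
import re

# \1 must match exactly the text captured by the first group, so the
# whole string has to be some block of a's and b's written twice.
DOUBLED = re.compile(r"^([ab]+)\1$")


def is_doubled(text):
    """Return True if text consists of one block repeated twice."""
    return DOUBLED.match(text) is not None


for candidate in ("abab", "aa", "abba", "aba"):
    print(candidate, is_doubled(candidate))
```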

  19. Re:General introductions to regex? by Anonymous Coward · · Score: 2, Insightful

    Does the book, or any other reference explain why we need such an obtuse mechanism for parsing strings in the first place?
    What's obtuse about them? They're a straightforward and direct way of describing text patterns, and perfectly intuitive if you have an analytical mind (and if you don't, you shouldn't be programming in the first place).

    Here's a REXX example from Wikipedia:

    myVar = "(202) 123-1234"
    parse var MyVar 2 AreaCode 5 7 SubNumber
    say "Area code is:" AreaCode
    say "Subscriber number is:" SubNumber
    This is your idea of "intuitive"? Don't make me laugh! It's difficult to understand (humans work by recognising patterns, not by counting characters) and it's fragile (what happens if someone puts an extra space in after the area code?).

    Here's the Perl equivalent:

    my $var = "(202) 123-1234";
    my ($areacode, $subnumber) = $var =~ m{
        \( (\d+) \) # Area code (parenthesized digits)
        \s* # Optional whitespace
        (\d+-\d+) # Subscriber number (two groups of digits separated by a hyphen)
    }x;
    print "Area code is: $areacode\nSubscriber number is: $subnumber\n";
    Much more readable (though not quite as readable as it was before Slashdot mangled it by squishing the whitespace before the comments) and it's not often you get to say that about Perl!
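
    The fragility argument above can be checked with a rough Python sketch that puts a position-based split next to a pattern-based one. The slice positions mimic the column numbers in the REXX snippet, and the function names are invented for illustration; this is not a translation of either original program.

```python
import re

PHONE = re.compile(
    r"""
    \( (\d+) \)   # area code (parenthesized digits)
    \s*           # optional whitespace
    (\d+-\d+)     # subscriber number (digits, hyphen, digits)
    """,
    re.VERBOSE,
)


def split_by_position(number):
    """Cut at fixed columns: characters 2-4, then character 7 onward."""
    return number[1:4], number[6:]


def split_by_pattern(number):
    """Use the pattern; return None if the text does not match."""
    match = PHONE.search(number)
    return match.groups() if match else None


for number in ("(202) 123-1234", "(202)  123-1234"):
    print(split_by_position(number), split_by_pattern(number))
```

    With the extra space after the area code, the fixed-column split picks up a leading space in the subscriber number, while the pattern-based split returns the same two fields as before.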