Slashdot Mirror


Mastering Regular Expressions

Simon P. Chappell writes "Classics are funny things, especially in the world of books. There are books that people say 'should' be classics (I'll refrain from mentioning names to protect the pretentious) and then there are books that people are too busy actually using to get around to listing as classics. Mastering Regular Expressions, now in its third edition, is in the second group. It's one of those books that you see on desks in computer departments the world over. This is a real 'doers' book." Read the rest of Simon's review.

Mastering Regular Expressions
author: Jeffrey E.F. Friedl
pages: 515 (31 page index)
publisher: O'Reilly
rating: 11 out of 10
reviewer: Simon P. Chappell
ISBN: 0596528124
summary: A classic of modern computer literature.

This is a book for programmers; managers, project managers and architects need not apply. If you talk about code instead of writing it and have teams of programmers report to you, then consider buying this book and giving it to them. If you're a technical lead or lead programmer, then shame on you if an earlier edition of this book isn't already on your shelves! The majority of examples are written using Perl, but if you can read basic Perl (Pidgin Perl, perhaps?) then you'll be fine with the examples. Programmers in PHP, Java, .NET and Ruby also have dedicated sections of the book, so it's very inclusive and almost platform agnostic.

The book has ten chapters divided into two parts. Chapters one through six are what Mr. Friedl calls the "story" of regular expressions. Chapters seven through ten are an examination of the specific regular expression capabilities of Perl, Java, .NET and PHP.

Chapter one is an introduction to regular expressions. At only 33 pages, you might think that it would be shallow, but rather, it is knowledge dense. The examples in the first chapter use egrep extensively. This makes a lot of sense as it's an advanced tool, easy to use and freely available for most modern operating systems.

Chapter two builds on this introduction with extended introductory examples. These are written in Perl (again, simple and easy to follow), but there is no doubt that the regular expressions are the stars of the show around here. The examples are small Perl programs, but their benefit is that Mr. Friedl talks the reader through the process of creating each of them. This is more useful than just presenting example programs, because with just pure examples, you are out of luck if your specific problem is not covered. With this approach, you're coached towards thinking in regular expressions and are better equipped to address your personal regular expression needs.

Chapter three provides an overview of regular expression features and flavors. It starts with a historical view of the development of regular expressions, including a few asides about the influence that the earlier versions of the book have had on that development. After that, the chapter uses a search and replace example to demonstrate some of the differences between flavors of regular expression capabilities provided by different programming languages. Strings, Unicode and metacharacters round out this overview.

Strap yourself in for chapter four; it's time to talk about the computer science that makes all of that matching work. If you didn't know the difference between an NFA and a DFA regular expression engine before you start this chapter, you most certainly will by the end of it. At first sight, it might seem that this is a chapter for the pure propeller heads amongst us. While there is much theory here, it's all presented in the light of how your regular expression engine is trying to do what you asked. By understanding the approaches to regular expression processing, we can learn to help ourselves. We help ourselves when we write regular expressions that run faster and use less memory. We write better regular expressions when we understand the consequences of what we write. For example, the oft-written ".*" (dot star) seems like a great way to ignore a bunch of stuff in the middle of an expression, but such simplistic use is just waiting to bite you. This chapter explains why and how to deal with the situations where you'd be tempted to use simplistic expressions and how just a little extra thought can bring you the behavior you want.
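
As a rough illustration of the dot-star trap, here is a minimal sketch in Python. It is not an example taken from the book; the sample string and variable names are made up for this purpose. A greedy ".*" between two quotes runs to the end of the line and backtracks only as far as the last quote, so it swallows everything between the first and last quote.

```python
import re

# Hypothetical sample input with two separate quoted strings.
line = 'He said "yes" and then "no" quietly'

# Greedy: .* grabs the rest of the line, then backs off to the LAST quote.
greedy = re.search(r'".*"', line).group()

# Negated character class: a quoted string may not contain a quote,
# so each quoted string is matched on its own.
precise = re.findall(r'"[^"]*"', line)

# Lazy quantifier: another way to stop at the first closing quote.
lazy = re.findall(r'".*?"', line)

print(greedy)    # "yes" and then "no"
print(precise)   # ['"yes"', '"no"']
print(lazy)      # ['"yes"', '"no"']
```

The negated class says exactly what is allowed between the quotes, which is usually the clearer of the two fixes.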

Chapter five is a practical counterpoint to the previous theory chapter. Here, Mr. Friedl discusses practical regular expression techniques. There are a number of short examples, before he works through medium-sized HTML processing examples and finishes up with a look at processing Comma Separated Value (CSV) data.
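
To give a flavor of the CSV problem, the following is a simplified sketch in Python, not the solution developed in the book. It assumes well-formed input on a single line, with fields that are either unquoted or wrapped in double quotes, and a doubled quote standing for a literal quote inside a quoted field. The names FIELD and split_csv_line are invented here.

```python
import re

# One field: it starts at the beginning of the line or after a comma,
# and is either a quoted field (group 1) or an unquoted one (group 2).
FIELD = re.compile(r'''
    (?: ^ | , )
    (?:
        "( (?: [^"] | "" )* )"     # quoted field; "" is an escaped quote
      |
        ( [^",]* )                 # unquoted field; may be empty
    )
''', re.VERBOSE)

def split_csv_line(line):
    """Split one well-formed CSV line into a list of field values."""
    fields = []
    for m in FIELD.finditer(line):
        if m.group(1) is not None:
            # Quoted field: collapse doubled quotes back to single ones.
            fields.append(m.group(1).replace('""', '"'))
        else:
            fields.append(m.group(2))
    return fields

print(split_csv_line('a,"b,c",,"say ""hi""",d'))
# ['a', 'b,c', '', 'say "hi"', 'd']
```

For production data, Python's standard csv module is the safer choice; the point here is only to show how the alternation separates the two kinds of field.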

Chapter six covers efficiency. Your regular expression can be as correct as you like, but if it takes what seems like an eternity to run, then it's of little use. This chapter mostly addresses NFA-based engines, because they have the greatest variability based on how the regular expression is written.
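
A small experiment makes the point about how the expression is written. This is a sketch in Python, whose re module is a backtracking engine; the two patterns and the helper timed_search are illustrative choices made here, not examples from the book. Both patterns find a match in exactly the same strings, but the nested quantifier gives the engine exponentially many ways to split a run of a's before it can report failure.

```python
import re
import time

SLOW = re.compile(r'(a+)+b')   # nested quantifiers: many ways to split the a's
FAST = re.compile(r'a+b')      # same matches, only one way to split the a's

def timed_search(pattern, text):
    """Return (whether a match was found, seconds taken)."""
    start = time.perf_counter()
    found = pattern.search(text) is not None
    return found, time.perf_counter() - start

# No 'b' anywhere, so every attempt has to fail. The length is kept small
# on purpose: each extra 'a' roughly doubles the work for SLOW.
subject = 'a' * 18

slow_found, slow_secs = timed_search(SLOW, subject)
fast_found, fast_secs = timed_search(FAST, subject)
print(f'nested: found={slow_found} in {slow_secs:.4f}s')
print(f'flat:   found={fast_found} in {fast_secs:.6f}s')
```

Timings will vary by machine and Python version, so none are quoted here; try raising the length by a few characters to see the growth.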

Chapters seven through ten cover the specifics of using regular expressions in Perl, Java, .NET and PHP. They're well written and cover everything you need to apply the content of the first six chapters to your programming language of choice.

Everything about this book is great. This is the kind of book that O'Reilly built its reputation with. A master of the subject matter, writing in a clear, easily understood manner, leaving the reader educated and able to operate comfortably with the subject matter. I may not be a regular expression guru, but I feel that I have a much better grasp of the fundamentals that I would need if I did want to be such a guru.

Mr. Friedl is to be commended for his clear explanations of what is, in all reality, much more complex computer science than many of us are used to dealing with. The fact that his explanations are highly readable and enjoyable is a significant bonus.

There is a website for the book, regex.info and a blog at regex.info/blog, where Mr. Friedl has some wonderful photographs of Japanese gardens with their autumn colors. (Nothing to do with regular expressions, but they appealed to my inner photographer.)

Lastly, while the book is not intended to be an encyclopedia of regular expressions, all of the examples are very relevant to programmers' needs and this book can easily serve that reference role.

At the risk of sounding like some kind of O'Reilly shill or a relative of Mr. Friedl, I must report that I don't think that I found a single thing I didn't like about this book.

This is a classic of the first order. Nail it to your desk unless you want to be constantly retrieving it from your co-workers. If I might be permitted a Spinal Tap reference, this one goes to eleven. If you ever use regular expressions, are thinking of using regular expressions or are in the same room as a regular expression, then you need this book.

You can purchase Mastering Regular Expressions from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

11 of 208 comments

  1. Personally... by rainman_bc · · Score: 5, Informative

    I just like to go to http://www.regular-expressions.info/ myself - I seem to find all the stuff I forget from time to time there...

    --
    09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
  2. Re:an anecdote caused by this good book by Sebastopol · · Score: 3, Informative


    The regex extensions have been mainstream since perl5.8.

    The other co. is using perl5.004, which doesn't even support >2GB files.

    Trolls are the worst when they make uninformed assertions. You must work for the company that got my code.

    --
    https://www.accountkiller.com/removal-requested
  3. Re:Agreed! by orasio · · Score: 3, Informative

    Regular expressions have academic books behind them, and computer science books are written about them.
    Maybe what you talk about is nice, but REs (with extensions) are kind of ultimate solutions to the problem they try to solve (describing an automaton in a string of characters).

    The only thing that is needed to use another complete system is a theorem that proves there is a two way conversion between the system you like and REs, and then it would be fairly easy to implement everywhere.

  4. Re:New for 3rd Edition by c0rr1n · · Score: 4, Informative

    Mastering Regular Expressions, Third Edition, now includes a full chapter devoted to PHP and its powerful and expressive suite of regular expression functions, in addition to enhanced PHP coverage in the central "core" chapters. Furthermore, this edition has been updated throughout to reflect advances in other languages, including expanded in-depth coverage of Sun's java.util.regex package, which has emerged as the standard Java regex implementation. The languages covered in Mastering Regular Expressions include Perl, Python, Ruby, Java, VB.NET and C# (and any language using the .NET Framework), PHP, and MySQL.

  5. Re:Slightly offtopic, Regex related. by prostoalex · · Score: 3, Informative

    The Regex Coach - The Regex Coach is a graphical application for Windows and Linux/x86 (also usable on FreeBSD) which can be used to experiment with (Perl-compatible) regular expressions interactively.

    The Regulator - The Regulator is an advanced, free regular expressions testing and learning tool written by Roy Osherove. It allows you to build and verify a regular expression against any text input, file or web, and displays matching, splitting or replacement results within an easy to understand, hierarchical tree.

  6. Re:question for the floor by gerbercj · · Score: 2, Informative

    This book is not really meant to teach you how to write regular expressions. This book teaches you to understand how your regular expressions will be parsed so that you can understand the impact of your approach and start creating expressions that are much more efficient, or that handle special cases more elegantly. It's the book that, in my case, took my skills to the next level. I still refer to it a few times a year, and am glad that it's a part of my library.

    --
    The weird part is that I can feel productive even when I'm doomed.
  7. Re:Maybe it's just me but isn't 515 pages too much by morcego · · Score: 5, Informative

    You are obviously a newbie regarding regular expressions, based on your post.

    First, 515 is not too much when talking about regular expressions. There is much to be discussed, not to mention tips&tricks to give away.

    Also, you are dead wrong about the "small web page". First, it only talks about Perl Regular Expressions. There are other kinds, including the classic (basic?), extended, POSIX and (from your reference) perl regular expressions. Mastering the different kinds is enough to fill 300 pages of the book.

    Where are you going to use REs? sed? VI? perl? php? C? SQL? You need to know what flavor of REs you need for that particular environment.

    Regular expressions are a very tricky topic, and understanding them is not something easily accomplished. Come to think about it, 515 might not even be enough.

    --
    morcego
  8. Re:For Programmers? NOT! by phazer · · Score: 2, Informative

    Well, I hate to say it, but I agree with the Prof. There are really two worlds in computer science: academia and work.

    Pretty much _all_ assignments that will be given in CS courses can be solved quite easily by using a library that implements a solution. In the working life, that would be the proper solution, but not so in school.
    Of course you can just call a class in your standard library that implements regular expressions and solve a problem that way. But that's not why you're in college. You ALREADY know how to call a library that someone else wrote. Calling libraries is trivial, you can pick that up with a few pages reading and some practice. The Professor isn't there to teach you how to call libraries though. What you're supposed to take away from the class is the understanding of how the class does the work.

    Finite state machines are the underlying theory of regular grammars (See: Chomsky hierarchy of languages.) So if the class covers how FSMs work, and what their usefulness is, then you should try to actually apply that knowledge to the problem. The assignment isn't so much one of "find the answer" (nobody cares about the answer) but one of "apply the theory" and learn something new.
    One day you'll come across a problem that is very similar to regular expressions, but not quite like them, and you may remember this assignment and write an FSM to solve it, and you'll be glad for it.
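
For readers who have not written one, a hand-rolled finite state machine can be quite small. The sketch below, in Python, is an illustration added here and not the assignment being discussed: a table-driven DFA that accepts the same strings as the regular expression ab*c. The state numbering and the names TRANSITIONS, ACCEPTING and dfa_accepts are invented for this example.

```python
import re

# Transition table for a DFA accepting the language of the regex ab*c.
# States: 0 = start, 1 = seen 'a' (loops on 'b'), 2 = seen the final 'c'.
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'b'): 1,
    (1, 'c'): 2,
}
ACCEPTING = {2}

def dfa_accepts(text):
    """Run the DFA over text; accept only if it ends in an accepting state."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject immediately
            return False
    return state in ACCEPTING

# Cross-check the hand-written machine against the library on a few inputs.
for sample in ['ac', 'abbbc', 'abcc', 'bc', '', 'ab']:
    expected = re.fullmatch(r'ab*c', sample) is not None
    print(repr(sample), dfa_accepts(sample), expected)
```

Each character is examined exactly once and there is no backtracking, which is the property that makes the DFA view useful for reasoning about cost.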

    It's like you're learning about sorting algorithms, and then you come along and use Collections.sort() instead of writing your own quicksort (and understanding the algorithm while you do so.)

  9. Re:Third Edition? Already? by onlyjoking · · Score: 2, Informative

    No need for the 3rd edition unless you desperately need the extra 40-odd pages on PHP regexes. That's the only difference between the 3rd and 2nd editions as far as I can tell.

  10. Re:Snob by Jerf · · Score: 2, Informative

    "My computer science degree taught me *nothing* about regular expressions. In fact, I would expect that any quality computer science degree wouldn't teach you about RE. Here's why: A good computer science degree teaches you one language, then it teaches you the concepts behind programming -- algorithm analysis, discrete math, data structures, fundamentals of programming on modern operating systems (threads, semaphores, etc.), and once you learn all of the fundamentals, you are expected to be able to learn any programming language virtually at will."

    Very wrong, albeit with qualifications.

    Any competent Computer Science course should contain a discussion of the Chomsky Language Hierarchy. If you hold a computer science degree and that page is gibberish to you, you have been robbed. (Or at least, it should be familiar gibberish, for those who didn't like that course.)

    Regular Expressions come up in that course because the languages they are capable of describing are provably isomorphic to the languages that can be recognized by a Finite State Automaton, another word you ought to know if you have a computer science degree.

    The "qualifications" mentioned before is that the Regular Expressions in this case are a very limited, precisely-specified language that forms only the barest shell of the Regular Expressions that the book in question discusses. (The mathematical definition is practically useless, because very simple things like "i{0,50}" translate into a horrific mathematical RE, but most RE features can so be translated and many homework problems consist of doing just that, just to prove it can be done.) Nevertheless, it is important to understand these limitations so you don't press Regular Expressions to do something they can't really do, like entirely parse Context-Free or Context-Sensitive languages, like most programming languages (at least mostly) are.
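
To make the "i{0,50}" remark concrete, here is a sketch in Python of one possible translation into a form that uses only concatenation and alternation. The helper name expand_bounded is made up here, and the expansion shown is just one way to do it, assuming the symbol is a single literal character with no special meaning.

```python
import re

def expand_bounded(symbol, max_count):
    """Rewrite symbol{0,max_count} as an alternation of literal runs.

    Assumes symbol is one plain character (no regex metacharacters).
    """
    alternatives = [symbol * k for k in range(max_count + 1)]  # '', 'i', 'ii', ...
    # (?: ... ) is only grouping; mathematically it is a plain parenthesis.
    return '(?:' + '|'.join(alternatives) + ')'

compact = re.compile(r'i{0,50}')
expanded_text = expand_bounded('i', 50)
expanded = re.compile(expanded_text)

print(len(r'i{0,50}'), 'characters versus', len(expanded_text))

# Both forms accept exactly the runs of 0 to 50 i's.
for k in range(53):
    run = 'i' * k
    assert (compact.fullmatch(run) is None) == (expanded.fullmatch(run) is None)
```

A nested form such as (?:i(?:i(?:i)?)?)? is shorter, but the optional operator is itself shorthand for an alternation with the empty string.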

    (It is true that some extensions, especially those found in perl, push the regular expression into Context-free or Context-sensitive territory when used correctly, but generally speaking, you're really asking for one disaster of a Regular Expression. You're better off using a parser. Perl 6 has some interesting innovations on this front, essentially building on their regular expression support to upgrade the language to built-in parser support, presumably at least the context-free level, perhaps more. I don't know.)

    It is true that a Programmer degree may not cover regular expressions, but you absolutely should have at least seen the mathematical basis for a Regular Expression in your Computer Science course. At Michigan State University where I got my education, it is, IIRC, in the sophomore course on the theory track.

    The language hierarchy is one of the absolute fundamentals of computer science.
  11. Regex Coach by Anonymous Coward · · Score: 1, Informative

    I can't believe no one has mentioned The Regex Coach at http://weitz.de/regex-coach/. While I totally agree with the 11/10 rating for MRE (I have the first and second editions), The Regex Coach is an invaluable prototyping and debugging tool for PCRE (Perl Compatible Regular Expressions). It runs on Windows and Linux and is free (but not open source IIRC).

    Both the book and the tool are 100% essential to anyone writing any regex more complicated than /^foo\s(\w+)\sbar$/.
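
For reference, the baseline pattern quoted above behaves as follows when tried in Python. This is a quick sketch with made-up sample strings; the pattern itself is the commenter's, minus the Perl-style slashes.

```python
import re

# ^foo, one whitespace character, a captured run of word characters,
# one whitespace character, then bar at the end of the string.
PATTERN = re.compile(r'^foo\s(\w+)\sbar$')

hit = PATTERN.search('foo quux bar')
print(hit.group(1))                      # quux

# \s matches exactly one whitespace character, so two spaces do not match.
miss = PATTERN.search('foo  quux bar')
print(miss)                              # None
```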