Slashdot Mirror


Next Generation Regexp

prostoalex writes "Jeffrey E. F. Friedl, author of the newly published 2nd edition of Mastering Regular Expressions, wrote a feature article for O'Reilly Network on the recent innovations in the regular expression world. You'd think that an area such as regular expressions would be fairly stable, but according to the author, 'when I started to work on the second edition of Mastering Regular Expressions and started refocusing on the field, I was rather shocked to find out how much had really changed'. The article's behind-the-scene purpose is apparently to push a new book that O'Reilly published this month, but it has great educational value for anyone involved with practical extracting and reporting."

36 of 248 comments

  1. .NET regexps and Microsoft's documentation by Jobe_br · · Score: 4, Insightful

    I particularly like this bit:

    A full chapter on .NET-specific regex issues helps to clarify things, and helps to make up for the exceedingly poor documentation that Microsoft provides with the package.
    Nice to see that things haven't changed much ;)
    1. Re:.NET regexps and Microsoft's documentation by Rui+del-Negro · · Score: 4, Funny

      Microsoft's documentation reads like a novel compared to IBM's. The typical IBM manual has the following format:

      PAGE 1:

      [COMMAND1] is executed by typing the word [command1] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      [COMMAND2] is executed by typing the word [command2] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      [COMMAND3] is executed by typing the word [command3] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      PAGE 2:

      THIS PAGE IS INTENTIONALLY LEFT BLANK

      ...and so on and so on.

      Regarding this last IBM tradition (that others have tried to copy but few have truly mastered), the Spruce DVD Maestro manual has a page with the following text:

      Blank page.
      (mostly)

      RMN
      ~~~

  2. When is RegExp2 Going To Be Shipped by N8F8 · · Score: 3, Informative

    Amazon has slipped the shipping date twice. I don't know about you, but this book is definitely a "Must Have".

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
    1. Re:When is RegExp2 Going To Be Shipped by duck_prime · · Score: 4, Funny
      Amazon has slipped the shipping date twice.

      Yes, and that makes me want to use a decidedly irregular expression:
      #@*$^&@#$&#!!!
  3. Perl6 regular expressions - forget everything by Anonymous Coward · · Score: 3, Interesting

    Perl6 is going to radically change regular expressions as well. I guess the term "regular expression" is pretty vague/useless these days. You have to identify the language _and_ its revision to get an accurate idea of the regexp feature set you're dealing with. Just throw some variables and control structures into regexp and we'll have a full-blown extremely cryptic language. Maybe we need a RegExp Institute of Excellence with yearly meetings in Sweden or something.

  4. This has no educational purpose by Anonymous Coward · · Score: 3, Insightful

    Other than to tell us what is different between the two books. After reading the article I walked away with no general knowledge that was useful in using regular expressions, or what might be coming, or where we came from.

    It is a slightly wordy advertisement for why you should upgrade. The fact that it was foisted on us as something else annoys me, as I spent time reading it.

    I know, a slashdot reader that actually reads linked stories is such a minority, but come on, quit stuffing articles with advertising. Aren't the ads in the middle of a page enough?

  5. what about perl 6? by jbennetto · · Score: 5, Interesting

    He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 and Synopsis 5.

    1. Re:what about perl 6? by tswinzig · · Score: 3, Interesting

      He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 [perl.com] and Synopsis 5 [perl.com].

      If you could write and use a Perl 6 program right now, maybe he'd include a chapter on it in his book.

      This article is basically an overview of his book. His book doesn't cover Perl 6 regex's. Why should it? Perl 6 isn't even done yet, and so everything new for Perl 6 could change by the time it comes out.

      --

      "And like that ... he's gone."
  6. at some point... by g4dget · · Score: 4, Interesting
    Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions--a simple built-in parser generator with executable annotations is both clearer and more powerful. Parser generator syntax allows comments, whitespace, with a simple, fairly standard syntax.

    Perl and other languages should leave "good enough" alone when it comes to regular expressions and instead just make it easy to put chunks of grammars into programs.

    1. Re:at some point... by joshv · · Score: 4, Insightful

      Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions--a simple built-in parser generator with executable annotations is both clearer and more powerful. Parser generator syntax allows comments, whitespace, with a simple, fairly standard syntax.

      Yes, regular expressions should be used to find particular patterns in text and perform basic manipulations on them. Beyond a certain point of complexity it really doesn't make sense to perform more complex manipulations. Get the information you want out of the string using a regular expression, then manipulate it in code.
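A minimal sketch of the "extract with a regex, then manipulate in code" split described above, in Python. The date format, the pattern name `DATE_RE` and the helper `ship_date_plus` are made up for this illustration and do not come from the thread.

```python
import re
from datetime import date, timedelta

# Hypothetical input format, chosen only for this illustration: YYYY-MM-DD.
DATE_RE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def ship_date_plus(text, days):
    """Find the first YYYY-MM-DD date in text and return it shifted by `days`."""
    m = DATE_RE.search(text)
    if m is None:
        return None
    # The regex only locates the fields; the date arithmetic happens in code.
    found = date(int(m["year"]), int(m["month"]), int(m["day"]))
    return found + timedelta(days=days)
```

The pattern stays short enough to read at a glance, and anything resembling logic (here, calendar arithmetic) lives in ordinary code.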

      One has a feeling that regexp engines are just becoming programming languages in and of themselves - the only difference being that the 'program' consists of a string of cryptic single character commands, and the input is limited to a single string.

      -josh

    2. Re:at some point... by Pseudonym · · Score: 3, Funny
      One has a feeling that regexp engines are just becoming programming languages in and of themselves [...]

      Not true. Yet.

      Perl 5 regexes can solve NP-hard problems, but they're not quite Turing complete. However, they require only four additional stack operators to do that.

      Personally, I'm waiting for the first Perl regex to become sentient.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
  7. regexp and programmers by revscat · · Score: 4, Insightful

    Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. I don't care what language(s) you're proficient in, or if you've memorized every single design pattern the GoF has ever conceived, or do 4 foot by 6 foot UML diagrams in your head. If you can't do regexps then you're missing a basic skill. I bought Friedl's book a couple of years ago, and although I wound up not using much of the Perl-related stuff, the rest of the book helped me out immensely.

    A programmer without knowledge of regular expressions is like a carpenter without a hammer.

    1. Re:regexp and programmers by Anonymous Coward · · Score: 4, Insightful

      A programmer without knowledge of regular expressions is like a carpenter without a hammer.

      If ever there was an apt analogy of regular expressions - that's it! They make everything seem like a nail ;).

    2. Re:regexp and programmers by revscat · · Score: 3, Interesting

      Sure, I've used them in a couple small scripts for parsing text, but if you see the majority of programming requiring regex, you definitely need to put your hammer down and pick up a Makita.

      Well, I am certainly not advocating the broad use of regexps in application programming, even though it has been demonstrated to be possible. For me, regexps are an important tool in solving side issues/behind the scenes work, such as formatting a series of configuration files in a given manner, or making broad changes to a set of HTML files, and so forth. I don't do Perl, and don't really like to if I can avoid it, but I still use regular expressions on a daily basis, and have found them to be immensely helpful.

  8. Disagree, Personal Experience by N8F8 · · Score: 5, Informative

    Case in point: Six months ago I was handed a printed copy of our family history that was to be published by my late uncle. About 1500 pages of history and genealogy. After using a scanner and OCR to get the raw text I used regular expressions to:

    1) Identify hierarchical relationships that were only denoted by standard ordered list types (1,1a,2,2a,3, I, II, etc).
    2) Insert html markup to reproduce proper highlighting for names and indented lists.
    3) Generate internal HTML links between individuals, using their unique GEDCOM (LDS Genealogy) number within the document.
    4) Build an index for chapters and an appendix to link from name, sorted by surname, back into the main document.
    5) Add special markup for converting the end HTML into indexed and linked PDF using HTMLDoc.

    Time to complete the job: 2 weeks. Without the use of regular expressions this task would have been almost impossible, and all the work my uncle did over the last two years of his life to put the information together would have been lost.

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
  9. Re:indeed by Christianfreak · · Score: 3, Informative

    Who said anything about the Internet? Honestly I didn't read the article but I do have the first version of the book they are talking about and it has nothing to do with the internet rather pattern matching in programming.

    That is why research into regexps is doomed to failure. It is a dead end. From a theoretical standpoint, regexps are cute and interesting, but for serious data prowling, you need something with a brain and a heart.

    I agree that for large amounts of data you need something other than a regex, but that certainly doesn't mean that regexs are dead or that we shouldn't try to make them better! I don't need Google's search algorithm to make sure my user's input matches certain parameters, and I would really hate to have to write

    if $input contains really_evil_characters() die;

    Regex is here to stay
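For illustration, a small Python sketch of the kind of input check described above. The rule itself (3 to 16 letters, digits or underscores) and the names `USERNAME_RE` and `is_valid_username` are assumptions invented for this example. It whitelists what is allowed instead of hunting for "evil" characters.

```python
import re

# Hypothetical rule for illustration: 3-16 characters, letters, digits, underscore.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,16}")

def is_valid_username(value):
    # fullmatch requires the whole string to match, so a stray character
    # anywhere in the input rejects it.
    return USERNAME_RE.fullmatch(value) is not None
```

Using `fullmatch` rather than `search` is the important design choice here: `search` would accept any input that merely contains a valid-looking run somewhere inside it.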

  10. Regular Expressions Haven't Changed by jhunsake · · Score: 3, Interesting

    Regular expressions haven't changed since the seventies, at the latest. Now if you want to say that implementations of regular expressions are advancing, fine. Let's be precise in our use of language, or not.

    1. Re:Regular Expressions Haven't Changed by joto · · Score: 3, Interesting
      Well, that's true because a regular expression is nothing but a compact way to describe a deterministic finite state machine. On the other hand, regexps are not. Regexps have nothing at all to do with deterministic finite state machines, except for the fact that the syntax is inspired by them.

      PS: Note the difference between "regular expression" which is what they teach you about in CS classes, and "regexps", which is what programmers actually use in Perl and many other languages.
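One concrete way to see the gap between the two, sketched in Python: backreferences let a regexp match the "doubled string" language (some chunk followed by an exact copy of itself), which is a textbook example of a language that is not regular. The names `SQUARE_RE` and `is_doubled` are made up for this illustration.

```python
import re

# ^(.+)\1$ matches strings made of some non-empty chunk repeated twice ("ww").
# The backreference \1 must match exactly what group 1 captured, which no
# finite state machine can track for chunks of arbitrary length.
SQUARE_RE = re.compile(r"^(.+)\1$")

def is_doubled(s):
    return SQUARE_RE.match(s) is not None
```

For example, `is_doubled("abcabc")` holds while `is_doubled("abcabd")` does not.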

  11. Ummm.... by MemeRot · · Score: 4, Funny

    "Let's be precise in our use of language, or not."

    Very compressed contentlessness.

  12. Where's Clippy when ya need him .. by TheViffer · · Score: 5, Funny

    "I see that you are writing a regular expression"

    --
    -- Knowing too much can get you killed, but knowing who knows too much can make you rich.
    1. Re:Where's Clippy when ya need him .. by fava · · Score: 3, Funny

      Shouldn't that be:

      "I see that you are swearing, would you like to use a thesaurus"

  13. Getting started with regular expressions by paj1234 · · Score: 5, Informative

    I have the first edition of "Mastering Regular Expressions" and it is indeed a very fine useful book.

    For a nice way to get started with regular expressions I recommend the wonderful "txt2regex" console program. It provides a simple text based wizard-like interface. You answer questions and the program builds your regular expression for you. See:

    http://txt2regex.sourceforge.net/

  14. Re:indeed by platypus · · Score: 5, Informative

    [talk about regexps are not so useful..., but ...] What has become useful is what Google [google.com] taps into. And that is the human aspect. Data isn't important because it matches a*(b|c)a*. It's important because it is useful to people. Think about it: when you are looking for wares or porn, where do you go? Perl? Nope. IRC. Why? Because of the human element.

    I understand your thinking.
    But your thinking is wrong.
    Think about it (no pun intended).
    How much better would google be if one could use regexps in one's search request.
    regexp and datamining are orthogonal.

  15. Re:indeed by RisingSon · · Score: 3, Funny
    Hehe...thanks for your funny post.

    Regexps are interesting, sure.

    Not really. I use them all the time and the only time they are interesting is when you're done and they look completely silly.

    Every CS student enjoys (or suffers through!) the regexp section of their Intro to Computability (or equivalent) course.

    Not really. I got a degree in Computer Engineering from the #2 private engineering school in the country and I was never taught regex. If you know how to program and not just crank out syntax, you can pick up regex on your own pretty fast.

    And it is pretty fun thinking about the expressive power of, say (a|b)*a*b*

    That is actually a really boring regex. Lots of a's or b's followed by lots of a's followed by lots of b's. Wow. My brain is fried.
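A quick way to check just how boring it is, as a Python sketch: since the leading `(a|b)*` already accepts every string of a's and b's, the trailing `a*b*` should add nothing. The brute-force comparison below is an illustration only; `FANCY`, `PLAIN` and `same_language_up_to` are names invented for it, and it only tests strings up to a chosen length.

```python
import re
from itertools import product

FANCY = re.compile(r"(a|b)*a*b*")
PLAIN = re.compile(r"(a|b)*")

def same_language_up_to(max_len):
    """Compare both patterns on every string over {a, b, c} up to max_len characters."""
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            s = "".join(chars)
            # The extra letter c makes sure both patterns also reject the same strings.
            if bool(FANCY.fullmatch(s)) != bool(PLAIN.fullmatch(s)):
                return False
    return True
```

Calling `same_language_up_to(6)` compares the two patterns on all 1093 candidate strings of length 0 through 6.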

    However, we have to face the facts, that regexps, as good as they are from a mathematical standpoint at matching things, just aren't that helpful in sorting through the sea of data that is the Internet.

    Wow. You're probably right. I'll bet nothing that searches for things on the internet, such as google.com, uses any regex internally in their code. Now that I'm facing the facts, you're right, regex is worthless when it comes to searching through any amount of data.

    The input data just aren't orderly enough for regexps to be of any use.

    Yeah, regex is best used for very very simple patterns. Anything more complex than your above example is best suited for some serious hand-parsing in visual basic.

    Think about it: when you are looking for wares or porn, where do you go? Perl? Nope.

    I don't know WTF you're talking about. I find ALL my porn at www.perlmonks.org

    That is why research into regexps is doomed to failure.

    Yeah, I should probably throw away all that perl regex code I've written that's made my company lots (and I mean lots) of money in the market. It is doomed. I should be writing my pattern matching code in the google.com language.

    Thank you for posting about something you apparently know very little about. Good for an afternoon giggle.

  16. behind-the-scene purpose by jfriedl · · Score: 5, Informative

    The original poster says that the "behind-the-scene purpose is apparently to push a new book that O'Reilly published this month". Actually, that's pretty much the main point of the article -- to justify the need for a second edition, and to let people know what they'd get (or, if not interested, what they're passing on).

    I wrote the article so that people would have a feel for what's new in the book. Of course, my hope is that people are interested in the new content, but my general feeling is that the worst that can happen is that someone buys the book and finds out that it's not what they expected. Unmet expectations pretty much suck, and I hope the article helps avoid some of that suckage.... and piques some interest, as well.

    Jeffrey

    1. Re:behind-the-scene purpose by imr · · Score: 3, Interesting

      thanks for your book.
      Everybody here and there is going to say how informative it is. But what struck me the most is that it is well written.
      It was very pleasant to read it, apart from the knowledge I got from it. If only all manuals ...

  17. Re:Validate XML? by bunratty · · Score: 5, Informative
    Is there a regexp to validate XML?
    No, you cannot even tell if XML is well-formed with a regex. The reason is that it takes an unbounded amount of memory to remember which tags are still open, but regexs have only a bounded amount of memory.

    One of the important aspects of using regexes is to know their limits and not try to use them outside of those limits.
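To make the memory argument concrete, here is a minimal Python sketch in which a regex only finds the tags and an explicit stack remembers which ones are still open. It is deliberately simplified (no attributes, comments, CDATA or entities) and is nowhere near a real XML well-formedness check; `TAG_RE` and `tags_balanced` are names made up for the example.

```python
import re

# Simplified tag tokenizer: bare names only, no attributes, comments, or CDATA
# (an assumption made to keep the sketch short).
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)(/?)>")

def tags_balanced(text):
    stack = []  # grows with nesting depth: the unbounded memory a regex lacks
    for closing, name, self_closing in TAG_RE.findall(text):
        if self_closing:
            continue  # <br/> style tags open and close at once
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # anything left on the stack was never closed
```

The regex does the part it is good at (recognizing a single tag), while the nesting is tracked by a data structure that can grow as deep as the document does.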

    --
    What a fool believes, he sees, no wise man has the power to reason away.
  18. Friedl's book is a must read for Perl folks by Lumpish+Scholar · · Score: 5, Insightful

    It's not just a Perl book, but the language independent and Perl dependent parts are a godsend.

    I was a full time Perl programmer (with a two hour commute by rail) when Friedl's book came out. I read it cover to cover, and then recommended it strongly to my co-workers.

    Friedl shows how to write powerful, readable, efficient regular expressions that can do a lot of the work your program needs to do. It changed how my group wrote Perl (very much for the better). This is more than highly recommended; after the Blue Camel, and even before the Cookbook, this is a definitive book for all those who call themselves "Perl programmers."

    (In the first edition of the book, Friedl discovered some problems with regular expressions in early versions of Perl 5. The very next release of Perl -- 5.003, I think -- immediately fixed these problems. When Larry & Co. pay attention to a Perl book, maybe you should, too?)

    --
    Stupid job ads, weird spam, occasional insight at
  19. Re:Contentless article by Get+Behind+the+Mule · · Score: 5, Insightful
    That is one of the most contentless articles I have seen in a long time.

    A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.
    You get a B-, Bunky. And here's your cookie.

    After you've finished your undergrad CS theory class, you might go on to discover that implementations of regexes under various paradigms and in the various languages have extremely rich variety regarding syntax, semantics and efficiency. This isn't about the pristine theory of Prof. Chomsky, but about the actual use of regexes as programming constructs, and that's a tremendously complex subject. Friedl's book in the first edition is one of the best I've ever seen that has tackled such complexity and made it accessible and useful for the everyday business of programming.

    The article indicates that the practical use of regexes, far from stagnating since Chomsky's time, continues to evolve and grow. That's only "contentless" if you're stuck in the ivory tower and don't intend to leave.
  20. perl 6 is gonna change all this by millette · · Score: 4, Insightful
    Anyone here who read the latest Perl apocalypse, #5 it was, knows full well that regexes as we know and love them are out the window. The apocalypse is a large document, so I picked this page to give you a little idea of what's going to change. The pages before that mention all the warts that Larry wants to bury.

    I understand that Perl 6 isn't near being done, and that the "r" in "Perl" doesn't necessarily stand for "regex", depending on who you ask, but Perl will always have the greatest influence over what is called a regex. Or is that going to change with Perl 6?

  21. Re:What?! by jbolden · · Score: 3, Informative

    The major reason to learn Perl is powerful string manipulation. Those "/:+[^:]/ statements" are the power string manipulators. Try to do anything hard with strings in any language without regexes and you'll understand what the big deal is.

  22. Regex Accelerator! by Anonymous Coward · · Score: 3, Informative

    For the ultimate in regex'ing ... hardware regex accelerators!!!

  23. Re:regexp criticism by thrig · · Score: 3, Insightful

    Sounds kind of like what the Regexp::English perl module does.

    You may also want to look at the YAPE::Regex series of modules that allow parsing/extracting/explaining of regex.

  24. regexp are way overrated by The+Cookie+Monster · · Score: 5, Informative
    I know and use regular expressions, but their use is often symptomatic of poor design, which makes me somewhat suspicious of those who live and breathe regexps and preach them to the world.
    • Text processing - why isn't your text marked up? Converting data into text, passing it along, and then trying to pluck the data back out of the text is brittle and leaves you with a system that can't be upgraded - your components can't be improved to produce a more informative text stream as it will break all the regexpr's of all the components that use that stream etc.

      Text straight from the keyboard of a user won't be marked up and seems a good place to be using regular expressions. Due to the popularity of brittle and unupgradable (is that a word?) text processing, the input from other programs might not be marked up either, here regexprs are necessary (ie symptomatic of poor design, but it wasn't your decision).

    • Parsing - how many times have you encountered an HTML or XML parser written with a regexpr? Unless your job requires you to code by the seat of your pants, this is just plain lazy. Parsers written with regular expressions are always incomplete (ie they work on the subset of HTML/XML they were tested on, and if the requirements or layout ever changes they break), and they are very slow compared to a proper parser. Proper robust and well tested parsers are available under most licenses and for most languages.

      This applies to much more than just HTML or XML, eg if you're going to write a javadoc clone for your pet language, do it properly, don't do it with regular expressions.

    • Development - Regular expressions appear to be developed with a 'try it and see' methodology - people write the regexpr and test it, thinking if it works then they must have done it right. This is very brittle, I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed - there's always some guy with a % in their email address or some other oddity the author of the regexpr forgot or didn't know about (and let's not even think about trying to make an RFC compliant email address regexpr, it would have to handle "blarg@wibble"@slashdot.org)

      That HTML tag stripper you hacked up, did you remember to handle comments? Just because there weren't any comments in the HTML it was tested on doesn't mean it'll never encounter them in the real world (wouldn't be an issue if an off the shelf parser is used).
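The comment case can be reproduced in a few lines of Python. This is a sketch under assumptions: `strip_tags_naive` stands in for a typical hastily written stripper (not anyone's actual code), and `strip_tags_parser` uses the standard library's `html.parser.HTMLParser` as the "off the shelf parser".

```python
import re
from html.parser import HTMLParser

def strip_tags_naive(html):
    # The kind of quick regex stripper the comment warns about.
    return re.sub(r"<[^>]*>", "", html)

class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Comments are routed to handle_comment,
        # which is left as the default no-op, so they are dropped.
        self.parts.append(data)

def strip_tags_parser(html):
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)
```

On an input such as `<p>Hello<!-- if a > b then --> world</p>`, the naive version treats the `>` inside the comment as the end of a tag and leaves the rest of the comment in the output, while the parser-based version drops the comment as a unit.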
    I don't know, there are other issues with regular expressions but I've spent too long on this post already. I'm curious as to others' views on this - I've just come to associate use of regular expressions with flaky or hastily written software.
    1. Re:regexp are way overrated by Anthony+Boyd · · Score: 4, Interesting
      Text processing - why isn't your text marked up?

      While you later concede that form input and input from other programs might be good reasons to use a regex, that you would even pose this question is strange. For 90% of the regex fans, form input and screen scraping is exactly what they do. For almost any Web developer, this is the day-in, day-out norm. So your point seems to downplay the very uses that have made regex's so popular.

      I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed

      You realize this does not bolster your claim that regex's are "overrated" -- it merely points out that some developers are overrated. A bad developer does not make a language bad.

      That HTML tag stripper you hacked up, did you remember to handle comments?

      Same as above. You're complaining about human error and then blaming the regex system itself.

      I've just come to associate use of regular expressions with flaky or hastily written software.

      Of course. But the hastily written software is the other software we interact with, not our own. And that's a broad generalization for many developers, so of course you can find exceptions. But you asked for other people's views, and in my view, regex's are sorely needed -- not so bad developers can stay bad, but so that the good developers can clean up the messes left behind after the bad developers go. It's a nice bonus that good regex developers can pull in hostile data, screen scrape, and cleanse form input. That helped one of my employees get a raise last quarter.

  25. Re:Mod parent up! by mikec · · Score: 3, Informative

    Regular expressions were certainly an important innovation, but they're a lot more than 20 years old. They were first studied by Kleene in the mid-1950's. The first algorithm to translate them into DFA's was invented in about 1960. Lex was written in the mid 70's.