Slashdot Mirror


Next Generation Regexp

prostoalex writes "Jeffrey E. F. Friedl, author of the newly published 2nd edition of Mastering Regular Expressions, wrote a feature article for O'Reilly Network on the recent innovations in the regular expression world. You'd think that such an area as regular expressions would be fairly stable, but according to the author, 'when I started to work on the second edition of Mastering Regular Expressions and started refocusing on the field, I was rather shocked to find out how much had really changed'. The article's behind-the-scenes purpose is apparently to push a new book that O'Reilly published this month, but it has great educational value for anyone involved with practical extracting and reporting."

13 of 248 comments

  1. Perl6 regular expressions - forget everything by Anonymous Coward · · Score: 3, Interesting

    Perl6 is going to radically change regular expressions as well. I guess the term "regular expression" is pretty vague/useless these days. You have to identify the language _and_ its revision to get an accurate idea of the regexp feature set you're dealing with. Just throw some variables and control structures into regexp and we'll have a full-blown extremely cryptic language. Maybe we need a RegExp Institute of Excellence with yearly meetings in Sweden or something.

  2. what about perl 6? by jbennetto · · Score: 5, Interesting

    He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 and Synopsis 5.

    1. Re:what about perl 6? by tswinzig · · Score: 3, Interesting

      He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 [perl.com] and Synopsis 5 [perl.com].

      If you could write and use a Perl 6 program right now, maybe he'd include a chapter on it in his book.

      This article is basically an overview of his book. His book doesn't cover Perl 6 regex's. Why should it? Perl 6 isn't even done yet, and so everything new for Perl 6 could change by the time it comes out.

      --

      "And like that ... he's gone."
  3. at some point... by g4dget · · Score: 4, Interesting
    Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions--a simple built-in parser generator with executable annotations is both clearer and more powerful. Parser generator syntax allows comments and whitespace, and it is simple and fairly standard.

    Perl and other languages should leave "good enough" alone when it comes to regular expressions and instead just make it easy to put chunks of grammars into programs.

    1. Re:at some point... by Anonymous Coward · · Score: 1, Interesting
      Perl and other languages should leave "good enough" alone when it comes to regular expressions and instead just make it easy to put chunks of grammars into programs.

      heh.

      To be slightly more detailed, you cite the following limitations in regular expressions:
      • Comments - Perl5 allows comments in regular expressions, but the syntax is clunky. The Perl6 general comment syntax and regular expression comment syntax will be unified and simple.
      • Whitespace - In Perl5 you can already use whitespace to format your regular expressions by using the "x" modifier. In Perl6 this will be the non-optional default (which makes more sense given Unicode anyway).
      • Simple syntax - I'd much rather have a rich syntax and simple code, personally.
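The comment and whitespace points in the list above are not specific to Perl. As a rough illustration, here is a minimal sketch in Python, whose `re.VERBOSE` flag plays a role similar to Perl 5's "x" modifier. The date pattern and the names `DATE` and `parse_date` are made up for this example and do not come from the thread.

```python
import re

# Hypothetical pattern: an ISO-style date such as 2002-09-05.
# With re.VERBOSE, whitespace in the pattern is ignored and '#' starts a comment.
DATE = re.compile(
    r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
    """,
    re.VERBOSE,
)


def parse_date(text):
    """Return (year, month, day) as ints, or None if no date is found."""
    m = DATE.search(text)
    if m is None:
        return None
    return int(m.group("year")), int(m.group("month")), int(m.group("day"))
```

Calling `parse_date` on a line of text returns a tuple of integers for the first date-like substring, or `None` when there is none.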
  4. Regular Expressions Haven't Changed by jhunsake · · Score: 3, Interesting

    Regular expressions haven't changed since the seventies, at the latest. Now if you want to say that implementations of regular expressions are advancing, fine. Let's be precise in our use of language, or not.

    1. Re:Regular Expressions Haven't Changed by joto · · Score: 3, Interesting
      Well, that's true because a regular expression is nothing but a compact way to describe a deterministic finite state machine. On the other hand, regexps are not. Regexps have nothing at all to do with deterministic finite state machines, except for the fact that the syntax is inspired by them.

      PS: Note the difference between "regular expression" which is what they teach you about in CS classes, and "regexps", which is what programmers actually use in Perl and many other languages.
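To make the "regular expression as finite state machine" view concrete, here is a small sketch in Python. It hand-codes a deterministic automaton for the textbook expression `a+b` and compares it with Python's `re` module. The state numbering and the names `TRANSITIONS`, `dfa_accepts` and `regex_accepts` are assumptions chosen for illustration.

```python
import re

# Hand-written DFA for the textbook regular expression a+b.
# States: 0 = start, 1 = seen one or more 'a', 2 = accepted (seen the final 'b').
TRANSITIONS = {
    (0, "a"): 1,
    (1, "a"): 1,
    (1, "b"): 2,
}
ACCEPTING = {2}


def dfa_accepts(text):
    """Run the DFA over text; a missing transition means rejection."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


def regex_accepts(text):
    """Same language, expressed as a regular expression matched against the whole string."""
    return re.fullmatch(r"a+b", text) is not None
```

For a pattern this simple the two functions agree on every input. The automaton needs only a fixed number of states and no memory of what it has already read, which is the property that features such as backreferences give up.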

  5. Re:.NET regexps and Microsoft's documentation by Anonymous Coward · · Score: 1, Interesting

    > Microsoft has the best doco of ANY software development company.

    ROTFL! Clearly you've never seen any DEC software manuals. "ANY" is more than a little bit too strong.

  6. A different look at string scanning by Anonymous Coward · · Score: 1, Interesting

    Way back when there was a programming language called "Snobol". It still lives (www.snobol4.com for a good starting point).

    Snobol is *THE* string pattern matching language. Nothing else beats it (and I've been playing around with string processing languages for over 20 years).

    Yes.. its syntax is different and the language hasn't changed in years (decades?). But it does the job exceedingly well.

    You might also want to take a look at the Icon programming language (www.cs.arizona.edu/icon).

    Icon was developed by some of the same folks that developed Snobol. While not quite as powerful as Snobol in terms of expressing patterns, Icon extended some concepts. You can build up your own pattern matching functions.

    One of the best quotes I saw in a discussion concerning Icon and regular expressions (the discussion was that Icon lacked a builtin regular expression facility) was

    "Putting regular expressions into Icon would be like putting training wheels on a Harley" -- (I really wish I could remember who said that).

    Anyway... just something you might want to check into.

  7. Re:regexp and programmers by revscat · · Score: 3, Interesting

    Sure, I've used them in a couple small scripts for parsing text, but if you see the majority of programming requiring regex, you definitely need to put your hammer down and pick up a Makita.

    Well, I am certainly not advocating the broad use of regexps in application programming, even though it has been demonstrated to be possible. For me, regexps are an important tool in solving side issues/behind the scenes work, such as formatting a series of configuration files in a given manner, or making broad changes to a set of HTML files, and so forth. I don't do Perl, and don't really like to if I can avoid it, but I still use regular expressions on a daily basis, and have found them to be immensely helpful.

  8. Re:what? by Fizgig · · Score: 2, Interesting

    A Perl "regular expression" is more powerful than a mathematical "regular expression." Perl's can use backreferences, which a finite automaton can't do.
    The Perl "RE" "^(a+)b\1$" will match aba and aaabaaa, but not abaa or aaba.
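Whether a string "matches" a backreference pattern depends on whether the whole string or only some substring has to match. A small sketch in Python, which supports the same `\1` backreference syntax; the helper names `whole_match` and `partial_match` are made up for this example.

```python
import re

# \1 must repeat exactly the text captured by the first group (a+).
PATTERN = re.compile(r"(a+)b\1")


def whole_match(text):
    """True if the entire string has the form a^n b a^n with n >= 1."""
    return PATTERN.fullmatch(text) is not None


def partial_match(text):
    """True if any substring of text has that form."""
    return PATTERN.search(text) is not None
```

With `whole_match`, the number of a's on both sides of the b has to be equal, so the matcher must remember what the group captured. With `partial_match`, a string such as abaa also succeeds, because it contains aba.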

  9. Re:behind-the-scene purpose by imr · · Score: 3, Interesting

    thanks for your book.
    Everybody here and there is going to say how informative it is. But what struck me the most is that it is well written.
    It was very pleasant to read it, apart from the knowledge I got from it. If only all manuals ...

  10. Re:regexp are way overrated by Anthony+Boyd · · Score: 4, Interesting
    Text processing - why isn't your text marked up?

    While you later concede that form input and input from other programs might be good reasons to use a regex, that you would even pose this question is strange. For 90% of the regex fans, form input and screen scraping are exactly what they do. For almost any Web developer, this is the day-in, day-out norm. So your point seems to downplay the very uses that have made regex's so popular.

    I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed

    You realize this does not bolster your claim that regex's are "overrated" -- it merely points out that some developers are overrated. A bad developer does not make a language bad.

    That HTML tag stripper you hacked up, did you remember to handle comments?

    Same as above. You're complaining about human error and then blaming the regex system itself.
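The comment-handling pitfall mentioned in the quoted question can be shown with a short sketch in Python. Both patterns and the function names are illustrative assumptions, not anyone's actual tag stripper, and neither is a substitute for a real HTML parser.

```python
import re

# Naive stripper: treats everything from '<' to the next '>' as a tag.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Slightly more careful: try to remove whole comments first, then ordinary tags.
COMMENT_OR_TAG = re.compile(r"<!--.*?-->|<[^>]*>", re.DOTALL)


def strip_naive(html):
    """Remove tags, but cut comments short at the first '>' inside them."""
    return NAIVE_TAG.sub("", html)


def strip_with_comments(html):
    """Remove comments as a unit, then tags."""
    return COMMENT_OR_TAG.sub("", html)
```

On input like `<p>Hi<!-- if a > b --> there</p>`, the naive version stops at the `>` inside the comment and leaves ` b -->` behind, while the second version removes the whole comment. The second version still mishandles cases such as a `>` inside an attribute value or the contents of script blocks.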

    I've just come to associate use of regular expressions with flakey or hastily written software.

    Of course. But the hastily written software is the other software we interact with, not our own. And that's a broad generalization for many developers, so of course you can find exceptions. But you asked for other people's views, and in my view, regex's are sorely needed -- not so bad developers can stay bad, but so that the good developers can clean up the messes left behind after the bad developers go. It's a nice bonus that good regex developers can pull in hostile data, screen scrape, and cleanse form input. That helped one of my employees get a raise last quarter.