Slashdot Mirror


Next Generation Regexp

prostoalex writes "Jeffrey E. F. Friedl, author of the newly published 2nd edition of Mastering Regular Expressions, wrote a feature article for O'Reilly Network on the recent innovations in the regular expression world. You'd think that an area such as regular expressions would be fairly stable, but according to the author, 'when I started to work on the second edition of Mastering Regular Expressions and started refocusing on the field, I was rather shocked to find out how much had really changed'. The article's behind-the-scene purpose is apparently to push a new book that O'Reilly published this month, but it has great educational value for anyone involved with practical extracting and reporting."

12 of 248 comments

  1. When is RegExp2 Going To Be Shipped by N8F8 · · Score: 3, Informative

    Amazon has slipped the shipping date twice. I don't know about you, but this book is definitely a "Must Have".

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
  2. Disagree, Personal Experience by N8F8 · · Score: 5, Informative

    Case in point: Six months ago I was handed a printed copy of our family history that was to be published by my late uncle: about 1500 pages of history and genealogy. After using a scanner and OCR to get the raw text, I used regular expressions to:

    1) Identify hierarchical relationships that were only denoted by standard ordered list types (1, 1a, 2, 2a, 3, I, II, etc).
    2) Insert HTML markup to reproduce proper highlighting for names and indented lists.
    3) Generate internal HTML links between individuals, using their unique GEDCOM (LDS Genealogy) number within the document.
    4) Build an index for chapters and an appendix to link from name, sorted by surname, back into the main document.
    5) Add special markup for converting the end HTML into indexed and linked PDF using HTMLDoc.

    Time to complete the job: 2 weeks. Without regular expressions this task would have been almost impossible, and all the work my uncle did to put the information together during the last two years of his life would have been lost.
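
    A minimal sketch of what step 1 might look like, in Python. The post does not say which language or patterns were actually used, so the pattern below is hypothetical: it assumes each marker sits at the start of a line and is followed by "." or ")". It would also misread a line starting with an initial such as "C. Smith" as a Roman numeral, which is the kind of case real OCR text would force you to handle.

```python
import re

# Hypothetical pattern (an assumption, not the original author's):
# a Roman numeral (I, II, IV) or digits with an optional letter (1, 1a),
# followed by "." or ")" and then the entry text.
MARKER = re.compile(
    r"^\s*(?:(?P<roman>[IVXLC]+)|(?P<num>\d+)(?P<sub>[a-z])?)[.)]\s+(?P<text>.*)$"
)

def classify(line):
    """Return (kind, marker, text) for a list line, or None for plain text."""
    m = MARKER.match(line)
    if m is None:
        return None
    if m.group("roman"):
        return ("roman", m.group("roman"), m.group("text"))
    if m.group("sub"):
        return ("sub", m.group("num") + m.group("sub"), m.group("text"))
    return ("number", m.group("num"), m.group("text"))
```

    The kind of marker (number, sub-entry, Roman numeral) is what would let a later pass decide how deeply to indent each entry.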

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
  3. Re:indeed by Christianfreak · · Score: 3, Informative

    Who said anything about the Internet? Honestly, I didn't read the article, but I do have the first edition of the book they are talking about, and it has nothing to do with the Internet; it is about pattern matching in programming.

    That is why research into regexps is doomed to failure. It is a dead end. From a theoretical standpoint, regexps are cute and interesting, but for serious data prowling, you need something with a brain and a heart.

    I agree that for large amounts of data you need something other than a regex, but that certainly doesn't mean that regexes are dead or that we shouldn't try to make them better! I don't need Google's search algorithm to make sure my users' input matches certain parameters, and I would really hate to have to write:

    if $input contains really_evil_characters() die;

    Regex is here to stay.
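
    For illustration, a minimal sketch of that kind of input check in Python. The allowed character set and the length limit are assumptions made up for the example, not anything from the post above.

```python
import re

# Assumed policy: 1 to 32 characters, each a letter, digit, underscore,
# dot, or hyphen. Anything else is rejected.
SAFE_INPUT = re.compile(r"[A-Za-z0-9_.-]{1,32}")

def is_safe(user_input):
    """True if the whole string consists only of allowed characters."""
    # fullmatch anchors at both ends, so a trailing newline is rejected too.
    return SAFE_INPUT.fullmatch(user_input) is not None
```

    Listing what is allowed, rather than trying to enumerate every evil character, tends to be the safer design.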

  4. Getting started with regular expressions by paj1234 · · Score: 5, Informative

    I have the first edition of "Mastering Regular Expressions" and it is indeed a very fine, useful book.

    For a nice way to get started with regular expressions I recommend the wonderful "txt2regex" console program. It provides a simple text-based, wizard-like interface. You answer questions and the program builds your regular expression for you. See:

    http://txt2regex.sourceforge.net/

  5. Re:indeed by platypus · · Score: 5, Informative

    [talk about regexps are not so useful..., but ...] What has become useful is what Google [google.com] taps into. And that is the human aspect. Data isn't important because it matches a*(b|c)a*. It's important because it is useful to people. Think about it: when you are looking for wares or porn, where do you go? Perl? Nope. IRC. Why? Because of the human element.

    I understand your thinking.
    But your thinking is wrong.
    Think about it (no pun intended).
    How much better would Google be if one could use regexps in one's search request?
    Regexps and data mining are orthogonal.

  6. behind-the-scene purpose by jfriedl · · Score: 5, Informative

    The original poster says that the "behind-the-scene purpose is apparently to push a new book that O'Reilly published this month". Actually, that's pretty much the main point of the article -- to justify the need for a second edition, and to let people know what they'd get (or, if not interested, what they're passing on).

    I wrote the article so that people would have a feel for what's new in the book. Of course, my hope is that people are interested in the new content, but my general feeling is that the worst that can happen is that someone buys the book and finds out that it's not what they expected. Unmet expectations pretty much suck, and I hope the article helps avoid some of that suckage.... and piques some interest, as well.

    Jeffrey

  7. Re:Validate XML? by bunratty · · Score: 5, Informative
    Is there a regexp to validate XML?
    No, you cannot even tell if XML is well-formed with a regex. The reason is that it takes an unbounded amount of memory to remember which tags are still open, but regexes have only a bounded amount of memory.

    One of the important aspects of using regexes is to know their limits and not try to use them outside of those limits.

    --
    What a fool believes, he sees, no wise man has the power to reason away.
  8. Re:What?! by jbolden · · Score: 3, Informative

    The major reason to learn Perl is powerful string manipulation. Those "those /:+[^:]/ statements" are the power string manipulators. Try to do anything hard with strings in any language without regexes, and then you'll understand what the big deal is.

  9. Regex Accelerator! by Anonymous Coward · · Score: 3, Informative

    For the ultimate in regex'ing ... hardware regex accelerators!!!

  10. regexp are way overrated by The+Cookie+Monster · · Score: 5, Informative
    I know and use regular expressions, but their use is often symptomatic of poor design, which makes me somewhat suspicious of those who live and breathe regexps and preach them to the world.
    • Text processing - why isn't your text marked up? Converting data into text, passing it along, and then trying to pluck the data back out of the text is brittle and leaves you with a system that can't be upgraded: your components can't be improved to produce a more informative text stream, because that would break the regexps of every component that uses the stream.

      Text straight from the keyboard of a user won't be marked up and seems a good place to use regular expressions. Due to the popularity of brittle and unupgradable (is that a word?) text processing, the input from other programs might not be marked up either; here regexps are necessary (i.e. symptomatic of poor design, but it wasn't your decision).

    • Parsing - how many times have you encountered an HTML or XML parser written with a regexp? Unless your job requires you to code by the seat of your pants, this is just plain lazy. Parsers written with regular expressions are always incomplete (i.e. they work on the subset of HTML/XML they were tested on, and if the requirements or layout ever change they break), and they are very slow compared to a proper parser. Proper, robust and well-tested parsers are available under most licenses and for most languages.

      This applies to much more than just HTML or XML, e.g. if you're going to write a javadoc clone for your pet language, do it properly, don't do it with regular expressions.

    • Development - Regular expressions appear to be developed with a 'try it and see' methodology: people write the regexp and test it, thinking that if it works then they must have done it right. This is very brittle. I've encountered many regexps for email addresses; all of them work on your bog standard address, none of them work when deployed. There's always some guy with a % in his email address or some other oddity the author of the regexp forgot or didn't know about (and let's not even think about trying to make an RFC compliant email address regexp, it would have to handle "blarg@wibble"@slashdot.org).

      That HTML tag stripper you hacked up, did you remember to handle comments? Just because there weren't any comments in the HTML it was tested on doesn't mean it'll never encounter them in the real world (wouldn't be an issue if an off the shelf parser is used).
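
    A small Python sketch of both failure modes described above. The two patterns are made-up examples of the "bog standard" kind, not taken from any real project; they are here only to show how a regexp that passes casual testing breaks on legal input.

```python
import re

# Hypothetical naive email pattern (an assumption for illustration).
NAIVE_EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical naive tag stripper: delete anything between "<" and the next ">".
NAIVE_TAG = re.compile(r"<[^>]*>")

def naive_email_ok(address):
    """True if the naive pattern accepts the whole address."""
    return NAIVE_EMAIL.fullmatch(address) is not None

def naive_strip_tags(html):
    """Remove everything the naive pattern thinks is a tag."""
    return NAIVE_TAG.sub("", html)
```

    The email pattern accepts user@example.com but rejects both user%relay@example.com and "blarg@wibble"@slashdot.org, because neither % nor a quoted local part was anticipated. The tag stripper handles <b>hi</b> but stops a comment at the first > it finds, so part of a comment containing > leaks into the output.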
    I don't know, there are other issues with regular expressions but I've spent too long on this post already. I'm curious as to others' views on this - I've just come to associate use of regular expressions with flaky or hastily written software.
  11. Re:Mod parent up! by mikec · · Score: 3, Informative

    Regular expressions were certainly an important innovation, but they're a lot more than 20 years old. They were first studied by Kleene in the mid-1950s. The first algorithm to translate them into DFAs was invented in about 1960. Lex was written in the mid-70s.

  12. Re:Validate XML? by Pembers · · Score: 2, Informative

    You're correct in saying that regexps alone can't validate XML (or any hierarchical structure, come to that). This is an instance of the bracket-matching problem: given a string composed of opening and closing brackets that can nest, determine whether the string is properly balanced or not. For instance, ()() and (()()) are balanced, while (() and (())) are not.
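
    A minimal Python sketch of the check itself, for reference. It uses an integer counter that can grow without limit, which is exactly the kind of memory a true regular expression lacks. It assumes only ( and ) matter and ignores any other characters.

```python
def is_balanced(s):
    """True if every ( has a matching ) and nothing closes too early."""
    depth = 0  # number of opening brackets not yet closed
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing bracket with nothing open.
                return False
    return depth == 0
```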

    The reason that a regexp can't do this is that it can't keep track of which opening brackets haven't been closed. A regexp has no memory of what it's already seen. All it knows is what state it's in now, and what token is coming next. OK, some programming languages implement regexps in such a way as to provide some sort of memory of what's been seen, but these usually feel like kludges.

    If you're prepared to put up with an arbitrary limit on how deeply you can nest brackets, then you can solve the bracket-matching problem with an automaton that has N states, numbered 1 to N. If the automaton is in the state numbered x, that means that it's seen x-1 opening brackets that haven't been closed yet, so state 1 means nothing is open. The instructions for each state would be "if you see an opening bracket, go to state x+1; if you see a closing bracket, go to state x-1; and if you see the end of the string, it isn't balanced." Exceptions would be that in state 1, if you see the end of the string, it's balanced, and if you see a closing bracket, it isn't balanced. In state N, if you see an opening bracket, the brackets are nested too deeply.
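
    The automaton can be sketched in a few lines of Python. This is one possible reading of the description, taking state 1 to mean that no brackets are open (as the end-of-string rule for state 1 implies); the function name and the three result strings are made up for the example.

```python
def bounded_balanced(s, n_states):
    """Run the N-state automaton; return 'balanced', 'unbalanced' or 'too deep'."""
    state = 1  # state 1: no brackets currently open
    for ch in s:
        if ch == "(":
            if state == n_states:
                # No state left to move to: nesting exceeds the limit.
                return "too deep"
            state += 1
        elif ch == ")":
            if state == 1:
                # Closing bracket with nothing open.
                return "unbalanced"
            state -= 1
    # End of string: balanced only if we are back in state 1.
    return "balanced" if state == 1 else "unbalanced"
```

    With n_states set to 3, the string (()()) is accepted, while ((())) is reported as too deep even though it is balanced. That is the arbitrary limit at work.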

    Of course, no theoretical computer scientist would ever accept arbitrary limits on how deeply a structure could be nested, which is why you would use a context-free (aka type 2) grammar to solve problems like this one.