Slashdot Mirror


Next Generation Regexp

prostoalex writes "Jeffrey E. F. Friedl, author of the newly published 2nd edition of Mastering Regular Expressions, wrote a feature article for O'Reilly Network on the recent innovations in the regular expression world. You'd think that an area such as regular expressions would be fairly stable, but according to the author, 'when I started to work on the second edition of Mastering Regular Expressions and started refocusing on the field, I was rather shocked to find out how much had really changed'. The article's behind-the-scenes purpose is apparently to push a new book that O'Reilly published this month, but it has great educational value for anyone involved with practical extracting and reporting."

248 comments

  1. .NET regexps and Microsoft's documentation by Jobe_br · · Score: 4, Insightful

    I particularly like this bit:

    A full chapter on .NET-specific regex issues helps to clarify things, and helps to make up for the exceedingly poor documentation that Microsoft provides with the package.
    Nice to see that things haven't changed much ;)
    1. Re:.NET regexps and Microsoft's documentation by malraid · · Score: 0, Redundant

      Well... Hopefully the docs will improve once the Regex Designer for Visual Studio ships ...

      or maybe a regex wizard?

      --
      please excuse my apathy
    2. Re:.NET regexps and Microsoft's documentation by Rui+del-Negro · · Score: 4, Funny

      Microsoft's documentation reads like a novel compared to IBM's. The typical IBM manual has the following format:

      PAGE 1:

      [COMMAND1] is executed by typing the word [command1] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      [COMMAND2] is executed by typing the word [command2] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      [COMMAND3] is executed by typing the word [command3] followed by the argument string, followed by enter. The argument string consists of a sequence of non-whitespace characters separated by whitespace characters.

      PAGE 2:

      THIS PAGE IS INTENTIONALLY LEFT BLANK

      ...and so on and so on.

      Regarding this last IBM tradition (that others have tried to copy but few have truly mastered), the Spruce DVD Maestro manual has a page with the following text:

      Blank page.
      (mostly)

      RMN
      ~~~

    3. Re:.NET regexps and Microsoft's documentation by DavidJA · · Score: 1, Troll

      Nice to see that things haven't changed much ;)

      I don't know how you got modded Insightful, I think Troll is a more accurate description.

      Microsoft has the best doco of ANY software development company.

      MSDN Library is the best single reference for everything Microsoft.

      Take a look at it some time.

    4. Re:.NET regexps and Microsoft's documentation by Baki · · Score: 2

      MSDN library is like a garbage can: lots of unstructured stuff inside, can't find anything.

      Yes, you can find tutorials, examples and the like, but no FORMAL references and specifications.

      I hate programming by example without knowing exactly what I'm doing and knowing what is inside and what is outside the spec. And lots of documentation only costs time.

      In that respect, the Java documentation is excellent. A concise specification, yet very readable and useful to use during day-to-day programming.

    5. Re:.NET regexps and Microsoft's documentation by AnalogBoy · · Score: 2, Troll

      The only computer books i've ever read which actually read well were "Upgrading and Repairing PC's" (So much so i wrote the author) and "The practice of system and network administration".

      If only all books could be written as well.. *sigh*...

      In-depth... summary. In-depth... Summary.

    6. Re:.NET regexps and Microsoft's documentation by Anonymous Coward · · Score: 0

      Now that Spruce Tech was bought by Apple, they updated that to read:

      Blank page
      (but completely different from blank pages in PC manuals)

    7. Re:.NET regexps and Microsoft's documentation by ergo98 · · Score: 1

      What the hell are you talking about? Clearly you have absolutely ZERO experience with the MSDN Library, but instead you just saw an avenue to spout some pro-Java, anti-MS BS.

      The MSDN Library is very intelligently structured into a hierarchy of logical categories and subcategories, and if you have the local app you can limit your searches to specific branches (or combinations of branches). The documentation is, err, "Formal" references and "specifications", and actually has too FEW examples (nice attempt at trying for the highbrow "Java programmers SMRT, MS programmers dumb". It looks stupid, though, given the grossly incorrect representation of how the MSDN Library works).

      I actually agree with the original guy that the regular expression portion of Microsoft's documentation is weak, and I would say rightly so: that is something that can be a science unto itself, and the Microsoft documentation couldn't do it justice without dedicating basically a book to it. Instead they give you an intro and point you to other resources. The correct choice in my opinion.

    8. Re:.NET regexps and Microsoft's documentation by Anonymous Coward · · Score: 1, Interesting

      > Microsoft has the best doco of ANY software development company.

      ROTFL! Clearly you've never seen any DEC software manuals. "ANY" is more than a little bit too strong.

    9. Re:.NET regexps and Microsoft's documentation by dillon_rinker · · Score: 2

      Who are you and where did you get my brain?

      The first made me a tech; the second is making me an admin. All the books I've read in between have been MS GUI crap or warmed over help files and TechNet articles.

      (Yeah, yeah, I know it ain't Linux...but it pays the bills)

    10. Re:.NET regexps and Microsoft's documentation by rjamestaylor · · Score: 1

      Philip and Alex's Guide to Web Publishing still ranks as my favorite "computer book" -- one that not only covered technical issues (granted, not at tremendous depth) but theoretical and inspirational ones as well. It's the book that turned me from an application developer to a ... well, whatever I am now :) ...

      --
      -- @rjamestaylor on Ello
    11. Re:.NET regexps and Microsoft's documentation by Tablizer · · Score: 1, Offtopic
      Seen on an IBM Bimbo's forehead:

      This Mind is Intentionally Left Blank

    12. Re:.NET regexps and Microsoft's documentation by Rui+del-Negro · · Score: 2

      Actually, this is off-topic, but I remember reading a list of "possible descriptions for a white sheet of paper" and one of them was "all the original ideas that IBM engineers have ever come up with". Which isn't entirely fair, of course.

      Another description was "Microsoft's moral guidelines". Which is also not entirely fair. Microsoft does have one guideline concerning morals: "if you want to make it in this company, get rid of them".

      RMN
      ~~~

    13. Re:.NET regexps and Microsoft's documentation by SN74S181 · · Score: 1

      A tech? That book makes you a screwdriver operator. I couldn't find the chapter about wire wrap, and there wasn't squat about breadboard grounding techniques?

    14. Re:.NET regexps and Microsoft's documentation by Tablizer · · Score: 2

      (* Actually, this is off-topic, but I remember reading a list of "possible descriptions for a white sheet of paper" and one of them was "all the original ideas that IBM engineers have ever come up with". Which isn't entirely fair, of course. *)

      The best way to think about the complaint is that the ratio of ideas to company size was allegedly lower than "normal". For example, IBM might have been at the time 90 percent of the computer market, but made only about 50 percent of all innovative ideas.

      But DEC and Intel and a little bit of Gov-scare eventually changed all that.

    15. Re:.NET regexps and Microsoft's documentation by benhaha · · Score: 1

      Troll, or Flamebait, not insightful.

      --
      NO ID: BEING FREE MEANS NOT HAVING TO PROVE IT
    16. Re:.NET regexps and Microsoft's documentation by broody · · Score: 1

      Informix documentation remains amazing, despite the IBM buy out. IBM hasn't affected the quality at all.

      --
      ~~ What's stopping you?
    17. Re:.NET regexps and Microsoft's documentation by AnalogBoy · · Score: 2

      That book made me a tech in '98, along with some other studying it got me my A+. I'm now a UNIX Admin, and currently re-reading that book, just for fun. It's interesting to compare the Intel processors with a Sparc or RS/6000 on an internal level..

      Oh, and Linux has about a 10% chance of paying your bills. Have a major enterprise skill set, say, Windows or AIX or Solaris.. and have Linux as a secondary skill. Things may be different where you live, but in midsouth.us, Linux is currently considered "dot-commish" and corps are steering away from it. Things are bound to change when the Itanium comes out and PC Unix means something again.

    18. Re:.NET regexps and Microsoft's documentation by Zathrus · · Score: 2

      What really amazes me is how IBM manages to mangle man pages.

      Apparently the traditional man pages weren't down to IBM standards, so IBM actually paid someone to rewrite them.

      In order to get man pages that actually have useful information I now have to surf the web. The ones included with AIX 4.3 are so damn useless and content-free that they're actually misleading at times.

  2. When is RegExp2 Going To Be Shipped by N8F8 · · Score: 3, Informative

    Amazon has slipped the shipping date twice. I don't know about you, but this book is definitely a "Must Have".

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
    1. Re:When is RegExp2 Going To Be Shipped by duck_prime · · Score: 4, Funny
      Amazon has slipped the shipping date twice.

      Yes, and that makes me want to use a decidedly irregular expression:
      #@*$^&@#$&#!!!
    2. Re:When is RegExp2 Going To Be Shipped by Anonymous Coward · · Score: 0

      How is the parent offtopic? I miss the days when Slashdot's editors actually understood plain English...

    3. Re:When is RegExp2 Going To Be Shipped by Anonymous Coward · · Score: 0

      I hope I meet this one in meta-mod...

    4. Re:When is RegExp2 Going To Be Shipped by Anonymous Coward · · Score: 0

      You do know that the editors don't moderate, right?

  3. indeed by tps12 · · Score: 0, Troll

    Regexps are interesting, sure. Every CS student enjoys (or suffers through!) the regexp section of their Intro to Computability (or equivalent) course. And it is pretty fun thinking about the expressive power of, say (a|b)*a*b*.

    However, we have to face the facts, that regexps, as good as they are from a mathematical standpoint at matching things, just aren't that helpful in sorting through the sea of data that is the Internet. The input data just aren't orderly enough for regexps to be of any use.

    What has become useful is what Google taps into. And that is the human aspect. Data isn't important because it matches a*(b|c)a*. It's important because it is useful to people. Think about it: when you are looking for wares or porn, where do you go? Perl? Nope. IRC. Why? Because of the human element.

    That is why research into regexps is doomed to failure. It is a dead end. From a theoretical standpoint, regexps are cute and interesting, but for serious data prowling, you need something with a brain and a heart.

    --

    Karma: Good (despite my invention of the Karma: sig)
    1. Re:indeed by Kwikymart · · Score: 2, Funny

      "Think about it: when you are looking for wares or porn, where do you go? Perl? Nope. IRC. Why? Because of the human element ... but for serious data prowling, you need something with a brain and a heart."

      A heart for porn?

      --

      Buying a Dell computer is equivalent to dropping the soap in a prison shower.
    2. Re:indeed by Christianfreak · · Score: 3, Informative

      Who said anything about the Internet? Honestly I didn't read the article, but I do have the first version of the book they are talking about, and it has nothing to do with the Internet but rather with pattern matching in programming.

      That is why research into regexps is doomed to failure. It is a dead end. From a theoretical standpoint, regexps are cute and interesting, but for serious data prowling, you need something with a brain and a heart.

      While I agree that for large amounts of data you need something other than a regex, that certainly doesn't mean that regexes are dead or that we shouldn't try to make them better! I don't need Google's search algorithm to make sure my users' input matches certain parameters, and I would really hate to have to write

      if $input contains really_evil_characters() die;

      Regex is here to stay
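
      The kind of input check described above can be sketched roughly as follows. This is in Python rather than Perl, and the allowed character set, the length limit, and the name is_safe_input are all assumptions made up for illustration, not anything from the book or the article.

```python
import re

# Hypothetical whitelist: letters, digits, underscore, dot, hyphen; 1 to 32 chars.
SAFE_INPUT = re.compile(r"[A-Za-z0-9_.-]{1,32}")


def is_safe_input(text: str) -> bool:
    """Return True only if the whole string consists of whitelisted characters."""
    # fullmatch anchors the pattern at both ends, so stray characters
    # anywhere in the input cause a rejection.
    return SAFE_INPUT.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_safe_input("user_name-01"))
    print(is_safe_input("rm -rf /; echo"))
```

      Listing what is allowed, rather than hunting for "really evil" characters, tends to be the safer design, since anything unexpected is rejected by default.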

    3. Re:indeed by kdorff · · Score: 2, Insightful

      Ummm. Are you a programmer? Sure, you don't need regexps to solve every problem (and probably don't need them for MOST problems), but there are many problems that are solved so much more elegantly WITH regexps than without that once you understand them, IF you are a programmer, you wouldn't give them up. They are an invaluable tool in a programmer's toolkit.

    4. Re:indeed by bitweever · · Score: 1

      Who modded this guy up?

      Regex's aren't useful because you don't use them on the Internet? What kind of argument is that?

    5. Re:indeed by pizza_milkshake · · Score: 1
      That is why research into regexps is doomed to failure. It is a dead end.

      yeah... just look at perl, a language that is tightly tied to regular expressions. it's only one of the most-used programming languages on the planet.

      i don't mean to be a regexp or perl nazi, but i've found regexp's to be nothing but useful for the times when a split-by-whitespace isn't enough and a full-fledged parser is too much.

      regexp's won't solve world hunger or cure cancer, but last time i checked there were a lot of folks using perl and grep.
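
      A small sketch of that middle ground, in Python: a line where splitting on whitespace is not enough but a full parser would be overkill. The key=value log format, the sample line, and the name parse_pairs are hypothetical and exist only for this example.

```python
import re

line = 'user=alice msg="disk almost full" code=17'

# Splitting on whitespace breaks the quoted value into separate pieces.
naive = line.split()

# Hypothetical format: key=value, where value is either "quoted" or a bare token.
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_pairs(text):
    """Return a dict of the key/value pairs found in text."""
    result = {}
    for m in PAIR.finditer(text):
        # Group 2 holds a quoted value, group 3 a bare one.
        value = m.group(2) if m.group(2) is not None else m.group(3)
        result[m.group(1)] = value
    return result


fields = parse_pairs(line)

if __name__ == "__main__":
    print(naive)
    print(fields)
```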

    6. Re:indeed by platypus · · Score: 5, Informative

      [talk about regexps are not so useful..., but ...] What has become useful is what Google [google.com] taps into. And that is the human aspect. Data isn't important because it matches a*(b|c)a*. It's important because it is useful to people. Think about it: when you are looking for wares or porn, where do you go? Perl? Nope. IRC. Why? Because of the human element.

      I understand your thinking.
      But your thinking is wrong.
      Think about it (no pun intended).
      How much better would google be if one could use regexps in one's search request.
      regexp and datamining are orthogonal.

    7. Re:indeed by RisingSon · · Score: 3, Funny
      Hehe...thanks for your funny post.

      Regexps are interesting, sure.

      Not really. I use them all the time and the only time they are interesting is when you're done and they look completely silly.

      Every CS student enjoys (or suffers through!) the regexp section of their Intro to Computability (or equivalent) course.

      Not really. I got a degree in Computer Engineering from the #2 private engineering school in the country and I was never taught regex. If you know how to program and not just crank out syntax, you can pick up regex on your own pretty fast.

      And it is pretty fun thinking about the expressive power of, say (a|b)*a*b*

      That is actually a really boring regex. Lots of a's or b's followed by lots of a's followed by lots of b's. Wow. My brain is fried.

      However, we have to face the facts, that regexps, as good as they are from a mathematical standpoint at matching things, just aren't that helpful in sorting through the sea of data that is the Internet.

      Wow. You're probably right. I'll bet nothing that searches for things on the internet, such as google.com, uses any regex internally in their code. Now that I'm facing the facts, you're right, regex is worthless when it comes to searching through any amount of data.

      The input data just aren't orderly enough for regexps to be of any use.

      Yeah, regex is best used for very very simple patterns. Anything more complex than your above example is best suited for some serious hand-parsing in visual basic.

      Think about it: when you are looking for wares or porn, where do you go? Perl? Nope.

      I don't know WTF you're talking about. I find ALL my porn at www.perlmonks.org

      That is why research into regexps is doomed to failure.

      Yeah, I should probably throw away all that perl regex code I've written that's made my company lots (and I mean lots) of money in the market. It is doomed. I should be writing my pattern matching code in the google.com language.

      Thank you for posting about something you apparently know very little about. Good for an afternoon giggle.

    8. Re:indeed by Anonymous Coward · · Score: 0

      grep and sed are all the regexps I need.

    9. Re:indeed by Anonymous Coward · · Score: 0
      That is actually a really boring regex. Lots of a's or b's followed by lots of a's followed by lots of b's. Wow. My brain is fried.

      Apparently. '*' doesn't mean 'lots of', it means 'zero or more'. And if your brain wasn't fried, you could have simply stated that it matches any string containing only a's and/or b's (or the null string). And you could have pointed out that the a*b* part was superfluous.
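
      The claim that the a*b* tail is superfluous can be checked by brute force. The following is only a sketch in Python: it compares the two patterns on every string up to a small length over the letters a, b, c, which is evidence rather than a proof, and the name same_language is made up for this example.

```python
import re
from itertools import product

LONG = re.compile(r"(a|b)*a*b*")
SHORT = re.compile(r"(a|b)*")


def same_language(max_len: int = 8) -> bool:
    """Check that both patterns accept exactly the same strings up to max_len."""
    for n in range(max_len + 1):
        # 'c' is included so that some candidate strings must be rejected.
        for chars in product("abc", repeat=n):
            s = "".join(chars)
            if bool(LONG.fullmatch(s)) != bool(SHORT.fullmatch(s)):
                return False
    return True


if __name__ == "__main__":
    print(same_language())
```

      The reasoning behind the result is short: (a|b)* already accepts every string of a's and b's, including the empty string, so appending a*b* cannot add anything new.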

      I got a degree in Computer Engineering from the #2 private engineering school in the country

      Yeah right. That's why you can't parse a regex or close your italics. Care to name the school? And #2 by whose standards?

    10. Re:indeed by Anonymous Coward · · Score: 0

      *giggle* Like you're missing the tag?

      Asshole.

    11. Re:indeed by amchugh · · Score: 1
      regexp and datamining are orthogonal.

      This is exactly why I miss Alta Vista advanced search. I could frequently drill down with much more precision than Google will allow. The only reason I switched to Google was because they started indexing more pages than AV, and Alta Vista eliminated the all text version of their search page.

    12. Re:indeed by Anonymous Coward · · Score: 0

      You're already looking pretty crispy, but I might as well chime in, too.

      I got a degree in Computer Engineering from the #2 private engineering school in the country and I was never taught regex.

      Note that the original poster was discussing Computer Science, which includes such topics as "pure" regular expressions, or regular expressions as a mathematical concept. While your school is, I'm sure, a fine vocational institution, it is actually quite certain that no one could be said to have a firm grounding in CS without having learnt regular expressions, context-free grammars, and Turing machines.

      If you get the opportunity, you might actually want to check out a Computability class, or at least pick up a textbook. It's fascinating stuff, and reveals that the difference between the expressive power of REs versus CFGs (or Visual Basic, to use your example) has nothing to do with "complexity" (in the loose sense in which you use it), and everything to do with well-known and deep mathematical truths.

    13. Re:indeed by Anonymous Coward · · Score: 0

      Altavista didn't eliminate the all-text search page.
      http://www.altavista.com/text

    14. Re:indeed by psaltes · · Score: 2

      > Not really. I got a degree in Computer Engineering from the #2 private engineering school in the country and I was never taught regex. If you know how to program and not just crank out syntax, you can pick up regex on your own pretty fast.

      To be a little pedantic the original poster probably meant being taught regular expressions in a formal language theory framework, where one talks about properties of computability. The same course would teach things like finite state machines (which in terms of computability are equivalent to classical regular expressions though I think not to perl regexps), context free grammars (pushdown automata) and turing machines, and just general computability (and maybe complexity) theory. All of these things have a great deal to do with how programs work, and the lack of such theory is probably actually one of the drawbacks of doing something like computer engineering over computer science. (at least from my perspective)
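
      One way to see why Perl-style regexps go beyond classical regular expressions is the backreference. The sketch below uses Python's re module, which supports the same feature; the pattern and the name is_mirrored are chosen only for illustration. The language "n a's, then b, then the same number of a's" is not regular, so no finite state machine accepts it, yet a backreference handles it directly.

```python
import re

# (a+) captures a run of a's; \1 requires the very same run to appear again.
MIRROR = re.compile(r"(a+)b\1")


def is_mirrored(s: str) -> bool:
    """Return True if s is n a's, one b, then exactly n a's again (n >= 1)."""
    return MIRROR.fullmatch(s) is not None


if __name__ == "__main__":
    print(is_mirrored("aabaa"))
    print(is_mirrored("aabaaa"))
```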

    15. Re:indeed by BJH · · Score: 1

      I think he meant "#2 from the bottom".

    16. Re:indeed by WWE-TicK · · Score: 1

      The computer engineering program at my school, UMBC, is exactly the same as the computer science program except you are required to take additional courses which would otherwise count as elective credit if you took CS (stuff like more physics and an electrical engineering course).

    17. Re:indeed by tzanger · · Score: 2

      • And it is pretty fun thinking about the expressive power of, say (a|b)*a*b*
      That is actually a really boring regex. Lots of a's or b's followed by lots of a's followed by lots of b's. Wow. My brain is fried.

      Actually that regexp matches any text at all. * is 0 or more matches, not one or more. Personally I think the really interesting regexps use lookaheads but that's just me.
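
      Whether the pattern "matches any text" depends on whether the match is anchored. A rough Python sketch of both readings, plus a small lookahead, follows; the sample strings and the px example are assumptions for illustration.

```python
import re

PATTERN = re.compile(r"(a|b)*a*b*")

# Unanchored search succeeds on any text, because every starred part may
# match zero characters: the result is an empty match at position 0.
found = PATTERN.search("xyz")

# Requiring the whole string to match rejects text containing other letters.
whole = PATTERN.fullmatch("xyz")

# Lookahead: match a run of digits only when "px" follows, without
# consuming the "px" itself.
PX = re.compile(r"\d+(?=px)")
sizes = PX.findall("width: 10px; height: 2em; margin: 5px")

if __name__ == "__main__":
    print(found.span(), whole, sizes)
```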

    18. Re:indeed by Isle · · Score: 1

      but for serious data prowling, you need something with a brain and a heart.

      Infact skip the heart, you just need something with a brain and common sense.

  4. Perl6 regular expressions - forget everything by Anonymous Coward · · Score: 3, Interesting

    Perl6 is going to radically change regular expressions as well. I guess the term "regular expression" is pretty vague/useless these days. You have to identify the language _and_ its revision to get an accurate idea of the regexp feature set you're dealing with. Just throw some variables and control structures into regexp and we'll have a full-blown extremely cryptic language. Maybe we need a RegExp Institute of Excellence with yearly meetings in Sweden or something.

    1. Re:Perl6 regular expressions - forget everything by cats · · Score: 1

      Just throw some variables and control structures into regexp and we'll have a full-blown extremely cryptic language.

      You just described AWK. /runaway!

    2. Re:Perl6 regular expressions - forget everything by jbolden · · Score: 1


      At least the bulk of version 1 wasn't language specific. I think most of version 1 would have applied equally well to Perl 5 or Perl 6 (the chapter on Perl syntax being an exception). The Perl stuff in version 1 was more Perl 4ish than Perl 5ish (since Perl 5 was new). I'd gather a good treatment of Perl 6 probably won't come until version 3.

    3. Re:Perl6 regular expressions - forget everything by pauljlucas · · Score: 1
      I guess the term "regular expression" is pretty vague/useless these days. You have to identify the language _and_ its revision to get an accurate idea of the regexp feature set you're dealing with
      That's true only if you confuse "regular expression" as the formal concept with implementation languages. The formal concepts for regular expressions are explained in gory detail in this book among other places. This stuff hasn't changed in decades.
      --
      If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
    4. Re:Perl6 regular expressions - forget everything by Abreu · · Score: 2

      just throw some variables and control structures into regexp and we'll have a full-blown extremely cryptic language.

      You just described AWK. /runaway!


      I thought that was the whole purpose of Perl! /run real fast!!

      Ok, so maybe I am exaggerating on purpose (it's called humor, folks... Don't shoot!)
      But it always seemed to me that Perl's cryptic quality mainly came from having "too many" variables and control structures
      (<joke> There's too many ways to do it? </joke>)

      Thank the Lord for blessing us with Guido

      --
      No sig for the moment.
  5. This has no educational purpose by Anonymous Coward · · Score: 3, Insightful

    Other than to tell us what is different between the two books. After reading the article I walked away with no general knowledge that was useful in using regular expressions, or what might be coming, or where we came from.

    It is a slightly wordy advertisement for why you should upgrade. The fact that it was foisted on us as something else annoys me, as I spent time reading it.

    I know, a slashdot reader that actually reads linked stories is such a minority, but come on, quit stuffing articles with advertising. Aren't the ads in the middle of a page enough?

    1. Re:This has no educational purpose by Anonymous Coward · · Score: 0

      Well, maybe, but to those of us who read (and loved) the first one, this was well and truly newsworthy.

      And since, as you say, the article doesn't say what is good about either book, all I can say is that programmers are split into two camps: those that 'get' regular expressions, can think in a variety of flavours of regex, and wield that power like a fine scalpel, both in the body of their code and in creating the code itself (all good programmers' editors (and a few crappy ones) have regex built in). And then there are the others. These are they who may be aware of regex, but they don't think in it, so they don't use it when they should, try to use it when they shouldn't, and when it doesn't work, blame the tool and go do things the incredibly awkward way.

      The bulk of those programmers I know in the former group have read and loved Mastering Regular Expressions. It really is the only text I know that deals exclusively with the subject, and yes, it really is that good.

      As far as advertising goes, well, I could give a bugger if QuickTime 6 is out (like I use it). If it were out for Linux/*BSD, then that would be something, but otherwise, who cares? News of a new version of this book, on the other hand, had me drooling in anticipation.

  6. what about perl 6? by jbennetto · · Score: 5, Interesting

    He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 and Synopsis 5.

    1. Re:what about perl 6? by tswinzig · · Score: 3, Interesting

      He doesn't even mention the radical changes to regexps in Perl 6, as described in the recent Apocalypse 5 [perl.com] and Synopsis 5 [perl.com].

      If you could write and use a Perl 6 program right now, maybe he'd include a chapter on it in his book.

      This article is basically an overview of his book. His book doesn't cover Perl 6 regex's. Why should it? Perl 6 isn't even done yet, and so everything new for Perl 6 could change by the time it comes out.

      --

      "And like that ... he's gone."
    2. Re:what about perl 6? by Anonymous Coward · · Score: 0

      Let's not forget that Larry Wall gets money from O'Reilly. So, they've exhausted every Perl topic in one book or another... what to do? Get him to rewrite the entire language so they can sell more books!

    3. Re:what about perl 6? by Weird+Dave · · Score: 1

      Well, many people have written small snippets of code for perl 6 that could easily be executed, were the language finished. It might make an interesting chapter in the book, but it would be academic since nobody can run it, and the spec might even change.

      --

      Grumble, Grumble
    4. Re:what about perl 6? by ajs · · Score: 2

      If you could write and use a Perl 6 program right now, maybe he'd include a chapter on it in his book.

      heh.

    5. Re:what about perl 6? by embobo · · Score: 1

      At a talk last night Damian said that perl6 is expected in about 18 months.

    6. Re:what about perl 6? by tswinzig · · Score: 2

      From the email:

      it's now Turing-complete, if you have a Parrot engine and a bit of spare time. Call it a primitive "demo version" of some of Perl 6's features.

      So I reiterate... "if you could write and USE a Perl 6 program right now, maybe he'd include a chapter on it in his book."

      heh.

      --

      "And like that ... he's gone."
    7. Re:what about perl 6? by ajs · · Score: 2

      And you can. Your definition of use involves production deployment, does it? Authors of software-related books are well used to using pre-alpha versions of software for research material. I'm sure he would not have as hard a time as you think.

  7. All Hail the Cool-Owl Book! by Icepick_ · · Score: 0, Offtopic

    These owls are much, much cooler than this one.

  8. Contentless article by Shevek · · Score: 2, Insightful

    That is one of the most contentless articles I have seen in a long time.

    A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.

    The smartarses will now proceed to point out that
    a) Perl is actually limited type 2
    b) Some change no one knows or cares about was made to some definition of the Chomsky hierarchy in nineteen dumdy-dum.

    Foo.

    1. Re:Contentless article by Get+Behind+the+Mule · · Score: 5, Insightful
      That is one of the most contentless articles I have seen in a long time.

      A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.
      You get a B-, Bunky. And here's your cookie.

      After you've finished your undergrad CS theory class, you might go on to discover that implementations of regexes under various paradigms and in the various languages have extremely rich variety regarding syntax, semantics and efficiency. This isn't about the pristine theory of Prof. Chomsky, but about the actual use of regexes as programming constructs, and that's a tremendously complex subject. Friedl's book in the first edition is one of the best I've ever seen that has tackled such complexity and made it accessible and useful for the everyday business of programming.

      The article indicates that the practical use of regexes, far from stagnating since Chomsky's time, continues to evolve and grow. That's only "contentless" if you're stuck in the ivory tower and don't intend to leave.
    2. Re:Contentless article by Anonymous Coward · · Score: 0

      this was mod'ed up? And insightful? It's at least "Troll" if not "Flamebait".

    3. Re:Contentless article by iabervon · · Score: 2

      Regular expressions aren't theoretically interesting anymore. Regexps, in the sense of a way of specifying regular (and some non-regular) expressions, show significant change over time. In much the same way, English isn't theoretically different from Indo-European, but you won't get very far using only Indo-European these days.

    4. Re:Contentless article by You'reAFuckingMoron · · Score: 1
      Obviously, the semantics of regular expressions haven't changed one jot or iota in the last umpteen years. As you point out, what Perl calls "regular expressions" actually aren't, but the limited type 2 grammar that Perl calls "regular expressions" should be immediately recognizable to anyone who's done any text processing in the last 20 years or so.

      But books like this one really aren't about the semantics of regular expressions. It's a cookbook, full of syntax for the different regular expression engines built into a bunch of languages and libraries. Those have changed a lot over the last few years. It's also full of examples of regular expressions, along with a lot of concrete instructions on how to write and read the beasts.

      For the kind of programmer who can tell, just by reading the "perldoc regex" manpage, that Perl regular expressions are really a limited type 2, there really isn't much to find useful in one of these books. For these programmers, some of the example regular expressions are nice, but they're probably not worth $30.

      But there are a lot of coders out there who couldn't tell you what a type 3 grammar is to save their lives, and their eyes generally glaze over when they're trying to read the regex documentation for their favorite language. For those coders, a book like this is a godsend, especially if they're doing any amount of text processing. The new version, which describes the current syntax of the languages they're most likely to be using, is probably going to be more useful to these coders than the old version, which describes the old syntax of some other languages.

      The article is only contentless because when you read "regular expression", you think of the semantics of regular expressions, and the syntax used by different regular expression engines is trivial and simple to you. To the people targeted by this book, the syntax is the difficult part of the regular expression. They simply don't have the mathematical sophistication to even think of the semantics of the things in a general way.

      --
      What a fabulous troll your post was.... or how fabulously stupid you are. It's impossible to tell.
  9. at some point... by g4dget · · Score: 4, Interesting
    Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions--a simple built-in parser generator with executable annotations is both clearer and more powerful. Parser generator syntax allows comments, whitespace, with a simple, fairly standard syntax.

    Perl and other languages should leave "good enough" alone when it comes to regular expressions and instead just make it easy to put chunks of grammars into programs.

    1. Re:at some point... by joshv · · Score: 4, Insightful

      Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions--a simple built-in parser generator with executable annotations is both clearer and more powerful. Parser generator syntax allows comments, whitespace, with a simple, fairly standard syntax.

      Yes, regular expressions should be used to find particular patterns in text and perform basic manipulations on them. Beyond a certain point of complexity it really doesn't make sense to perform more complex manipulations. Get the information you want out of the string using a regular expression, then manipulate it in code.
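
      A minimal sketch of that "extract with a regex, then manipulate in code" split, in Python. The log line format and the `summarize` helper are made up for illustration; the only point is that the pattern does nothing but capture, and the arithmetic happens in ordinary code.

```python
import re
from datetime import date

# Hypothetical log line format, invented for this example.
LINE = "2002-07-19 backup finished in 347 seconds"

# The regex only locates and captures the pieces of interest.
PATTERN = re.compile(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2}) .* in (?P<secs>\d+) seconds"
)

def summarize(line):
    """Return (date, minutes, seconds) for a matching line, else None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Everything after extraction is ordinary code, not more regex.
    day = date(int(match.group("y")), int(match.group("m")), int(match.group("d")))
    minutes, seconds = divmod(int(match.group("secs")), 60)
    return day, minutes, seconds

print(summarize(LINE))
```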

      One has a feeling that regexp engines are just becoming programming languages in and of themselves - the only difference being that the 'program' consists of a string of cryptic single character commands, and the input is limited to a single string.

      -josh

    2. Re:at some point... by edrugtrader · · Score: 1, Flamebait

      so... you just started programming perl eh?

      go back to appletalk... FREAK! ;)

      --
      MARIJUANA, SHROOMS, X: ONLINE?! - E
    3. Re:at some point... by Lumpish+Scholar · · Score: 2
      Beyond a certain degree of complexity, it really doesn't make much sense anymore to use regular expressions... Parser generator syntax allows comments, whitespace, with a simple, fairly standard syntax ...
      ... and (as you'd certainly know if you'd read either edition of Friedl's book) that's also true of Perl 5 "regular expressions"; and Friedl strongly encourages you (e.g., by example) to write complicated regular expressions that way.
      --
      Stupid job ads, weird spam, occasional insight at
    4. Re:at some point... by Anonymous Coward · · Score: 0

      Maybe not to you. I've got no problem if you want to write a whole bunch of code -- but don't take away regex's from the rest of us who want to match complex patterns with a single line. Our lives are easier for it.

    5. Re:at some point... by Anonymous Coward · · Score: 1, Interesting
      Perl and other languages should leave "good enough" alone when it comes to regular expressions and instead just make it easy to put chunks of grammars into programs.

      heh.

      To be slightly more detailed, you cite the following limitations in regular expressions:
      • Comments - Perl5 allows comments in regular expressions, but the syntax is clunky. The Perl6 general comment syntax and regular expression comment syntax will be unified and simple.
      • Whitespace - In Perl5 you can already use whitespace to format your regular expressions by using the "x" modifier. In Perl6 this will be the non-optional default (which makes more sense given unicode anyway)
      • Simple syntax - I'd much rather have a rich syntax and simple code, personally.
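
      For anyone who hasn't seen the commented, whitespace-friendly style mentioned above, here is a rough sketch. It uses Python's `re.VERBOSE` flag as a stand-in for Perl 5's "x" modifier (the two behave similarly for this purpose); the float pattern itself is just an illustrative choice, not something from the book.

```python
import re

# re.VERBOSE plays the role of Perl's /x here: whitespace in the pattern is
# ignored and '#' starts a comment that runs to the end of the line.
FLOAT = re.compile(r"""
    [-+]?                   # optional sign
    (?: \d+ \. \d*          # digits, a dot, optional fraction
      | \. \d+              # or a leading dot followed by digits
      | \d+                 # or a plain integer
    )
    (?: [eE] [-+]? \d+ )?   # optional exponent
""", re.VERBOSE)

for text in ["3.14", "-.5e10", "42", "abc"]:
    print(text, bool(FLOAT.fullmatch(text)))
```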
    6. Re:at some point... by Pseudonym · · Score: 3, Funny
      One has a feeling that regexp engines are just becoming programming languages in and of themselves [...]

      Not true. Yet.

      Perl 5 regexes can solve NP-hard problems, but they're not quite Turing complete. However, they require only four additional stack operators to do that.

      Personally, I'm waiting for the first Perl regex to become sentient.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    7. Re:at some point... by Anonymous Coward · · Score: 0

      Are you reading impaired? He didn't say "take away" he said "leave them alone". They are good enough the way they are. Making them any more complex makes them less useful.

    8. Re:at some point... by flonker · · Score: 2, Funny

      Feb 20, 2042 - The day that the first true sentient artificial intelligence is created.
      Feb 21, 2042 - The day it gets converted into a Perl one-liner.

    9. Re:at some point... by g4dget · · Score: 2

      The point is that I don't want Perl6 regexps. I'm happy with Perl5 regexps. For the things that are hard to do with Perl5 regexps, I don't want an even more complicated regexp package, I want a simpler parser generator.

  10. regexp and programmers by revscat · · Score: 4, Insightful

    Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. I don't care what language(s) you're proficient in, or if you've memorized every single design pattern the GoF has ever conceived, or do 4 foot by 6 foot UML diagrams in your head. If you can't do regexps then you're missing a basic skill. I bought Friedl's book a couple of years ago, and although I wound up not using many of the Perl-related stuff the rest of the book helped me out immensely.

    A programmer without knowledge of regular expressions is like a carpenter without a hammer.

    1. Re:regexp and programmers by Anonymous Coward · · Score: 4, Insightful

      A programmer without knowledge of regular expressions is like a carpenter without a hammer.

      If ever there was an apt analogy of regular expressions - that's it! They make everything seem like a nail ;).

    2. Re:regexp and programmers by Jonny+Ringo · · Score: 0, Offtopic

      my $poophead = "Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. I don't care what language(s) you're proficient in, or if you've memorized every single design pattern the GoF has ever conceived, or do 4 foot by 6 foot UML diagrams in your head. If you can't do regexps then you're missing a basic skill. I bought Friedl's book a couple of years ago, and although I wound up not using many of the Perl-related stuff the rest of the book helped me out immensely.
      A programmer without knowledge of regular expressions is like a carpenter without a hammer.";

      $poophead =~ s/\bI\b/me the poophead/g;

      print "$poophead\n";

    3. Re:regexp and programmers by Anonymous Coward · · Score: 0

      I agree that regexps are nice, and a good thing to learn.

      But I more firmly believe that a programmer that has not mastered C and an assembly language is missing out.

    4. Re:regexp and programmers by Anonymous Coward · · Score: 0

      Perl is for motherfuckers. You're wasting so many cycles. Type this at the shell:

      cat << "EOF" | sed -e 's/I /me the poophead /g'
      Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. I don't care what language(s) you're proficient in, or if you've memorized every single design pattern the GoF has ever conceived, or do 4 foot by 6 foot UML diagrams in your head. If you can't do regexps then you're missing a basic skill. I bought Friedl's book a couple of years ago, and although I wound up not using many of the Perl-related stuff the rest of the book helped me out immensely.

      A programmer without knowledge of regular expressions is like a carpenter without a hammer.
      EOF

    5. Re:regexp and programmers by pHDNgell · · Score: 2, Troll

      Perhaps if you are looking for perl programmers who will need to be doing a lot of textual processing, but that's definitely not the case in other areas.

      I prefer to work with people who don't do a lot of regex, because they're less likely to use them for everything. I haven't worked on a large project that used regular expressions in years. I feel pretty good about that.

      Sure, I've used them in a couple small scripts for parsing text, but if you see the majority of programming requiring regex, you definitely need to put your hammer down and pick up a Makita.

      --
      -- The world is watching America, and America is watching TV.
    6. Re:regexp and programmers by revscat · · Score: 3, Interesting

      Sure, I've used them in a couple small scripts for parsing text, but if you see the majority of programming requiring regex, you definitely need to put your hammer down and pick up a Makita.

      Well, I am certainly not advocating the broad use of regexps in application programming, even though it has been demonstrated to be possible. For me, regexps are an important tool in solving side issues/behind the scenes work, such as formatting a series of configuration files in a given manner, or making broad changes to a set of HTML files, and so forth. I don't do Perl, and don't really like to if I can avoid it, but I still use regular expressions on a daily basis, and have found them to be immensely helpful.

    7. Re:regexp and programmers by Anonymous Coward · · Score: 0

      Ditto! Manipulations across multiple configuration or HTML files. (Personally I use ed because it's handy.)

    8. Re:regexp and programmers by AndrewHowe · · Score: 2

      I know regular expressions, but funnily enough I almost never need them. Occasionally I will do a regexp search if an exact search is not good enough. I don't have Perl installed, and I can't say I have ever needed it.
      I guess they are OK if you do a shitload of text processing, but my philosophy is that data should be processed in native (i.e. binary) form and text should only be used for interchange purposes. Even in that case, you can use text "protocols" such as XML, for which regexps are useless. So... If you have a buttload of (fairly) unstructured data to import... Knock yourself out. It doesn't happen to me. Text processing just isn't an issue for me. I don't think that makes me any less of a coder. My domain is simply different to yours.

    9. Re:regexp and programmers by mcrbids · · Score: 2

      HEAR YE, HEAR YE!

      You speak WISDOM...

      I remember a while back, one of my clients needed to move a bunch of dns records from one server to another. Took me ~ 45 minutes to write a php shell script using REGEX to create new bind zone records for over 300 domains, and convert them - records intact, complete, ready to restart named.

      This poor client had paid somebody else to do it, they spent several DAYS at it and there were still lots of (human) mistakes.

      And, this wasn't complicated stuff!

      Any programmer who doesn't know regex is crippled!

      --
      I have no problem with your religion until you decide it's reason to deprive others of the truth.
    10. Re:regexp and programmers by Gnulix · · Score: 1

      cat

      Unnecessary use of cat - shame on you!!!

    11. Re:regexp and programmers by You'reAFuckingMoron · · Score: 1
      Strange. I use finite automata all the time. Heck, the state diagram is, to me, one of the most useful parts of just about any of the UML documentation I ever read. And it is immediately obvious when you look at any binary file specification if the person who wrote it knows diddly-doodoo-squat about regular expressions.

      People who know regular expressions -- I mean, actually understand the things, and can explain the theory behind them -- are almost always better coders than the people who don't. This is true across almost all domains. In general, people who don't recognize the link between finite automata and regular expressions, and don't believe finite automata have any place in their chosen domain, are generally just really shitty coders. Sorry.

      --
      What a fabulous troll your post was.... or how fabulously stupid you are. It's impossible to tell.
    12. Re:regexp and programmers by binner1 · · Score: 1

      Could this be the difference between someone with a Computer Science degree, and someone who just graduated from The Devry Institute??

      Sorry, but those fly-by-night, degree in a day places really bug me.

      -Ben

    13. Re:regexp and programmers by Corrado · · Score: 2

      Whenever I interview someone for a position I always ask about any "obscure" programming languages or concepts: Perl, RegExps, Python, Scheme, Lisp, etc. It's not whether they know/use the language, it's how they answer the question. If they say that they don't know anything about it, that tells me that their toolbox is kinda light. These people are usually MCSEs.

      Once, I mentioned regular expressions in a room full of expensive contractors and full-time employees and everyone looked at me like I had suddenly grown an extra head. I was shocked and dismayed. I'm surrounded by amateurs.

      --
      KangarooBox - We make IT simple!
    14. Re:regexp and programmers by Electrum · · Score: 2

      I remember a while back, one of my clients needed to move a bunch of dns records from one server to another. Took me ~ 45 minutes to write a php shell script using REGEX to create new bind zone records for over 300 domains, and convert them - records intact, complete, ready to restart named.

      Forty five minutes? Wow. Had you been using djbdns, you could have been done in thirty seconds. The BIND zone file format is needlessly complex.
    15. Re:regexp and programmers by Just+Another+Perl+Ha · · Score: 1
      Actually... I think it's the difference between someone with an academic* computer science degree and someone with a real engineering degree.

      Theory only takes you so far. Practical application of that theory is where the rubber meets the road.

      * I often relish the irony of the many definitions of this word :-)

    16. Re:regexp and programmers by AndrewHowe · · Score: 2

      Regular expressions are implemented with finite automata. Finite automata are not implemented with regular expressions. Regular expressions are a red herring here... It is finite automata that you actually want your coders to understand.
      The rest of your argument is just hand-waving, "almost always", "in general", "generally"... Very weak.
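
      To make the automaton side of this concrete, here is a small sketch: a hand-written DFA for the language (ab)*c, cross-checked against the equivalent regex. The state names and the example language are my own choices for illustration. Note that Python's `re` module is itself a backtracking engine rather than a DFA, so this only shows that the two descriptions accept the same strings, not how any particular engine is built.

```python
import re
from itertools import product

# Transition table of a DFA for the language (ab)*c over the alphabet {a, b, c}.
# Missing entries mean "reject".
TRANSITIONS = {
    ("start", "a"): "need_b",
    ("need_b", "b"): "start",
    ("start", "c"): "done",
}
ACCEPTING = {"done"}

def dfa_accepts(text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

# Cross-check against the regex on every string over {a, b, c} up to length 6.
PATTERN = re.compile(r"(?:ab)*c")
for n in range(7):
    for chars in product("abc", repeat=n):
        s = "".join(chars)
        assert dfa_accepts(s) == bool(PATTERN.fullmatch(s))
print("DFA and regex agree")
```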

    17. Re:regexp and programmers by binner1 · · Score: 1

      Any academic program should encompass Automata Theory, which in turn includes Regular Expressions, no?

      -Ben

    18. Re:regexp and programmers by K-Man · · Score: 2

      Or rather, they make one want to hit something.

      --
      ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
    19. Re:regexp and programmers by Anonymous Coward · · Score: 0

      Regular expressions are implemented with finite automata. Finite automata are not implemented with regular expressions.

      They are exactly the same thing, dude.

    20. Re:regexp and programmers by AndrewHowe · · Score: 2

      Then why do we have different names for them?
      Back to school for you!

    21. Re:regexp and programmers by Tet · · Score: 2
      I know regular expressions, but funnily enough I almost never need them.

      Utterly bizarre. I use them every day. Not necessarily in code I write (where admittedly, I've used them pretty infrequently). But in everyday tasks -- I couldn't live without sed and grep, and in particular, how can you write code in any editor that doesn't support regexp search and replace? Doesn't that make you hideously unproductive?

      --
      "The invisible and the non-existent look very much alike." -- Delos B. McKown
  11. Disagree, Personal Experience by N8F8 · · Score: 5, Informative

    Case in point: Six months ago I was handed a printed copy of our family history that was to be published by my late uncle. About 1500 pages of history and genealogy. After using a scanner and OCR to get the raw text I used Regular Expressions to:

    1) Identify hierarchical relationships that were only denoted by standard ordered list types (1, 1a, 2, 2a, 3, I, II, etc).
    2) Insert HTML markup to reproduce proper highlighting for names and indented lists.
    3) Generate internal HTML links between individuals using their unique GEDCOM (LDS Genealogy) number within the document.
    4) Build an index for chapters and an appendix to link from name, sorted by surname, back into the main document.
    5) Add special markup for converting the end HTML into indexed and linked PDF using HTMLDoc.

    Time to complete the job -2 Weeks. Without the use of regular expressions this task would have been almost impossible and all my Uncle's work he did to put the information together for the last two years of his life would have been lost.
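
    Steps 1 and 2 in the list above might look roughly like the sketch below. To be clear, this is not the code used for that job: the marker styles, the nesting order (Roman numerals, then numbers, then number plus letter), the sample names and the HTML produced are all assumptions made up for illustration.

```python
import re
from html import escape

# Assumed marker styles: Roman numerals (top level), plain numbers, then
# number+letter. The real document's conventions are unknown.
LEVELS = [
    (0, re.compile(r"^(?P<mark>[IVXLC]+)\.\s+(?P<name>.+)$")),
    (1, re.compile(r"^(?P<mark>\d+)\.\s+(?P<name>.+)$")),
    (2, re.compile(r"^(?P<mark>\d+[a-z])\.\s+(?P<name>.+)$")),
]

def to_html(line):
    """Wrap one OCR'd line in a paragraph, indented by its list depth."""
    text = line.strip()
    for depth, pattern in LEVELS:
        m = pattern.match(text)
        if m:
            return '<p style="margin-left:{}em">{} <b>{}</b></p>'.format(
                2 * depth, escape(m.group("mark")), escape(m.group("name")))
    # Lines without a recognised marker are passed through as plain text.
    return "<p>{}</p>".format(escape(text))

# Invented sample lines, not taken from any real family history.
sample = ["II. John Smith", "1. Mary Smith", "1a. Ann Jones", "Born in Ohio."]
for line in sample:
    print(to_html(line))
```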

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
    1. Re:Disagree, Personal Experience by Anonymous Coward · · Score: 0
      Case in point: Six months ago I was handed a printed copy of our family history that was to be published by my late uncle.

      Sigh. I keep waiting for the paperless society to come about, but it doesn't look like it's going to happen at this rate.

    2. Re:Disagree, Personal Experience by Anonymous Coward · · Score: 0
      all my Uncle's work he did to put the information together for the last two years of his life would have been lost
      Would he also put 2 years writing into Hungarian/Dutch notation?

      Smack your uncle for me, would you?

    3. Re:Disagree, Personal Experience by rnd() · · Score: 2

      You may want to publish the code you wrote (under the GPL, of course). I'd like to convert some family history books to electronic form!

      --

      Amazing magic tricks

  12. Regular Expressions Haven't Changed by jhunsake · · Score: 3, Interesting

    Regular expressions haven't changed since the seventies, at the latest. Now if you want to say that implementations of regular expressions are advancing, fine. Let's be precise in our use of language, or not.

    1. Re:Regular Expressions Haven't Changed by bunratty · · Score: 1
      Regular expressions haven't changed since the seventies, at the latest.
      You're wrong. Regular expressions have gained new features over the years. Positive and negative look-behind are two features that come to mind.

      Perhaps you're thinking of the classical computer-science "regular expression" that really hasn't changed. The book and article cover another type of regular expression that is more powerful than this classical concept. It's unfortunate that two closely related but different concepts share the same term.
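
      A quick illustration of the two look-behind forms, sketched in Python (whose `re` module uses the same `(?<=...)` and `(?<!...)` syntax as Perl). The sample text is made up.

```python
import re

text = "cost $40, qty 12, refund $7"

# Positive look-behind: digits that are immediately preceded by '$'.
prices = re.findall(r"(?<=\$)\d+", text)

# Negative look-behind: numbers NOT preceded by '$'. Including \d in the
# class stops a match from starting in the middle of a number such as "40".
others = re.findall(r"(?<![\$\d])\d+", text)

print(prices)
print(others)
```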

      --
      What a fool believes, he sees, no wise man has the power to reason away.
    2. Re:Regular Expressions Haven't Changed by joto · · Score: 3, Interesting
      Well, that's true because a regular expression is nothing but a compact way to describe a deterministic finite state machine. On the other hand, regexps are not. Regexps have nothing at all to do with deterministic finite state machines, except for the fact that the syntax is inspired by them.

      PS: Note the difference between "regular expression" which is what they teach you about in CS classes, and "regexps", which is what programmers actually use in Perl and many other languages.

    3. Re:Regular Expressions Haven't Changed by jhunsake · · Score: 1

      Fine, then why is the title of the book "Mastering Regular Expressions" and not "Mastering regexps"?

    4. Re:Regular Expressions Haven't Changed by You'reAFuckingMoron · · Score: 1
      Note the difference between "regular expression" which is what they teach you about in CS classes, and "regexps", which is what programmers actually use in Perl and many other languages.

      The fact that you appear to make a distinction between "deterministic" and "non-deterministic" finite state-machines makes me think that you have absolutely no idea how Perl regexps differ from the regular expressions we all learned about in CS class.

      Here's a challenge -- post a Perl regexp that is not a classical regular expression. I'm betting that 90% of the people posting their useless drivel to this topic couldn't do it.

      --
      What a fabulous troll your post was.... or how fabulously stupid you are. It's impossible to tell.
    5. Re:Regular Expressions Haven't Changed by Anonymous Coward · · Score: 0

      Any regexp that uses backreferences is not regular.

      So /(.)\1/ is what you are looking for.

      But that isn't impressive, so why don't I give you a Perl regexp that does something a lot of people say is impossible?

      #! /usr/bin/perl
      my $braces;
      $braces = qr/(\((?:(?>[^\(\)]+)|(??{$braces}))*\))/;

      while (<>) {
          if ($_ =~ $braces) {
              print "Matched '$1'\n\n";
          }
          else {
              print "No match\n\n";
          }
      }

      Yup. Perl (as of 5.6) has a regular expression that will detect balanced parens. (Of course you probably don't know enough about what you are blathering about to realize that regular expressions aren't supposed to be able to do that.)

    6. Re:Regular Expressions Haven't Changed by You'reAFuckingMoron · · Score: 1

      Well, what you posted really only finds a set of balanced parens in a string -- it doesn't really "detect" if it's balanced. To do true detection (without the independent subexpression), you'll need to do something like this (and, of course, this is a Perl regexp, not really a "regular expression" as the poster several levels up defined the terms):

      #!/usr/bin/perl -w

      use strict;

      my ($balance, $re);
      $braces = qr/([^()]*|\((??{$balance})\))*/;
      $re = qr/^(??{$balance})$/;

      while (<>) {
      chomp;
      if ($_ =~ $re) {
      print "balanced parens!\n\n";
      }
      else {
      print "unbalanced parens\n\n";
      }
      }

      --
      What a fabulous troll your post was.... or how fabulously stupid you are. It's impossible to tell.
    7. Re:Regular Expressions Haven't Changed by You'reAFuckingMoron · · Score: 1
      crap.

      :5s/\$braces/$balance/

      --
      What a fabulous troll your post was.... or how fabulously stupid you are. It's impossible to tell.
    8. Re:Regular Expressions Haven't Changed by bunratty · · Score: 1
      Here's a challenge -- post a Perl regexp that is not a classical regular expression.
      Easy! /(a*)b\1/
      It matches strings with an equal number of a's after the b as come before the b. For example, b, aba, aabaa, aaabaaa, etc.

      You can't do that with a classical regular expression, because they can have only a bounded amount of "memory". The capturing parens work around this limitation by storing an unbounded amount of substring.
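
      The same pattern can be tried out in Python, which supports the same backreference syntax. This is only a sketch: it uses `fullmatch` so that the whole string has to have the a...b...a shape, whereas a bare unanchored match would succeed on any string containing a "b".

```python
import re

# (a*) captures a run of a's; \1 must then repeat exactly the same text.
PATTERN = re.compile(r"(a*)b\1")

for s in ["b", "aba", "aabaa", "aab", "abaa"]:
    print(s, bool(PATTERN.fullmatch(s)))
```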

      In classical regular expressions, deterministic and non-deterministic finite state machines are equivalent. In regexes, however, deterministic and non-deterministic engines work differently, and it's often important to know which kind you're using.

      --
      What a fool believes, he sees, no wise man has the power to reason away.
    9. Re:Regular Expressions Haven't Changed by sesquiped · · Score: 1

      Er.. the one you posted, /(.)\1/ is regular. Perhaps you meant /(.*)\1/?

    10. Re:Regular Expressions Haven't Changed by bellings · · Score: 2

      Positive and negative look-behind are two features that come to mind.

      Without doing a formal proof, I'm still fairly certain that positive and negative look-behind are still equivalent to classical regular expressions. Backreferences, however, make Perl regular expressions into another beast entirely, as do independent subexpressions (I think), code refs, and postponed subexpressions.

      --
      Slashdot is jumping the shark. I'm just driving the boat.
    11. Re:Regular Expressions Haven't Changed by joto · · Score: 2
      The fact that you appear to make a distinction between "deterministic" and "non-deterministic" finite state-machines makes me think that you have absolutely no idea how Perl regexps differ from the regular expressions we all learned about in CS class.

      Well, I guess your professor also told you that regular expressions could be used for pattern matching in computers (not just generating strings that were members of a language L). And in this case, that there were two alternatives for implementation, either as a deterministic or a non-deterministic automaton. And since a non-deterministic automaton can be converted to a deterministic one by some simple rules, that leaves only one reasonable alternative for implementation of regular expression pattern matchers on computers: the deterministic one.

      There are two problems with this: First of all, the conversion from nondeterministic to deterministic automata can lead to a state explosion, and second, you might want to add new features to a regexp engine, making it recognize more than what a regular expression can describe, and that can be easier if your implementation does not use the classical deterministic finite state machine. Some implementations (Perl) choose to say they use non-deterministic regexp engines, and while that might be formally meaningless, it gives a pretty good idea of how it works informally.
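
      The state explosion can be seen with a textbook example (my choice, not something from the book): the language "the n-th symbol from the end is an a". The sketch below builds the small NFA for it and counts the DFA states reachable under the standard subset construction; function names are made up for illustration.

```python
def nfa_for_nth_from_end(n):
    """NFA for: the n-th symbol from the end is 'a' (alphabet {a, b}).

    States are 0..n; state 0 loops on everything and guesses the 'a',
    state n is the accepting state and has no outgoing transitions.
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta

def count_dfa_states(delta, start=0, alphabet="ab"):
    """Subset construction; returns the number of reachable DFA states."""
    start_set = frozenset({start})
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for ch in alphabet:
            nxt = frozenset(t for s in current for t in delta.get((s, ch), ()))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)

for n in range(1, 9):
    dfa_states = count_dfa_states(nfa_for_nth_from_end(n))
    print(n + 1, "NFA states ->", dfa_states, "DFA states")
```

      For this family the NFA grows linearly while the reachable DFA doubles with each step, which is the blow-up referred to above.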

    12. Re:Regular Expressions Haven't Changed by bunratty · · Score: 1
      Without doing a formal proof, I'm still fairly certain that positive and negative look-behind are still equivalent to classical regular expressions.
      Still, these are new features that have been added to regexes in the past few years. Even if they don't offer any additional power, they at least make regexes easier to read in practice.
      --
      What a fool believes, he sees, no wise man has the power to reason away.
    13. Re:Regular Expressions Haven't Changed by cifey · · Score: 1

      The basics are the same with a few add-ons, but when implementing new regex packages it's useful to know the performance issues and usage caveats.

      --
      Hello Cruel World
  13. Negative numbers by dillon_rinker · · Score: 2

    Time to complete the job -2 Weeks
    That's pretty cool...regexps let you finish jobs two weeks before you start them. /me ducks and runs

  14. Ummm.... by MemeRot · · Score: 4, Funny

    "Let's be precise in our use of language, or not."

    Very compressed contentlessness.

  15. soon, one book won't do by Anonymous Coward · · Score: 0

    After reading Larry's Apocalypse 5 and Friedl's article on his second edition, I think that pretty soon one book won't be sufficient for this topic. It'll have to be broken down by language.
    Incidentally, although I see some of the benefits that Perl 6 will offer, I'm still upset about having learned all that arcane syntax, only to see it deprecated by Larry in favor of new arcane syntax. He has obsoleted half of my book shelf.

    1. Re:soon, one book won't do by Anonymous Coward · · Score: 0

      He has obsoleted half of my book shelf.

      You realise that O'Reilly actually employ Larry Wall - so that's not a surprise.
      If he didn't make vast sweeping changes to Perl every couple of years, O'Reilly's revenue stream would dry up...

  16. Where's Clippy when ya need him .. by TheViffer · · Score: 5, Funny

    "I see that you are writing a regular expression"

    --
    -- Knowing too much can get you killed, but knowing who knows too much can make you rich.
    1. Re:Where's Clippy when ya need him .. by fava · · Score: 3, Funny

      Shouldn't that be:

      "I see that you are swearing, would you like to use a thesaurus"

  17. Getting started with regular expressions by paj1234 · · Score: 5, Informative

    I have the first edition of "Mastering Regular Expressions" and it is indeed a very fine useful book.

    For a nice way to get started with regular expressions I recommend the wonderful "txt2regex" console program. It provides a simple text based wizard-like interface. You answer questions and the program builds your regular expression for you. See:

    http://txt2regex.sourceforge.net/

    1. Re:Getting started with regular expressions by civilizedINTENSITY · · Score: 2

      "apt-get install txt2regex" and my use of regex has changed forever! Wonderful little program!

  18. "regular expression" (was: Contentless article) by Lumpish+Scholar · · Score: 2
    A regex is a type 3 grammar. Type 3 grammars haven't really changed since Chomsky's time.
    The smartarses will now proceed to point out that
    a) Perl is actually ...
    ... using the phrase "regular expression" to describe something quite different from "the stuff that's computationally equivalent to a finite state machine" or "the kind of thing Kleene worked on"; imprecise, but most people know what you mean when you say it.
    --
    Stupid job ads, weird spam, occasional insight at
  19. Mod parent up! by Dr.+Zowie · · Score: 2

    Where are moderator points when you need 'em?

    To those who can't read (or write) them, regular expressions look like line noise. But once you learn to read them you can condense whole paragraphs of spaghetti conditionals into a single, clear (to the initiated), terse line.

    For manipulating strings of characters, they are probably the single most important innovation of the last 20 years.

    1. Re:Mod parent up! by lars · · Score: 1

      For manipulating strings of characters, they are probably the single most important innovation of the last 20 years.

      Regular expressions have been around since the 1950's.

    2. Re:Mod parent up! by mikec · · Score: 3, Informative

      Regular expressions were certainly an important innovation, but they're a lot more than 20 years old. They were first studied by Kleene in the mid-1950's. The first algorithm to translate them into DFA's was invented in about 1960. Lex was written in the mid 70's.

  20. Now, if only Google would support regexp search... by peterzen · · Score: 1

    ...we could find everything we wanted, not only nearly everything we're looking for on the internet. Imagine the power of pattern matching (i.e. Perl-compatible regexps) on Google's database!

    Seriously, how hard do you guys think it would be to implement such a feature? I can't imagine they haven't played with the idea -- they've probably estimated the need for additional resources, and thrown it right away :-)

  21. Contentless posting by A+nonymous+Coward · · Score: 2

    Maybe to a certain small class of people, "regular expression" means what you want it to mean. To 99.99% of the people who use the phrase, it means what the book describes, and those things have changed considerably.

    Many precise mathematical or scientific terms have different meanings to laymen. What is a positive number? I'm sure I learned whether 0 is a positive number way back when, but right now it simply doesn't matter. Context is usually good enough, and when not, > and >= work wonders. Quantum leap as used by mere mortals has the meaning of incredible revolutionary exciting change, but scientifically, it means the smallest possible change.

    So foo to you.

  22. regex: wrong tool for the job by pHDNgell · · Score: 1
    One has a feeling that regexp engines are just becoming programming languages in and of themselves - the only difference being that the 'program' consists of a string of cryptic single character commands, and the input is limited to a single string.

    You forgot to mention that it also does what you're trying to accomplish as a side-effect of what you actually told it to do.

    That's the thing that bothers me the most. The examples people are giving here are symptoms of the real problem. If you want to know if this piece of data you're looking at is a number, use a programming language that supports numbers and see if you can make a number out of it, then use the number.

    Want to know if your input contains ``nasty characters?'' Wrong, you don't. You want to know if it's what you are expecting. At best, you want to know if it contains only characters you have authorized in a proper sequence that makes sense. You can sometimes do this with a regex, but not treating your data as strings in the first place gets rid of most of the desires to do textual processing on it. It makes your application development easier, more correct, and easier to read. It also makes it less likely to need to be modified down the road when you realize you overlooked unicode conversions, null characters, or other things that cause security holes.

    --
    -- The world is watching America, and America is watching TV.
  23. Validate XML? by King+of+the+World · · Score: 0, Troll

    Is there a regexp to validate XML?

    1. Re:Validate XML? by bunratty · · Score: 5, Informative
      Is there a regexp to validate XML?
      No, you cannot even tell if XML is well-formed with a regex. The reason is that it takes an unbounded amount of memory to remember which tags are still open, but regexes have only a bounded amount of memory.

      One of the important aspects of using regexes is to know their limits and not try to use them outside of those limits.
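
      To make the memory argument concrete, here is a rough Python sketch (not from the thread). It uses a regex only to pick out tags and a stack, which can grow without bound, to remember which ones are still open. The `TAG` pattern and the `tags_balanced` name are made up for illustration, and the pattern deliberately ignores attributes, comments, CDATA and everything else real XML has.

```python
import re

# Toy tag pattern (an assumption of this sketch): <name>, </name> or <name/>,
# with no attributes, comments, CDATA or processing instructions.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")

def tags_balanced(text):
    """Return True if every opening tag is closed in the right order."""
    stack = []  # names of tags that are still open; this is the unbounded memory
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            if closing:
                return False  # something like </a/> is never valid
            continue          # <br/> opens and closes at once
        if closing:
            if not stack or stack.pop() != name:
                return False  # closing tag without a matching opener
        else:
            stack.append(name)
    return not stack          # anything left on the stack was never closed
```

      The regex does the part it is good at (recognizing one tag), and the stack does the part no finite automaton can.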

      --
      What a fool believes, he sees, no wise man has the power to reason away.
    2. Re:Validate XML? by Anonymous Coward · · Score: 0

      No. XML is defined by a context-free grammar, so it is a context-free language, while regular expressions define only the regular languages, which are a proper subset of the context-free languages.

      Summary: regular expressions are strictly less powerful than context-free grammars.

    3. Re:Validate XML? by King+of+the+World · · Score: 1
      Oh arsebuckets. I'm doomed... dooomed...

      Regardless, I knight thee sir bunratty.

    4. Re:Validate XML? by Anonymous Coward · · Score: 0

      but regexes have only a bounded amount of memory.

      oh, unlike other things running on real computers in the real world?

    5. Re:Validate XML? by BJH · · Score: 1

      I'm so glad to know my PC has an infinite amount of memory to parse XML! Jeez, I guess I didn't need to buy that extra stick of RAM after all.

    6. Re:Validate XML? by Anonymous Coward · · Score: 0

      No. Matching tags is the same problem as recognizing palindromes. A DFA can't count forever.

    7. Re:Validate XML? by Anonymous Coward · · Score: 0

      This is a strange way of explaining things.

      What you said is correct in that every regexp has an equivalent finite state machine (and vice versa.)

      I would put it this way. One consequence of the pumping lemma for regular languages is that you cannot use regular expressions to check that brackets match up. Hence you cannot use them to parse XML (although I see no reason why you couldn't come up with a regular-type language for searching trees, which might be useful for XML processing.)

    8. Re:Validate XML? by transiit · · Score: 2

      This is a pretty lousy explanation of how things work, although I do agree that regular expressions have limits and you should be familiar with what they are.

      I might not get every single bit of the technical vocabulary right on this one, but do try to follow along anyhow (and please only refute the REALLY glaring errors).

      Basically, with regular expressions, you get what Chomsky (famed linguist and political extremist...er, nut =) ) referred to as a type-3 grammar, or roughly something that can be solved with a deterministic finite-state automaton (DFA. Ok, you might argue that you get an NFA (nondeterministic finite-state automaton), but using the subset construction, it's so easily converted to a DFA, we'll just pretend we're working with a DFA.)

      Basically, a DFA works like this: Think of a table. You start out at one row (the starting state) and based on the input you get, you move to another row/state. One or more of these states is specially marked as an accepting state, so if you run out of input characters on one of those states, the string is accepted and everyone is happy. If you run out of input on any state not marked as such, the string is rejected. (DFA's are often expressed as graphs as well, but from a programmatic perspective, it's really easy to just use a table, or for the pedantic, a list of nodes (containing the current state, and where to go on any possible input...a list of lists)).
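
      The table view translates almost directly into code. What follows is only a sketch in Python, with a made-up example language (strings over a and b that end in "ab"); the names `TABLE`, `START`, `ACCEPTING` and `accepts` are mine, not anything standard.

```python
# One row per state, one column per input symbol.
# State 0: nothing useful seen yet; 1: just saw "a"; 2: just saw "ab".
TABLE = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}
START = 0
ACCEPTING = {2}

def accepts(s):
    """Run the DFA over s and report whether it stops in an accepting state."""
    state = START
    for ch in s:
        row = TABLE[state]
        if ch not in row:
            return False      # symbol outside the alphabet: reject
        state = row[ch]       # move to the next row of the table
    return state in ACCEPTING # out of input: accept only on a marked state
```

      Note that the only thing the machine remembers between characters is the single value in `state`.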

      Maybe we can go more simple than that: You're sitting on a nerdly board game. You draw a card that says "B. Go to the square labeled as 'R'", so you go to R and draw another card. You keep picking cards and following the directions on them until you get "$: The game is over. If you're on a square that is labeled with an underlined letter, you win. If the letter is not underlined, you lose."

      So what does this mean? In terms of compiler/language theory, we can use a regex to recognize tokens (or individual words), but they aren't very powerful when it comes to syntax (Our lexer would be happy with "sentence This a is.", but by our grammar, it doesn't make a lot of sense. We, as people, could guess the meaning, but computers are still really bad at guessing anything (especially your weight). A parser would be necessary to figure out if things make any sense by the rules of the grammar, which would refuse "sentence This a is." but accept "This is a sentence.") If you're setting up the rules of our example nerdly board game, you could set up a number of states that could find any word in your language. (If the first state is "E" and the next state is "a" and the next state is "t", followed by "$" (commonly used for end of input, but in your language you might specify the end of the word being a space or some bit of punctuation instead), you'll have successfully parsed "Eat", which by your rules is considered to be a valid accepting state. In the same sense, if you pick "R" followed by "Z", you might move off to some error state you've specified, where no matter what input follows, you'll always loop back around to that same state, because you know for certain there's no word you want to accept that starts with "RZ".)

      So to answer the question "can regexp validate XML?", the answer is yes, in the sense that it can be used to scan for valid XML components (words), and no, in that it can't tell well-formed XML from poorly-formed XML tags (sentences). A regex alone isn't quite powerful enough to understand that ">>>>XML" and "<XML>" aren't both perfectly acceptable.

      Sort of.

      Could you write enough rules that some really large set of regex could do it? Maybe, but it's a mathematical proof that's way out of my league, but I'll warn you now: you'll be writing so many cases for every possible permutation that you'll probably go batty trying. Part of what all this language theory got us was an understanding that some tools are good at one task, but lousy at others.

      If you're interested in this further, the Dragon book (search for it on google, you'll find it as "dragon book" faster than its real title, which I've forgotten) is considered the canonical source for this sort of thing, although it can be horribly dry and hard to read. There are some other compiler theory books out there, and some aren't quite as dull (though arguably less informative. I wasn't able to prove my nerdliness by reading more than a handful of pages of the dragon book, though I found it to be a great reference for filling in the gaps of the other books (which were more prone to shameless hand-holding))

      comp.compilers can be a good source as well, though sometimes a bit intimidating. Read through it, see if you can find references to the stuff you don't really understand, and just try to absorb what's there.

      -transiit

    9. Re:Validate XML? by Pembers · · Score: 2, Informative

      You're correct in saying that regexps alone can't validate XML (or any hierarchical structure, come to that). This is an instance of the bracket-matching problem: given a string composed of opening and closing brackets that can nest, determine whether the string is properly balanced or not. For instance, ()() and (()()) are balanced, while (() and (())) are not.

      The reason that a regexp can't do this is that it can't keep track of which opening brackets haven't been closed. A regexp has no memory of what it's already seen. All it knows is what state it's in now, and what token is coming next. OK, some programming languages implement regexps in such a way as to provide some sort of memory of what's been seen, but these usually feel like kludges.

      If you're prepared to put up with an arbitrary limit on how deeply you can nest brackets, then you can solve the bracket-matching problem with an automaton that has N+1 states, numbered 0 to N. If the automaton is in the state numbered x, that means that it's seen x opening brackets that haven't been closed yet. The instructions for each state would be "if you see an opening bracket, go to state x+1, if you see a closing bracket, go to state x-1, and if you see the end of the string, it isn't balanced." Exceptions would be that in state 0, if you see the end of the string, it's balanced, and if you see a closing bracket, it isn't balanced. In state N, if you see an opening bracket, the brackets are nested too deeply.
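
      A rough sketch of such a depth-limited automaton in Python follows. It numbers the states by depth starting from 0, and adds two trap states, "unbalanced" and "too_deep"; those names, and the functions `build_table` and `classify`, are assumptions of this sketch. Input is assumed to contain only "(" and ")"; any other character raises a KeyError.

```python
def build_table(max_depth):
    """Build the transition table for bracket matching up to max_depth levels."""
    table = {}
    for depth in range(max_depth + 1):
        table[depth] = {
            "(": depth + 1 if depth < max_depth else "too_deep",
            ")": depth - 1 if depth > 0 else "unbalanced",
        }
    # Trap states: once entered, every further input stays there.
    for trap in ("too_deep", "unbalanced"):
        table[trap] = {"(": trap, ")": trap}
    return table

def classify(s, max_depth=3):
    """Return 'balanced', 'unbalanced' or 'too_deep' for a string of brackets."""
    table = build_table(max_depth)
    state = 0
    for ch in s:
        state = table[state][ch]
    if state == 0:
        return "balanced"
    if state == "too_deep":
        return "too_deep"
    return "unbalanced"   # either the trap state or brackets left open
```

      The table has a fixed, finite number of rows, which is exactly why the depth limit cannot be removed without leaving the world of finite automata.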

      Of course, no theoretical computer scientist would ever accept arbitrary limits on how deeply a structure could be nested, which is why you would use a context-free (aka type 2) grammar to solve problems like this one.

    10. Re:Validate XML? by transiit · · Score: 2

      yeah, after rereading my comment a day later, I realize I did slip up in part of my explanation: If your first state is "E", the next character to input is "a", you would go to state "Ea", and with "t", go to "Eat", which is listed as an accepting state.

      Our good friend (or foe, depending on what you're trying to prove) the pumping lemma does give us an idea that N 'A's followed by N 'B's is impossible in this context (or A^n followed by B^n isn't regular).

      You've got the easy hack of "Take all possible tokens. Create an additional set of states for each where there's an arbitrary number of parentheses, brackets, etc. in front of them and the same number behind them" (which would be really big, unless you've got a really small list of tokens or cut off the number of parentheses, etc. to a really small number, or both.)

      A^nB^n doesn't work. A^mB^n does work, provided you don't try to be too specific about what M or N are, just that they're some nonnegative number.
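
      A small Python sketch of the difference, using the `re` module: A^mB^n is a plain regex, while A^nB^n needs a counting step done outside the regex. The function names are made up for illustration.

```python
import re

def is_am_bn(s):
    """Some A's followed by some B's, counts unrelated: regular, so a regex suffices."""
    return re.fullmatch(r"A*B*", s) is not None

def is_an_bn(s):
    """Same shape, but the counts must be equal: the comparison happens in code."""
    m = re.fullmatch(r"(A*)(B*)", s)
    return m is not None and len(m.group(1)) == len(m.group(2))
```

      The `len(...) == len(...)` comparison is the bit of memory that the regex alone does not have.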

      On another note, it's good news to me that I did get most of the meat of it right. Glad to hear I'm not getting the second-rate education I'd feared. =)

      -transiit

    11. Re:Validate XML? by Pembers · · Score: 1

      Yes, the number of states would rapidly become unmanageable if you tried to hack together a finite state automaton to recognise a language such as A^nB^n for limited n. Actually, the problem isn't so much the number of states as the number of transitions between them, which is roughly the number of states multiplied by the number of symbols in your alphabet. This is sometimes known as the state-symbol product.

      I suppose you could use some sort of regexp compiler to take the grunt work out of it. Ah... any programming language that has regexps has one of those built in anyway. Well, being able to write something like /A^nB^n/(n<=3) would be an advance on /^(|AB|AABB|AAABBB)$/, I suppose.
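
      Taking the grunt work out of it could look something like this in Python. It is only a sketch: `bounded_anbn` is a hypothetical helper that writes out the alternation for every n up to a limit.

```python
import re

def bounded_anbn(max_n, a="A", b="B"):
    """Compile a regex matching a^n b^n for every n from 0 to max_n."""
    alternatives = [re.escape(a) * n + re.escape(b) * n for n in range(max_n + 1)]
    # Group the alternatives so anchoring applies to all of them, not just the ends.
    return re.compile("(?:" + "|".join(alternatives) + ")")

up_to_three = bounded_anbn(3)
```

      Calling `up_to_three.fullmatch("AABB")` succeeds, while anything with n above 3 falls outside the pattern, and the pattern text grows with the square of the limit.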

      On the subject of education, I did a one-term course on theoretical computer science in general, and another specifically on formal languages. I can safely say they've been of no direct, practical use, but they provided a good foundation to what I studied in other courses and what I've learned since graduation.

      Perhaps it's not important for everyone in computing to know what a DFA is, or understand the pumping lemma or the Church-Turing hypothesis, but it's necessary that some of us do. Otherwise, who will write the compilers for the next generation of programming languages? Who in the team that builds one of those compilers will tell the PHB that (the general case of) the halting problem is unsolvable, so that the marketing department really shouldn't be claiming that it will be able to detect infinite loops in a user's program before running them?

  24. behind-the-scene purpose by jfriedl · · Score: 5, Informative

    The original poster says that the "behind-the-scene purpose is apparently to push a new book that O'Reilly published this month". Actually, that's pretty much the main point of the article -- to justify the need for a second edition, and to let people know what they'd get (or, if not interested, what they're passing on).

    I wrote the article so that people would have a feel for what's new in the book. Of course, my hope is that people are interested in the new content, but my general feeling is that the worst that can happen is that someone buys the book and finds out that it's not what they expected. Unmet expectations pretty much suck, and I hope the article helps avoid some of that suckage.... and piques some interest, as well.

    Jeffrey

    1. Re:behind-the-scene purpose by rjamestaylor · · Score: 1
      I just want to say, "Thanks!" My life is so much easier now that I have half an idea how to use regexp (after buying and reading the first book).

      My former boss, a techie who became CEO, told me the day I was hired that if I wanted to become a guru I needed to master regular expressions (and awk, but that's another, and pre-perl, story).

      --
      -- @rjamestaylor on Ello
    2. Re:behind-the-scene purpose by imr · · Score: 3, Interesting

      thanks for your book.
      Everybody here and there is going to say how informative it is. But what struck me the most is that it is well written.
      It was very pleasant to read it, apart from the knowledge I got from it. If only all manuals ...

    3. Re:behind-the-scene purpose by Anonymous Coward · · Score: 0

      Hi,

      Does the book cover the Boost C++ regular expressions library?

      Thanks

    4. Re:behind-the-scene purpose by ./ · · Score: 1
      Jeff, I've always wanted to build a regex that extracts your first name from your initials: (in Ruby)
      "Jeffrey E. F. Friedl".match(/^[^J]*(.)[^E]*(.)[^F]*(.)[^F]*(.).+$/).captures.join

      => JEFF

      Thanks for the amazing first edition. It sits next to TCP/IP Illustrated in the highest place of bookshelf honor.

      Off to beg for funds to pay for the 2nd edition... :)

    5. Re:behind-the-scene purpose by ProfKyne · · Score: 2

      I wrote the article so that people would have a feel for what's new in the book.

      As with almost every other programmer out there, I agree that "Mastering Regular Expressions" is one of the best-written and most useful programming books there is. I know a lot of people would probably buy the second edition regardless. But the article/book review cemented my decision, since it covers Java and PHP (and even that wacky MS stuff, huh?).

      --
      "First you gotta do the truffle shuffle."
    6. Re:behind-the-scene purpose by jfriedl · · Score: 1
      In case anyone is still reading, I've put up a web site for the book that has some things that may help you decide if the book is for you or not (full index, table of contents, etc.).

      http://regex.info

      Jeffrey

  25. Oops there goes another rain forest by Mrs.Trellis · · Score: 0

    I own a large number of O'Reilly books and on the whole I think they're great, but I'm loath to buy books when they include a lot of material that I'm not interested in: .NET, VB, Java, etc. I'm sure it works the other way too; an MS developer probably doesn't give a monkey's about awk or Python, so why do they have to devote whole chapters to stuff you're not likely to use? Waste of paper & money, methinks.

  26. Friedl's book is a must read for Perl folks by Lumpish+Scholar · · Score: 5, Insightful

    It's not just a Perl book, but the language independent and Perl dependent parts are a godsend.

    I was a full time Perl programmer (with a two hour commute by rail) when Friedl's book came out. I read it cover to cover, and then recommended it strongly to my co-workers.

    Friedl shows how to write powerful, readable, efficient regular expressions that can do a lot of the work your program needs to do. It changed how my group wrote Perl (very much for the better). This is more than highly recommended; after the Blue Camel, and even before the Cookbook, this is a definitive book for all those who call themselves "Perl programmers."

    (In the first edition of the book, Friedl discovered some problems with regular expressions in early versions of Perl 5. The very next release of Perl -- 5.003, I think -- immediately fixed these problems. When Larry & Co. pay attention to a Perl book, maybe you should, too?)

    --
    Stupid job ads, weird spam, occasional insight at
  27. What?! by Myuu · · Score: 2, Funny

    Mark me as a troll or whatever but, "What are regular expressions?"

    Are they those /:+[^:]/ statements? What's the big deal then?

    I'm really, really new to Perl, studying it out of an O'Reilly book. What does this mean to me?

    --

    forget it.
    1. Re:What?! by Anonymous Coward · · Score: 0
      Mark me as a troll or whatever but, "What are regular expressions?"

      Are they those /:+[^:]/ statements? What's the big deal then?

      I'm really, really new to Perl, studying it out of an O'Reilly book. What does this mean to me?

      Nothing. The dot-com thing is over -- go back to your English Lit. Ph.D.

    2. Re:What?! by cant_get_a_good_nick · · Score: 2

      If you don't know, I guess O'Reilly has another book sale...

      Regexps are a very powerful search/replace tool. One of the reasons Perl is so popular is that it has a powerful, easy to use (and by this, I also mean easy to invoke; ever try this in C? yeeesh) regular expression parser. Makes text processing very easy.

      If you're learning Perl out of the Camel book, you'll be fine. It has a good explanation of it. Once you see the power of it, you'll likely wonder how you got along without it.

    3. Re:What?! by jbolden · · Score: 3, Informative

      The major reason to learn Perl is powerful string manipulation. Those "those /:+[^:]/ statements" are the power string manipulators. Try to do anything hard with strings in any language without regexes then you'll understand what the big deal is.

    4. Re:What?! by Anonymous Coward · · Score: 0

      A regular expression is a grammar that recognises some family of strings.

      Your concrete syntax may vary, but here goes:

      0 is the regular expression that matches the empty string;

      c is the regular expression that matches the character c;

      if r and s are regular expressions then
      - (rs) (concatenation) is the regexp matching r followed by s,
      - (r|s) (alternation) is the regexp matching r or s, and
      - (r)* (Kleene closure) is the regexp matching zero or more occurrences of r.
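
      Those three constructions are enough to write a tiny matcher. The following is a Python sketch of my own, not how any real regex library works: each pattern is a small tree, and `ends` computes every position where a match starting at a given offset could stop. The class and function names are invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:           # matches only the empty string
    pass

@dataclass(frozen=True)
class Char:            # matches the single character c
    c: str

@dataclass(frozen=True)
class Cat:             # (rs): r followed by s
    r: object
    s: object

@dataclass(frozen=True)
class Alt:             # (r|s): r or s
    r: object
    s: object

@dataclass(frozen=True)
class Star:            # (r)*: zero or more occurrences of r
    r: object

def ends(rx, text, start):
    """Set of positions where a match of rx beginning at `start` can end."""
    if isinstance(rx, Empty):
        return {start}
    if isinstance(rx, Char):
        return {start + 1} if text[start:start + 1] == rx.c else set()
    if isinstance(rx, Cat):
        return {e for m in ends(rx.r, text, start) for e in ends(rx.s, text, m)}
    if isinstance(rx, Alt):
        return ends(rx.r, text, start) | ends(rx.s, text, start)
    if isinstance(rx, Star):
        seen = {start}                 # zero occurrences always works
        frontier = {start}
        while frontier:                # keep adding one more occurrence
            frontier = {e for m in frontier for e in ends(rx.r, text, m)} - seen
            seen |= frontier
        return seen
    raise TypeError(f"not a regular expression node: {rx!r}")

def matches(rx, text):
    """True if rx matches the whole of text."""
    return len(text) in ends(rx, text, 0)

# (a|b)*c written as a tree
AB_STAR_C = Cat(Star(Alt(Char("a"), Char("b"))), Char("c"))
```

      With this, `matches(AB_STAR_C, "abac")` is true and `matches(AB_STAR_C, "abca")` is false.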

      You can do various things with regular expressions such as take their complement or intersection. Every regular expression has a corresponding finite automaton (aka finite state machine) and vice versa.

      The power of regular expressions is limited: for instance, you can't use them to spot palindromes or perform bracket matching (see the Pumping Lemma.)

      Most regexp libraries have convenient abbreviations and extensions, e.g. typically . matches any character except newline, [...] matches any character in ..., [^...] matches any character not in ..., ^ and $ match the start and end of a line respectively, and so forth.

      Regular expressions are typically used to find substrings matching a particular pattern.

      Quite how an entire book (let alone many books) is needed to cover the subject is beyond me.

      - Ralph

  28. well.. by joeldg · · Score: 1

    Time to bust out the books and get cooking again.. I still need to catch up on *current* regex.. They keep changing some of the functions around in PHP as well (anyone else had this issue?), which breaks code when you switch from one box to another. Anyway... This looks interesting..

  29. heh by Anonymous Coward · · Score: 0
    No. XML is not regular. It's context-free.

    AUEOUHAEHORDEU waiting for 20 seconds to elapse la la la la la la thank you Slashdot this is a very good use of my time EHOEUHEOre sou Eu urceu EC Preo U SOUCHPe rOE.

  30. seriously, the article was boring by Anonymous Coward · · Score: 0

    "Java has regexes now"? Why did he need 2 paragraphs to say that?

    1. Re:seriously, the article was boring by Anonymous Coward · · Score: 0

      Because it's hard to say "You should buy this book" and have it be meaningful.

  31. what? by Anonymous Coward · · Score: 0
    They're not exactly the same thing, but they're certainly related to one another. Regexes used in practice are not "more powerful" than the kind Chomsky described (depending on your definition of "powerful"): there's nothing you can compute with a Perl5 regex that you can't with a finite automaton.

    People come up with different implementations and change syntax slightly now and then to make some things more convenient to write, but it's all the same fundamentally.

    1. Re:what? by Fizgig · · Score: 2, Interesting

      A Perl "regular expression" is more powerful than a mathematical "regular expression." Perl's support backreferences, which a finite automaton can't handle.
      The Perl "RE" "^(a+)b\1$" will match aba and aaabaaa, but not abaa or aaba.
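
      A quick way to check this kind of claim, sketched with Python's `re` module rather than Perl itself (it supports the same `\1` backreference). Whether a string "matches" depends on anchoring, so the sketch shows both whole-string and substring matching; the helper names are made up.

```python
import re

# A run of a's, then b, then exactly the same run of a's again.
PATTERN = re.compile(r"(a+)b\1")

def whole_string_matches(s):
    """True only if the entire string has the form a^n b a^n."""
    return PATTERN.fullmatch(s) is not None

def contains_match(s):
    """True if some substring has the form a^n b a^n."""
    return PATTERN.search(s) is not None
```

      Anchored to the whole string, "aba" and "aaabaaa" pass while "abaa" and "aaba" fail; unanchored, "abaa" passes too because it contains "aba".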

    2. Re:what? by jhunsake · · Score: 1

      Then Perl's use of the term "regular expression" is a misnomer. This post makes the correct analysis of the situation. Those who came up with Perl's "regular expressions" should have thought of (or invented) a more accurate term.

    3. Re:what? by Fizgig · · Score: 1

      Well, they were originally from grep (named after the ed command g/re/p, "globally search for a regular expression and print"), which, AFAIK, is equivalent to a finite automaton. Perl/awk/sed adopted these, and then extended them. It would have been kind of awkward to start calling things which used to be regular expressions something else halfway through the evolution of a tool. But yes, if you want to be pedantic, Perl's "regular expressions" are too powerful to be called by that name.

  32. A different look at string scanning by Anonymous Coward · · Score: 1, Interesting

    Way back when there was a programming language called "Snobol". It still lives (www.snobol4.com for a good starting point).

    Snobol is *THE* string pattern matching language. Nothing else beats it (and I've been playing around with string processing languages for over 20 years).

    Yes.. its syntax is different and the language hasn't changed in years (decades?). But it does the job exceedingly well.

    You might also want to take a look at the Icon programming language (www.cs.arizona.edu/icon).

    Icon was developed by some of the same folks that developed Snobol. While not quite as powerful as Snobol in terms of expressing patterns, Icon extended some concepts. You can build up your own pattern matching functions.

    One of the best quotes I saw in a discussion concerning Icon and regular expressions (the discussion was that Icon lacked a builtin regular expression facility) was

    "Putting regular expressions into Icon would be like putting training wheels on a Harley" -- (I really wish I could remember who said that).

    Anyway... just something you might want to check into.

    1. Re:A different look at string scanning by Anonymous Coward · · Score: 0

      And I suppose K is the next great programming language and is so perfect that no one wants to use it?

      There is a reason why some things are used and others not, especially for things that have been around for a reasonable amount of time.

    2. Re:A different look at string scanning by Anonymous Coward · · Score: 0

      I did some work in SNOBOL back in my undergraduate days at the People's Republic of Bezerkley. I used to tease people that in SNOBOL, the two most important operators were the blank. A crucial part of SNOBOL was "The Pattern Matching Algorithm." But it was only implied by the string which was the pattern. It was my impression that ICON was the author's recantation of his sins in devising SNOBOL. He made the PMA subject to the normal techniques of IF THEN, repetition, etc.. But my days of a language a day keeps the draft board away were gone, so I never learned ICON.
      Ross

  33. Re:Now, if only Google would support regexp search by platypus · · Score: 2

    Well, maybe they thought about it. But if they only implement a (non-trivial) subset of regexp search I will admire them even more.
    Regexps are horrible from a complexity point of view.
    According to this link, regexp matching has a complexity of O(M*N), where M unfortunately is on the order of Google's DB, if my short calculation is correct. Note, this may be wrong, but the point stands that regexp searching is quite expensive and kills most of the optimizations you could do if you didn't have to provide it.

  34. Re:Now, if only Google would support regexp search by quasi_steller · · Score: 2, Insightful

    The problem with regular expressions is that there are so many constraints. for example:

    \<John.+Doe\>
    should match:
      JohnBDoe
      JohnandDoe
      JohnDoe
      JohnClark
      ...text...JaneDoe
    But this shouldn't match:
      "Doe Re Me," sang John
      "Jane Doe and John
      "John Doe"

    As you can see, even with a very simple regular expression like this, the text has to be processed a lot to get the results needed. A simple "John AND Doe" would match all of the results while the regular expression puts more restraints on the search, which takes longer to process. For complex regular expressions, the searching of text becomes too slow for large amounts of data, such as the internet.

    --
    ...interesting if true.
  35. Re:Now, if only Google would support regexp search by Anonymous Coward · · Score: 0

    "John AND Doe" _is_ a regular expression.

  36. VB and Regexes by Pinball+Wizard · · Score: 2
    From the article --

    Whether you love Microsoft or hate it, there's no denying the popularity of Visual Basic. With the regular-expression package in the .NET Framework, Microsoft provides a package that can be used by VB.NET, C#, Visual C++, and any other language that wants to link to it -- even Python and Perl! The consistency is appealing, but even more important is the package itself: it's powerful and fast, and it can hold its head up high next to Perl or any other regex package out there.

    VB's regex syntax is exactly like Perl's. In fact, when I started working with regexes in VB and I couldn't find something in the documentation I would look it up in one of the O'Reilly Perl books. Much to my "shock", I could do everything Perl regexes could do, even the things that weren't in the documentation.

    I strongly suspect Microsoft took full advantage of Perl's "artistic license" when they came up with their regex engine.

    --

    No, Thursday's out. How about never - is never good for you?

    1. Re:VB and Regexes by Anonymous Coward · · Score: 0

      Nope :)

      We used PCRE. And they get props in the smallprint of one of the readmes.

      Jeff C. [MS by day, linux by night]

    2. Re:VB and Regexes by Tarpan · · Score: 2, Insightful

      Heh.. isn't the whole point of posting as AC to be just that, anonymous. Then why the hell did you sign it? ;) (assuming you did and not some impostor)

  37. perl 6 is gonna change all this by millette · · Score: 4, Insightful
    Anyone here who read the latest Perl apocalypse, #5 it was, knows full well that regexes as we know and love them are out the window. The apocalypse is a large document, so I picked this page to give you a little idea of what's going to change. The pages before that mention all the warts that Larry wants to bury.

    I understand that Perl 6 isn't near being done, and that the "r" in "Perl" doesn't necessarily stand for "regex", depending on who you ask, but Perl will always have the greatest influence over what is called a regex. Or is that going to change with Perl 6?

    1. Re:perl 6 is gonna change all this by coleSLAW · · Score: 1

      You're joking that you love Perl 5 regular expressions, right? They are extremely ugly and difficult to parse. Perl 6 regular expressions will be incredibly nice. Just compare something like: s{:(.*?):}{split ";", \1}ge with s:e{: (.*?) :}{$(split ";", $1)} for clarity? Of course, if you are really enamoured with Perl 5 regexps, you can either use re perl5; globally, or for individual regexps, s:p5{PATTERN}{EXPR}

      --

      == I am not Me.

    2. Re:perl 6 is gonna change all this by millette · · Score: 1

      "know and love them", that would count as an expression with an ounce of satire.

  38. positive number by jbolden · · Score: 1

    It means greater than 0. You use "non-negative" to include 0.

  39. Re:regex: wrong tool for the job by jbolden · · Score: 1


    I recently wrote a program taking an SCS data stream with embedded DJDEs that was intended to go through an SCS->3211 converter (that is, the data contains both line printer commands and 3211 syntax commands that get passed through) and converted this data into AFP. You bet I needed power regexes for this.

    You use regexes to write parsers and parsers can't treat their strings as just "data".

  40. Is that so? by ochinko · · Score: 2, Insightful
    MSDN Library [microsoft.com] is the best single reference for everything Microsoft.

    Well, I don't find it fair that you were modded as a troll. You may be just misinformed.

    I can tell you that _any_ decent *nix gives you complete knowledge of what is going on in your machine. Without having to look at source code, without having to go to some central repository of information.

    Now, press Ctrl-Alt-Del in your favorite Windows and take a look at the name of the services. Try to enter any of them in the MSDN search. What do you see? Do they tell you what that service does? How is it started? How can you stop it?

    Do you still praise MSDN so high when you see that they don't even tell you the basics?

    1. Re:Is that so? by ergo98 · · Score: 1

      The "basics"? The reality is that such information is seldom of any purpose whatsoever for a software developer (though I'm curious which service in particular you can't find: Of the Microsoft installed services, all will get multiple hits in the MSDN library, and of course the MSDN Library clearly documents how services are launched from HKLM\System\CurrentControlSet\Services, so the specific documentation is hardly necessary). Win32 provides APIs that you interact with, and the particular implementation on a specific Win32 implementation is largely irrelevant.

      Of course, the MSDN is for developers and it presumes a rudimentary "Windows Business OS' For Dummies" type knowledge: One should already know how services work and how to perform basic system administration.

    2. Re:Is that so? by ochinko · · Score: 1

      The "Black Box" approach towards a library, or a daemon works fine most of the time but what happens when you have to resolve conflicts? What happens when you want to shut down (even temporarily) a service or two but don't know which one? And even when your guess is right, you still don't know what other things that service provides?

      Whether you admit it or not, Windows developers are forced to treat the whole system as a Black Box. I find it a pathetic excuse to say: "We provide all the information you need except for what we consider irrelevant for the developers. What you don't find in the MSDN clearly belongs to the Dummies category. You don't want to look like a dummy, do you? Now, stop asking!"

  41. Irregular Expressions... by cirby · · Score: 2

    Regular expressions are old hat. I'm much more interested in the advances in irregular expressions, as used in the old Firth and Pasquale languages.

    But, of course, everyone knows that a real coder uses irregexps in disassembly language.

  42. Re:Now, if only Google would support regexp search by Dausha · · Score: 1

    tough competitive force. . . . It's non-traditional, it's free and it's cheap", Steve Ballmer about Linux

    Hmm, methinks Steve is using cheap as in "of inferior quality or worth" or "contemptible because of lack of any fine, lofty, or redeeming qualities." Webster's. Personally I think the word is best applied to Windows.

    --
    What those who want activist courts fear is rule by the people.
  43. may I please? by cr@ckwhore · · Score: 2

    I'd like to take all my existing regular expressions and run them through regular expressions to turn them into new age regular expressions. Can I do this, or will the universe implode?

    --
    Skiers and Riders -- http://www.snowjournal.com
  44. and I just bought it by larry+bagina · · Score: 1
    Seriously! I first read this book about 4 years ago, from the University library. A month ago, I bought my own copy, partly because it's a good book, partly to qualify for a discount at an online store.

    An essential book if you ever use perl, php, e?grep, sed, awk, vi, or a number of other programs.

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

  45. Reminds me of those "original and best" adverts... by deepchasm · · Score: 1


    Regular expressions - making line noise useful since 1956!

    Julian

    (btw, it's an "Internet RFC standard compliant email address matcher")

  46. My request by God!+Awful · · Score: 2

    You know what I'd like? A regex syntax I could use in shell scripts that would take less time to debug than the equivalent C++ program.

    -a

    1. Re:My request by PythonOrRuby · · Score: 2

      Perl 6 is your language then. You won't be able to directly use it in shell scripts, but once you have Perl figured out, you won't mind that anyway. ;-)

      Having whitespace be insignificant by default should help a great deal with readability, as will the efforts to make regex syntax more consistent. The ability to embed Perl 6 objects into regular expressions should also lead to some interesting developments.

    2. Re:My request by realdpk · · Score: 2

      This isn't regexes but you might find it useful. In sh, at least on FreeBSD and I believe on Linux (bash) you can use ## and %% to strip out various parts of a variable's contents. Such as:

      $ foo=bar
      $ echo ${foo##ba}
      r

      Very useful stuff..

    3. Re:My request by God!+Awful · · Score: 2

      I write a few shell scripts in Perl, mostly when I need a hash table, but because I don't use it often it still takes a long time to research and debug. The problem with scripting languages is that they are meant to be compact so the syntax is so crazy.

      I always want to do a simple search and replace in a shell script ala echo "$TEXT" | sed "s/$FILENAME/xyz/". But filename is bound to contain some control characters, such as '/' or '.'. I end up using "s,$FILENAME,xyz,", but every once in a while I still get strange results. Can Perl do any better?

      -a

    4. Re:My request by Anonymous Coward · · Score: 0

      Yes. s/\Q$FILENAME\E/xyz/ will do your substitute correctly no matter what funky characters are in $FILENAME.

      And if you don't want a crazy syntax, look at Ruby. In my experience more compact than Perl, but with a much cleaner syntax.

    5. Re:My request by PythonOrRuby · · Score: 2

      Python's re.py module also allows for escaping characters that have meaning in regular expressions, for what it's worth.
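
      A minimal sketch of that escaping in Python, assuming the goal is the same literal search-and-replace as the sed example above. The helper name replace_literal and the sample paths are made up for illustration; re.escape and re.sub are the standard library calls.

```python
import re

def replace_literal(text, filename, replacement):
    """Replace every literal occurrence of filename in text."""
    # re.escape() backslash-escapes regex metacharacters such as '.',
    # which plays the same role as \Q...\E in the Perl answer above.
    pattern = re.escape(filename)
    # Passing a function as the replacement keeps backslashes and group
    # references in `replacement` from being interpreted by re.sub().
    return re.sub(pattern, lambda match: replacement, text)

line = "copied /tmp/a.b/file.txt and /tmp/aXb/fileXtxt"
result = replace_literal(line, "/tmp/a.b/file.txt", "xyz")
print(result)
```

      Without the escaping, the '.' characters would match any character, so the second path would be replaced as well.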

    6. Re:My request by God!+Awful · · Score: 2

      Thanks. That will come in very handy.

      -a

    7. Re:My request by whizzmo · · Score: 1

      in perl:

      #!/usr/bin/perl
      $filename=shift;
      $from=shift;
      $to=shift;

      unless (-e $filename) {print "$filename doesn\'t exist! Aiee!\n"; exit;}
      open(WHIN,$filename) or die "unable to open $filename for read!\n";
      @inlines=<WHIN>;
      close(WHIN);
      foreach (@inlines)
      {
      s/$from/$to/eig;
      print "$_";
      }
      print "Done!\n";

      Or, if you wanted to read from STDIN (as you mentioned)
      like this:


      some_cmd_name | perl snr.pl

      you could do:

      #!/usr/bin/perl
      $from="hot";
      $to="cold";

      while ( <STDIN> )
      {
      s/$from/$to/eig;
      print "$_";
      }
      print "Done!\n";

      Hope this answers your question.

      --
      nuclear presidential echelon assassination encryption virulent strain
      Whizzmo
  47. regexp criticism by Tablizer · · Score: 2

    (* Over the course of my career I have come to the rather firm opinion that you are not worth much as a coder if you do not know regular expressions. *)

    That can be said about anything. IMO, many OOP fans were simply crappy at procedural/relational programming and design (either due to lack of training, or a non p/r mind). The faults they often find with p/r are their own bad thinking about p/r, and not OO's strengths.

    I think reg.ex's would be easier to learn and read and remember if they were broken down into user-definable chunks of some kind. It could be more like defining a generative grammar (substitution): you define the symbols rather than live with what Larry Wall or whoever picks. A special set of functions or operators would simplify the defining of the symbol sets.

    Further, I would like to see the pieces parsed into a table (or some easy-to-navigate structure) so that second passes can be done. In other words, divide up per-character parsing and per-token parsing.

    I admit that it may not be as compact as regexp's, but it would be easier to read for those who don't need it every day.

    Regular expressions come across as a stringy diarrheic glob of an irreducible mess of symbols if you don't keep up. It is like forgetting to ride a bicycle if you do not do it every 3 months or so to refresh.

    I realize that everybody is different, and what bothers me may not bother others. I just don't personally like the approach regexp's took. I would like to see it broken down into clearer chunks. IOW, the syntax would (clearly) dictate the chunks instead of running the rules in one's head to find the boundaries and context.

    I know I will get called a bunch of names for saying this all, but that is my opinion, take it or leave it.

    1. Re:regexp criticism by Tablizer · · Score: 2

      Here is kind of an example of what I am
      envisioning.

      The Perl version comes from:

      http://txt2regex.sourceforge.net/

      ### date LEVEL 3: mm/dd/yyyy: matches from 00/00/1000 to 12/31/2999

      RegEx perl: (0[0-9]|1[012])/(0[0-9]|[12][0-9]|3[01])/[12][0-9]{3}

      My re-work of it:

      symb(h, "A", Symb_numRange(1,12));
      symb(h, "B", Symb_numRange(1,31));
      symb(h, "C", Symb_numRange(1000,2999));
      isGoodDate = symb_Match(h, checkMe, "A/B/C");

      Here "h" is the symbol set storage handle.
      OOP langs would probably have it on the left side as an object.
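
      One way the idea could be sketched in Python. This is not an existing library: symb_match and the SYMBOLS table are hypothetical names that mirror the pseudo-API above, and the sketch assumes each symbol appears at most once in the template. Note that it rejects month 00 and day 00, which the Perl pattern above accepts.

```python
import re

# Hypothetical symbol table: name -> (digit count, lowest value, highest value).
SYMBOLS = {"A": (2, 1, 12), "B": (2, 1, 31), "C": (4, 1000, 2999)}

def symb_match(symbols, text, template):
    """Match text against a template such as "A/B/C" using numeric ranges."""
    parts = []
    for ch in template:
        if ch in symbols:
            width = symbols[ch][0]
            # Each symbol becomes a named group of exactly `width` digits.
            parts.append(r"(?P<%s>\d{%d})" % (ch, width))
        else:
            # Everything else in the template is matched literally.
            parts.append(re.escape(ch))
    match = re.fullmatch("".join(parts), text)
    if match is None:
        return False
    # The range check is done numerically instead of in regex syntax.
    return all(symbols[name][1] <= int(value) <= symbols[name][2]
               for name, value in match.groupdict().items())

is_good_date = symb_match(SYMBOLS, "12/31/2999", "A/B/C")
print(is_good_date)
```

      The regex only handles the shape of the string; the per-token range checks happen in a second pass, which is the split between per-character and per-token parsing described in the parent comment.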

    2. Re:regexp criticism by thrig · · Score: 3, Insightful

      Sounds kind of like what the Regexp::English perl module does.

      You may also want to look at the YAPE::Regex series of modules that allow parsing/extracting/explaining of regex.

    3. Re:regexp criticism by ProfKyne · · Score: 2

      Regular expressions come across as a stringy diarrheic glob of an irreducible mess of symbols if you don't keep up. It is like forgetting to ride a bicycle if you do not do it every 3 months or so to refresh.

      Not to pick nits, but the expression "it's like riding a bicycle" implies that once you learn how to ride a bicycle, you never forget, no matter how long you go without actually riding one.

      --
      "First you gotta do the truffle shuffle."
    4. Re:regexp criticism by Tablizer · · Score: 2

      (* Not to pick nits, but the expression "it's like riding a bicycle" implies that once you learn how to ride a bicycle, you never forget *)

      Well, I am suggesting that it is *not* like a bicycle. The rules and symbols don't "stick" very long if you don't use regex's very often. At least not in my head.

      Actually a few years ago I tried riding a bicycle after about a 10-year absence. I almost fell over because my weight distribution was "different"[1] later. My brain did not know how to balance the new weight.

      [1] Euphemism for "fatter"

    5. Re:regexp criticism by ProfKyne · · Score: 1

      I almost fell over because my weight distribution was "different"

      Right there with ya buddy.

      --
      "First you gotta do the truffle shuffle."
  48. Yes Indeed by PatientZero · · Score: 2
    Regular expressions are one of those tools that I end up teaching to anyone who doesn't know them whenever I start a new job. I don't use them in many of my applications, but I use them to write my applications and build tools. I follow the philosophy of building tools to solve problems knowing I'll need to solve the same problem again and again.

    Another tool is shell scripting. At a past company Symantec Cafe was used for developing a Java application. When I joined, I immediately created shell scripts for myself to do automated builds for a couple reasons:

    • Cafe's editor, while nice, was not up to par for me -- it slowed me down too much.
    • I multitask a lot when I'm working, and having multiple shells open at once doing builds et al is handy.
    • The editor I use on Windows, CodeWright, lets you call batch files (and thus shell scripts through Cygwin) for CVS and compilation.
    • Cafe didn't (and still doesn't?) do automated builds, nor does it run on Linux.

    I showed others how to use them, but only one other developer took the time to get used to it, never having used a shell before. The others complained that they shouldn't have to learn a new tool (shells and scripts) when Cafe sufficed. I explained the advantages, but to no avail.

    Well, a few months later we finally hired a real QA and release engineer. Since we were building a J2EE application to run on Linux in testing and Solaris in deployment, we needed automated builds on Unix. There was a huge rush to get everyone up to speed on the new build system using shell scripts.

    Hmm, that was a bit long-winded just to make the point that there are many useful tools to developers that don't involve the actual code they write. I've used regexps to create SQL data files and config files as mentioned. You'll learn many things, so keep open and don't stop learning. :)

    --
    Freedom to fear. Freedom from thought. Freedom to kill.
    I guess the War on Terror really is about freedom!
  49. Regex Accelerator! by Anonymous Coward · · Score: 3, Informative

    For the ultimate in regex'ing ... hardware regex accelerators!!!

  50. I diasgree completely. by lars · · Score: 1

    If you have a brain, none of the things you mention is a "basic skill". They are merely trivial implementation details. The only basic skills for a programmer are a) problem solving skills, and b) communication / interpersonal skills.

    So don't say you know how to write regular expressions and expect anyone to care. It's really no more important than knowing the Windows API, for instance. In either case, any programmer worth his/her salt can learn the required details with minimal effort. Now, if you can implement regular expressions, then I might be mildly impressed.

    1. Re:I diasgree completely. by bellings · · Score: 2

      Now, if you can implement regular expressions, then I might be mildly impressed.

      Who the hell can't implement regular expressions? It's pretty damned easy to do -- Kleene's Theorem isn't exactly rocket science.

      --
      Slashdot is jumping the shark. I'm just driving the boat.
  51. Re:Reminds me of those "original and best" adverts by BJH · · Score: 1

    And it doesn't help you one bit, since you've got no way of verifying whether that mail address actually exists, even if it is RFC-compliant.

  52. Extra Boat Payments now? by SN74S181 · · Score: 1

    When I bought 'Mastering Regular Expressions' I figured I was buying a book that wouldn't go out of date. Like all the classic O'Reilly books, it wasn't something trendy, it didn't have a software version number in the title....

    Something makes me suspicious that the author just bought a new power boat and 2nd Edition buyers are paying for it.

  53. Goes to prove things never change for the better by Anonymous Coward · · Score: 0
    After reading the article, I came away with the impression that what is "new" in regexps is only a lot of frou frou and junk. It had little to do with regular expressions per se, but with all the new junk "technology" and half-baked scripting languages which have appeared in the last couple of years.

    If you are interested in regular expressions, better grab a copy of the old edition before it becomes unavailable. The new edition is not so much about regular expressions, upon which it hardly touches, but rather on the latest buzzword compliant scraps from the www rag pile.

  54. K shall rule the world by Jayson · · Score: 1, Offtopic

    when people become intelligent enough to use it and Arthur finishes K4. Watch Kuro5hin within the next month for an Introduction to K to appear (I will also submit it here, but I doubt it will get posted).

  55. Re:Now, if only Google would support regexp search by larry+bagina · · Score: 1

    that link only considers non-deterministic regexps. If you don't need backreferences, you can do deterministic pattern matching, which is inexpensive once the regexp is compiled into a lookup table. I think ACM from a couple months ago had an article on some research into deterministic regular expression matching with limited backreference support. Anyhow, if you read the google white papers and the founders' graduate student research on searching, it's fairly clear they do hierarchical keyword indexing, which is good for fast lookups, but the data isn't well formatted for regular expression processing. You could write a tool to use the google api to do a preliminary search on constant words in your regexp and then have your client do a regexp search on the results. Hmm, I may have to try that for the next google programming challenge :) (GNU egrep does something similar by doing an fgrep on static terms before doing the full-blown regexp search)
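
    A rough sketch of that two-stage idea in Python. It is not how GNU egrep or Google actually implement it: prefiltered_search is a hypothetical helper, and it assumes the caller supplies a literal that every possible match of the pattern must contain.

```python
import re

def prefiltered_search(lines, pattern, literal):
    """Return lines matching pattern, skipping lines that lack literal."""
    regex = re.compile(pattern)
    hits = []
    for line in lines:
        # Stage 1: cheap fixed-string test, like an fgrep on a static term.
        if literal not in line:
            continue
        # Stage 2: the full regexp search, run only on the survivors.
        if regex.search(line):
            hits.append(line)
    return hits

docs = [
    "a fast regex engine",
    "the regular expression engine in Perl",
    "a search engine",
    "regexps everywhere",
]
hits = prefiltered_search(docs, r"reg(ular )?ex(pression)?s? engine", "engine")
print(hits)
```

    The preliminary keyword search against an index would play the role of stage 1, with the client running stage 2 on whatever comes back.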

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

  56. Re:Ewww... by Anonymous Coward · · Score: 0

    Yeah, and so what if it's flamebait!?!?! Better than being a .NET use'n flamer, or one of those poor SOBs that got sucked into java, sun java or MSjava, or whatever the hell it is today. Talk about a waste of time. That language is responsible for far too many bugs where I work, and it's gotta be a hundred times better than .NET.

    Hell, I'd use VB before either .NET or java, and given the crappy nature of VB that says a lot. If you gotta make web apps, use the good ol duct tape known as Perl, and if you can't do something with it struggle with java. Some of it works well. However, it tends to be only in what Perl cannot do.

  57. How is this a troll? by Anonymous Coward · · Score: 0

    Fucking mods...I suppose none of you modding this as a troll have written any Win32 code/used MSDN at all. I'm not a huge Microsoft fan, but I really am forced to admit that they do a very good job documenting their APIs. Yet more misinformed moderation based purely on assumptions and speculation. No wonder no one comes here anymore...

  58. regexp are way overrated by The+Cookie+Monster · · Score: 5, Informative
    I know and use regular expressions, but use of regular expressions is often symptomatic of poor design; this makes me somewhat suspicious of those who live and breathe regexp's and preach them to the world.
    • Text processing - why isn't your text marked up? Converting data into text, passing it along, and then trying to pluck the data back out of the text is brittle and leaves you with a system that can't be upgraded - your components can't be improved to produce a more informative text stream as it will break all the regexpr's of all the components that use that stream etc.

      Text straight from the keyboard of a user won't be marked up and seems a good place to be using regular expressions. Due to the popularity of brittle and unupgradable (is that a word?) text processing, the input from other programs might not be marked up either, here regexprs are necessary (ie symptomatic of poor design, but it wasn't your decision).

    • Parsing - how many times have you encountered a HTML or XML parser written with a regexpr? Unless your job requires you code by the seat of your pants, this is just plain lazy. Parsers written with regular expressions are always incomplete (ie they work on the subset of HTML/XML they were tested on, and if the requirements or layout ever changes they break), and they are very slow compared to a proper parser. Proper robust and well tested parsers are available under most licenses and for most languages.

      This applies to much more than just HTML or XML, eg if you're going to write a javadoc clone for your pet language, do it properly, don't do it with regular expressions.

    • Development - Regular expressions appear to be developed with a 'try it and see' methodology - people write the regexpr and test it, thinking if it works then they must have done it right. This is very brittle. I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed - there's always some guy with a % in their email address or some other oddity the author of the regexpr forgot or didn't know about (and let's not even think about trying to make an RFC compliant email address regexpr, it would have to handle "blarg@wibble"@slashdot.org)

      That HTML tag stripper you hacked up, did you remember to handle comments? Just because there weren't any comments in the HTML it was tested on doesn't mean it'll never encounter them in the real world (wouldn't be an issue if an off the shelf parser is used).
    I don't know, there are other issues with regexpressions but I've spent too long on this post already. I'm curious as to others' views on this - I've just come to associate use of regular expressions with flakey or hastily written software.
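
    To make the comment case concrete, here is a small sketch in Python comparing a naive regex tag stripper with the standard library's html.parser. The TextExtractor class, strip_tags helper and sample string are made up for illustration.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes only; tags and comments are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; comments never arrive here.
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

sample = "<p>Hello <!-- a > b --> world</p>"

# The hastily written version: delete anything between '<' and '>'.
naive = re.sub(r"<[^>]*>", "", sample)
parsed = strip_tags(sample)

print(repr(naive))
print(repr(parsed))
```

    The regex stops at the first '>' inside the comment and leaks the rest of it into the output, while the parser treats the whole comment as one unit.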
    1. Re:regexp are way overrated by dragonsister · · Score: 1
      I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed - there's always some guy with a % in their email address or some other oddity the author of the regexpr forgot or didn't know about (and let's not even think about trying to make an RFC compliant email address regexpr, it would have to handle "blarg@wibble"@slashdot.org)

      Actually, my beloved did exactly that for his work. Built a fully RFC compliant regexp for finding out whether or not something was a valid email address, in Perl, by putting together bits of regexp according to the RFC. The assembled regexp (which can be displayed for debugging purposes) takes a full screen (or was it two?). I think the problem you're complaining about is that people get caught writing simple regexps for things that have potentially complex structure. Inadequate or incomplete testing or specification plagues most varieties of programming.

      Rachel

    2. Re:regexp are way overrated by Anthony+Boyd · · Score: 4, Interesting
      Text processing - why isn't your text marked up?

      While you later concede that form input and input from other programs might be good reasons to use a regex, that you would even pose this question is strange. For 90% of the regex fans, form input and screen scraping is exactly what they do. For almost any Web developer, this is the day-in, day-out norm. So your point seems to downplay the very uses that have made regex's so popular.

      I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed

      You realize this does not bolster your claim that regex's are "overrated" -- it merely points out that some developers are overrated. A bad developer does not make a language bad.

      That HTML tag stripper you hacked up, did you remember to handle comments?

      Same as above. You're complaining about human error and then blaming the regex system itself.

      I've just come to associate use of regular expressions with flakey or hastily written software.

      Of course. But the hastily written software is the other software we interact with, not our own. And that's a broad generalization for many developers, so of course you can find exceptions. But you asked for other people's views, and in my view, regex's are sorely needed -- not so bad developers can stay bad, but so that the good developers can clean up the messes left behind after the bad developers go. It's a nice bonus that good regex developers can pull in hostile data, screen scrape, and cleanse form input. That helped one of my employees get a raise last quarter.

    3. Re:regexp are way overrated by Glorat · · Score: 2
      I'm curious as to others' views on this - I've just come to associate use of regular expressions with flakey or hastily written software.

      Hehe, ok, I'll be objective but some personal opinions reign. Most of this is from my personal experience, not textbook stuff.

      Text processing - why isn't your text marked up? Text processing forms the heart and soul of regexps. As you say, any brainful system should never pass text requiring regexps between systems (use markup, structs, whatever). However, at some point, there is usually raw input beyond your control, be it CGI input, keyboard input, non-markup input from a system beyond your control. That's where regexps are used the most (all of ?) the time for me.

      Parsing - how many times have you encountered a HTML or XML parser written with a regexpr?

      Parsing is the next level beyond regexps. You start with the specification and let the implementation arrive from it, like much good development. Indeed, any "parsing" of large, well specified documents (XML, HTML etc) is probably best done by proper parsers. But sometimes, you don't have well specified input at all, or you are just searching for bits out of a document. Now we are back to adhoc text processing where regexps rule. Also, parsers are overkill when we are doing small processing such as reading numeric input.

      (My IMHO) Conclusion: There is some grey (and for me, not a thin) line between text processing and parsers, where you should use regexps or not.

      Development

      A good regexp programmer knows what he is regexping for before he starts. I invariably get things right first time. That they try to parse something that has a specification (email address) without reading the RFC is stupid.

      Now here is the distinction. If something is well specified, there is invariably a perl module to handle it using whatever optimum (hopefully) method is available (XML::Parser, Email::Valid). Regexps are where we are not dealing with standard specifications, perhaps non-formatted data and thus where parsers may not work. And in those cases, without regexps, you'd be in a very lost world and that's perhaps why they are preached so much.

    4. Re:regexp are way overrated by The+Cookie+Monster · · Score: 1
      While you later concede that form input and input from other programs might be good reasons to use a regex, that you would even pose this question is strange. For 90% of the regex fans, form input and screen scraping is exactly what they do. For almost any Web developer, this is the day-in, day-out norm. So your point seems to downplay the very uses that have made regex's so popular
      point and point.

      What I was trying to justify there, was that when you need to use a regexp it may be a subtle hint that there's a design problem somewhere in the system, I didn't mean to imply they had no uses.

      If each time you feel an urge to use a regular expression you ask yourself 'Is this a clue that something is wrong?' then with those examples you can go "no I'm dealing with form input", or "yes, but it's out of my hands".

      In the latter case your regexpr alarm-bell has at least made you think about the fact that you're about to tie your software to a specific version of a specific program.

      They are most definitely useful little beasties and in that sense my reply was somewhat off the topic it was replying to - I too would have to question the experience of a coder who didn't know regular expressions.

      You realize this does not bolster your claim that regex's are "overrated" -- it merely points out that some developers are overrated. A bad developer does not make a language bad.

      Yeah, I think email addresses were a bad example to use - the problem is really that they look deceptively simple, rather than anything to do with regular expressions.

      The point I guess I was wanting to make there was that if you have something that is defined by a grammar of tokens then it's better to parse it with something that works that way. Normally you can take a parser off the shelf, but even if you can't it's much easier to get it right first time and forever when you can code the grammar straight from the specification, as opposed to regular expressions which I believe encourage you to look at what the input normally looks like and construct some pattern to approximate it. (Then confirm the regexpr is correct by testing it with the tiny subset of inputs that you based it off in the first place)

      That HTML tag stripper you hacked up, did you remember to handle comments?
      Same as above. You're complaining about human error and then blaming the regex system itself.
      Nope, the 'human error' would never have occurred if an html parser had been used (the proper tool for the job) instead of a regexpr, they're not that much harder to use (or find).

      But you asked for other people's views, and in my view, regex's are sorely needed.
      Yep, thanks, and I'm not going to disagree with you there. :)

      However, I think that if we portray regular expressions as a sometimes necessary kludge (well, often necessary, depending on what you do), rather than something that separates the uber coders from the novices, then people will think to question what it was that resulted in them wanting to solve a problem with one, and learn from the design decisions of others.
    5. Re:regexp are way overrated by sh4de · · Score: 1
      I've encountered many regexpr's for email addresses, all of them work on your bog standard address, none of them work when deployed - there's always some guy with a % in their email address or some other oddity the author of the regexpr forgot or didn't know about (and let's not even think about trying to make an RFC compliant email address regexpr, it would have to handle "blarg@wibble"@slashdot.org)

      Writing an RFC compliant email address regex is possible but not too feasible. In Jeffrey Friedl's book, there's such a beast. At over 2000 bytes (IIRC), it may very well choke some of the less robust regex engines.

      However, validating an email address can never be done with a regex. You can't tell programmatically if the supplied address is deliverable, and that's what matters in most cases.

      In my CGI scripts, I never make assumptions about how valid an arbitrary email address is. Instead, the script sends mail to it and expects a reply. Only then can I tell the address is indeed valid.

      You can't even assume that a valid address has the '@' character present -- it may be a local address (not that usual though), or a bangpath (even more rarely), but these illustrate the fragility of regex validators.

    6. Re:regexp are way overrated by ProfKyne · · Score: 2

      I know and use regular expressions, but use of regular expressions is often symptomatic of poor design; this makes me somewhat suspicious of those who live and breathe regexp's and preach them to the world.

      I find regexes to be very useful for checking user input in HTML forms. You can do a JavaScript regex check for the user's convenience (so that s/he doesn't need to submit the form to find out that s/he made a mistake or invalid input), then a second check on the server side with whatever server language you are using.

      Skip the JavaScript if you're lazy or in a hurry.

      --
      "First you gotta do the truffle shuffle."
    7. Re:regexp are way overrated by RexRuther · · Score: 1

      I think it is not a problem for regexps being frail and more of a problem that email addresses were allowed to take on so many forms.

      --
      -"The early bird catches the worm, but the late bird sleeps the most"
    8. Re:regexp are way overrated by Ed+Avis · · Score: 2

      I think that your criticisms are criticisms of *string processing*. Indeed, if you are spending most of your time munging strings, you might consider whether a better interface is needed. For machine languages like HTML or C code, you should normally use a parser rather than ad hoc string processing.

      But a lot of stuff does inherently require messing with strings, and for that, the regular expression is a great general-purpose tool. It certainly beats the raw C library :-P.

      --
      -- Ed Avis ed@membled.com
    9. Re:regexp are way overrated by Moeses · · Score: 1

      The first edition of Mastering Regular Expressions contains an RFC compliant email address regex. I hope your beloved realized this was a solved problem with a public domain solution before spending an hour or more on development.

  59. Porn by Anonymous Coward · · Score: 0

    Think about it: when you are looking for wares or porn, where do you go? Perl? Nope. IRC.

    I wrote a perl program to go get porn for me while I sleep.
    I am the saddest person on earth, but at least I have a lot of porn.

  60. Re: ACs and imposters by Abreu · · Score: 2

    Maybe he is an impostor who is trying to get Jeff fired... I mean, would YOU want employees who are willing to spend part of their free time improving the competitor's product?

    I mean, it's perfectly legal, but the point is: How would this look in the Evil Human Resources Dept? Are they going to think about promoting this guy in the near future?

    --
    No sig for the moment.
  61. Wow by Junky191 · · Score: 1

    This is incredibly fascinating. I honestly don't know if I'd rather read this article or study up on Boron.

  62. Well written by wiredog · · Score: 2
    what [struck] me the most, is that it is well written.

    Which is why O'Reilly is the first place I look for a book. The ratio of well-written to badly written books is better there than anywhere else. They are the only books I will order online. All others, I want to page through in a bookstore first.

  63. Can somebody mark parent up as informative? by invalid_user · · Score: 1

    thanks.

  64. Lexer != parser by Anonymous Coward · · Score: 0

    Formal languages forever!

  65. Re:PeRl sux by AriesGeek · · Score: 1

    Sure, if you're only making webpages.

    --
    Insert offensive troll-style sig here. Please mod or respond appropriately.
  66. Re:Now, if only Google would support regexp search by tzanger · · Score: 2

    I'm sorry, but

    <John.+Doe>

    Should not match "JohnDoe", and should match "John Doe". You need one or more characters between John and Doe in that regexp.

    John AND Doe doesn't do shit for you in search engines either. I like the NEAR clause when I am searching for information because I often have to find things like "scanPORT specification" and I end up getting pages talking about a module with scanPORT and the specifications for the module, instead of for scanPORT. Having a NEAR clause or even a <scanPORT.{,30}specification> would help.
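
    A rough sketch of both patterns, using Python's re module purely for illustration. The sample strings are made up, and the upper bound is written as {0,30} because some engines do not accept the {,30} shorthand:

```python
import re

# Patterns taken from the comment above; the sample strings are made up.
NAME = re.compile(r"John.+Doe")
# {0,30} is spelled out because some engines do not accept the {,30} shorthand.
NEAR = re.compile(r"scanPORT.{0,30}specification")

def matches(pattern, text):
    """Return True when the pattern is found anywhere in the text."""
    return pattern.search(text) is not None
```

    With these, matches(NAME, "JohnDoe") is false and matches(NAME, "John Doe") is true, while NEAR only accepts text where at most 30 characters separate the two words.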

  67. Re:Do we need complex acronyms? by poopbot by Anonymous Coward · · Score: 0

    >Basic marketing 101 (and an undergrad course in psychology) would tell them that the normal person is only capable of remembering approximately 7 items of data in their short-term memory, but now we have to remember HTTP, HTML, XML, XSL, DTD, PHP, SSL, DSL, ADSL, ISDN, Perl, etc etc etc

    Um, basic psychology would tell you that HTTP, HTML, XML, etc are only one chunk each for most of us, and in fact, I'd personally chunk html and xml together, too. Oh, and the 'magic number' was 7 _plus or minus two_.

    Miller, G.A. (1956). The Magical Number Seven, Plus or Minus Two: Some Limits on our Capacity for Processing Information. Psychological Review, 63, 81-97.

  68. Re:Contentless article on contentless article by budalite · · Score: 1

    Ah, the "everyday" programming of the "everyday programmer". I would like to see a poll of "everyday programmers" that asks "How often do you use regular expressions?" I would bet a small bundle that most "everyday programmers", which in the real world are mostly business application developers, I think, rarely use regular expressions. Doesn't help much when getting customer info. into or out of a db. It sure doesn't seem to help much in today's "Search engines" that only seem to give me someone else's business card, rather than real info on the subject I've asked about. (Rant pause...)

    Your pet peeve is probably your worst personality trait.

  69. Email regex by neves · · Score: 1
    In my CGI scripts, I never make assumptions about how valid an arbitrary email address is. Instead, the script sends mail to it and expects a reply. Only then can I tell the address is indeed valid.

    So you miss the chance to build a good user interface. If you verify the email address with a regex, you can present a decent error message to a user who has just made a typo. Maybe he typed "username@hotmail.co"; before sending him an email, you can ask for the correct address right away. If the user leaves your site, you won't be able to contact him anymore.

    I go further, even checking for common errors and suggesting the correct address. If the user entered "username@hotmail.com.br" or "www.user@isp.com", I warn that the address is probably incorrect and suggest "username@hotmail.com" or "user@isp.com" before accepting the input.

    A user's email address is a valuable asset, the first step in building a long relationship. Don't miss the chance to get it right the first time.
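
    A minimal sketch of that kind of check, in Python only for illustration. The function names and the two correction rules are hypothetical, and the shape check is deliberately loose rather than RFC-compliant:

```python
import re

# Deliberately loose shape check: one "@", no whitespace, a dot in the domain.
# This is an assumption for illustration, not an RFC-compliant validator.
ROUGH_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def looks_like_email(address):
    """Return True when the address has a plausible user@host.tld shape."""
    return ROUGH_EMAIL.fullmatch(address) is not None

def suggest_email(address):
    """Return a likely correction for a common typo, or None if nothing looks wrong."""
    fixed = address.strip()
    # A leading "www." usually means the user confused an email with a URL.
    fixed = re.sub(r"^www\.", "", fixed, flags=re.IGNORECASE)
    # Hypothetical rule: hotmail addresses with a truncated or extended TLD.
    fixed = re.sub(r"@hotmail\.(co|com\.br)$", "@hotmail.com", fixed,
                   flags=re.IGNORECASE)
    return fixed if fixed != address else None
```

    The suggestion is only offered to the user, never applied silently, since an address that looks like a typo may still be genuine.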

  70. Favorite use of regexp: by Anonymous Coward · · Score: 0
    sub monkey { my $buffer = shift; $buffer =~ s//monkey(getpostext($1))/egs; return $buffer; }

    Utterly simple, but when I tried to rewrite it in C it took me like 200 lines of code, probably more. LONG LIVE THE MONKEY!

    1. Re:Favorite use of regexp: by Anonymous Coward · · Score: 0
      Argh... that's supposed to be

      sub monkey { my $buffer = shift; $buffer =~ s/<%(\w+)%>/monkey(getpostext($1))/egs; return $buffer; }
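
      For comparison, a rough Python sketch of the same recursive substitution. The get_text argument is a stand-in for getpostext, whose real behavior is not shown here, so the lookup is an assumption:

```python
import re

TAG = re.compile(r"<%(\w+)%>")

def monkey(buffer, get_text):
    """Replace each <%name%> tag with its looked-up text, expanding nested tags."""
    # The replacement callback recurses, mirroring the /e flag in the Perl version.
    return TAG.sub(lambda m: monkey(get_text(m.group(1)), get_text), buffer)
```

      Like the Perl one-liner, this recurses forever if a snippet refers back to itself, so it assumes the tags form no cycles.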

  71. Perl regex and the Chomsky hierarchy by rp · · Score: 1
    Actually, a smartass points out that you don't even need Perl 5 to go *beyond* type 2.

    Backreferences (in Perl 4 and some other regex libraries, but not in sed or awk), can express things like

    { xxx | x in \Sigma^+ }

    which is not a type 2 language:

    % cat input
    ab
    abc
    abcabc
    abcabcabc
    abcabcabcabc
    abcabcabcabcabc
    % perl -lne '/^(.+)\1\1$/ and print' input
    abcabcabc
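
    The same check can be sketched in Python, whose re module also supports backreferences. The function and variable names here are only for illustration:

```python
import re

# (.+) captures a non-empty string x; \1\1 demands two more copies of it.
TRIPLE = re.compile(r"(.+)\1\1")

def is_triple(line):
    """Return True when the whole line has the form xxx for some non-empty x."""
    return TRIPLE.fullmatch(line) is not None

lines = ["ab", "abc", "abcabc", "abcabcabc", "abcabcabcabc", "abcabcabcabcabc"]
matching = [line for line in lines if is_triple(line)]
```

    Here matching ends up holding only "abcabcabc", the same line the Perl one-liner prints.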

    In Perl 5, assertions provide another form of context-sensitive matching.

    Meanwhile, I think you're right that Perl 5 regexes cannot express all of type 2,
    unless you cheat by using (?{ code }).

  72. Jeez, that had to hurt! by Ashurnasipal · · Score: 1
    Six months ago I was handed a printed copy of our family
    Holy Toledo. How big was the xerox machine? Was the lid open, or did you all have to be forced through the document feeder?