Slashdot Mirror


Java Regular Expressions

Simon P. Chappell writes "Regular expressions (regex to their friends) are an incredibly powerful addition to most programmers' personal toolkit of techniques. Programming using a language that doesn't support them can be frustrating if you need to do any amount of non-trivial string handling. Java was just such a language until the release of the 1.4.x series. Sure, there were libraries like ORO that would provide regex support, but it wasn't built in and not many companies allow the use of 3rd party libraries. With version 1.4.x, the corporate Java developer in the trench received the power of regular expression pattern matching." Read the rest of Simon's review.

Java Regular Expressions
author: Mehran Habibi
pages: 255 (7 page index)
publisher: Apress
rating: 8/10
reviewer: Simon P. Chappell
ISBN: 1590591070
summary: A great starter for using regular expressions in Java

The book seems targeted towards those who have a solid level of Java programming skills, but who have not yet used the java.util.regex package. I see two types of Java programmers who might not have used the regex package: those who do not know about regular expressions, and those who know them but have not yet used them within Java. This book should satisfy both sets of users. The first group will benefit from the general introduction to regular expressions and the gentle introduction to using them within Java. The latter group will benefit from the more advanced material in the book.

The book is nicely structured and progresses easily through its subject matter. The first chapter is an introduction to regular expressions. While this is most obviously for the readers new to the subject, it will be useful for those more experienced, because not all regex engines are created equal and this chapter lays out the particular dialect of regular expressions used by the Java 1.4.x regex engine. The second chapter introduces the object model used by java.util.regex. This gives detailed explanations of the Pattern and Matcher objects as well as the new regular expression methods added to the standard String class.
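
As a rough illustration of that object model (this sketch is not taken from the book; the class name ObjectModelSketch and the sample strings are made up for the example), a Pattern is compiled once, a Matcher walks over the input, and String offers shortcuts such as matches, replaceAll and split:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ObjectModelSketch {
    // Compile the pattern once; Pattern objects are immutable and reusable.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Collect every run of letters found in the input.
    static List<String> words(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(words("regex, to their friends"));
        // String shortcuts added in 1.4; each call compiles its pattern again.
        System.out.println("1.4.2".matches("\\d\\.\\d\\.\\d"));
        System.out.println("a1b22c".replaceAll("\\d+", "#"));
        System.out.println(String.join("|", "a, b,c".split(",\\s*")));
    }
}
```

The String methods are convenient for one-off calls, while an explicit Pattern is probably the better fit when the same expression is applied many times.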

The third chapter takes the reader into advanced regular expressions. While there is much that can be done using just the Pattern and Matcher objects, the path to the full power of regex travels through an understanding of groups (and subgroups) and qualifiers. Regex groups are hard to explain until you've seen them in action, whereupon you may find yourself wondering how you'd ever managed without them before. Mr. Habibi does an excellent job, both explaining them and introducing us to the unusual noncapturing subgroups. (I'd never heard of these before.) Qualifiers are the other side of the same coin with groups. While it's one thing to define a group and whether it's expected and to be captured, it's equally important to be able to describe the expected occurrence of those groups using qualifiers.
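
To make groups and noncapturing subgroups a little more concrete, here is a small sketch of my own (not an example from the book; the class name GroupSketch and the version-number format are assumptions). The `+` and `?` symbols are what the review calls qualifiers and what the Java documentation calls quantifiers:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupSketch {
    // Group 1 is the major version; group 2 is the optional minor version.
    // (?:...) lets the '?' apply to ".minor" as a unit without creating
    // a numbered group for it.
    private static final Pattern VERSION =
            Pattern.compile("(\\d+)(?:\\.(\\d+))?");

    // Returns {major, minor}; minor is null when the input has no ".minor" part.
    static String[] parse(String version) {
        Matcher m = VERSION.matcher(version);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a version: " + version);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(".", parse("1.4")));
        System.out.println(parse("5")[1]);
    }
}
```

A group that takes no part in the match comes back as null from Matcher.group, which is why the minor version needs a null check before use.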

Chapter four tackles the interesting challenges of using regex in an object-oriented language. Mr. Habibi describes the general principles of use of regex as similar to those used with SQL through the JDBC interface. These principles are the optimising of connections, batching reads and writes, storing patterns externally, Just In Time compilation of patterns and remembering that not every piece of String handling code needs to be written as a regex. All very useful advice.
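
One way to read "Just In Time compilation of patterns" is to compile each expression the first time it is needed and reuse it afterwards. The sketch below is my interpretation, not the book's code; the class name PatternCacheSketch and its get method are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

public class PatternCacheSketch {
    // Maps the regex source text to its compiled form.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compile a pattern the first time it is requested and reuse it afterwards.
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    public static void main(String[] args) {
        Pattern first = get("\\d+");
        Pattern second = get("\\d+");
        System.out.println(first == second);
        System.out.println(get("\\d+").matcher("2006").matches());
    }
}
```

The regex strings themselves could come from an external file, which would also cover the "storing patterns externally" advice.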

Chapter five is the big examples chapter. All of the examples are intended to be practical; the kind of thing you might have to address at the day job. With examples covering Zip codes, telephone numbers, dates, searching text files and even validating an EDI document, he seems to have delivered on that assertion. There are further examples in Appendix C, if the aforementioned patterns aren't enough.
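
For a flavour of this kind of everyday validation, here is a minimal sketch of a Zip code check. It is not the book's pattern; it assumes the plain five-digit form with an optional hyphenated four-digit extension, and the class name ZipCodeSketch is made up:

```java
import java.util.regex.Pattern;

public class ZipCodeSketch {
    // Five digits, optionally followed by a hyphen and four more digits.
    private static final Pattern ZIP = Pattern.compile("\\d{5}(?:-\\d{4})?");

    // matches() requires the whole candidate to fit the pattern.
    static boolean isZip(String candidate) {
        return ZIP.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZip("90210"));
        System.out.println(isZip("90210-1234"));
        System.out.println(isZip("9021"));
    }
}
```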

The writing and progression of material are good. The examples are very well thought out and explained. Many of the examples are built from first principles. Mr. Habibi seems to want to not only teach you how to use regular expressions, but also how to design them. He does this by working up from an understanding of the data until he has a working regex.

While it doesn't make any promises about being an encyclopedia of regex patterns, this book does contain enough of the normal business patterns to be a useful initial reference work, before turning to the Internet to search for patterns.

If you want an encyclopedic reference work on regex, then buy Jeffrey Friedl's Mastering Regular Expressions, which is published by O'Reilly. This is not that book, preferring to stick with the practical usage of regex.

This is a great starter book for developers who are new to using regular expressions in Java.

You can purchase Java Regular Expressions from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

35 of 181 comments

  1. When speed matters by SIGALRM · · Score: 3, Informative
    there were libraries like ORO that would provide regex support, but it wasn't built in and not many companies allow the use of 3rd party libraries
    For those who can utilize third-party libs, consider evaluating this DFA/NFA automaton, a regexp package that is significantly faster than java.util.regex.

    However, like many things in computer science, speed gains come at a price. In this case, the regular expression language supported is not quite as rich as the JDK implementation.
    --
    Sigs cause cancer.
    1. Re:When speed matters by Ryan+Amos · · Score: 3, Insightful
      When speed matters

      ...you don't use Java.

      (I know, let the flames commence! :)

    2. Re:When speed matters by The+Snowman · · Score: 5, Informative

      Here is the class I assume the parent is referencing: Formatter class.

      Essentially what happens is you don't have C-style varargs, the JRE silently creates an array for you when you pass the arguments. This doesn't waste "gobs of heap space" like the parent says, it uses the same amount as it would using the stack. Remember, these are objects, and Java never passes objects by value -- always by reference. So each argument wastes one machine word (usually 32 bits). Whoop de fucking doo. And, since it uses references, the only allocation/deallocation is the temporary array. And in 1.5 if not previous versions, this is very very fast. With a JIT compiler you'll hardly notice it. I do agree that the decision to make the class "final" is shitty, but honestly, I don't see how subclassing it would be a huge advantage. It would be like subclassing the java.lang.String class. Sure, you could add some nifty stuff, but it's not a big deal.

      As a person who earns his living off of J2EE, I know its strengths and weaknesses. I am not a fanboy, however. I am more than willing to give Java hell when it deserves it. I think string handling in general is not as well-organized or easy to use as it could be, but it is certainly capable. I rarely use sprintf() style string formatting anyway, even in C++. I find it much easier to use iostreams, which are typesafe and almost as fast as sprintf(). In Java I just use string concatenation, and the formatting classes when I need it. It isn't perfect, but it works well enough and sure isn't slow.

      --
      24 beers in a case, 24 hours in a day. Coincidence? I think not!
    3. Re:When speed matters by CompSciStud4U · · Score: 5, Informative

      I'll take the bait. When Java was introduced in 1995 almost all compiler research had been on static compilation, such as in C or Fortran. When the popularity of Java started to rise a lot of research effort, such as at IBM, was switched over to Just In Time (JIT) compilers. This was a pretty raw field at the time so Java was horribly slow compared to C.

      Fast forward 11 years and the situation is quite different. I'm not sure about the Java compiler that comes distributed with the SDK, but a JIT compiler and virtual machine from another commercial source (I'll just stick with IBM) is now incredibly optimized compared to 1995. Large amounts of research have been done to catch up with the fact that statically compiled languages had a 30+ year headstart. And JIT compiled languages could one day be faster than statically compiled ones due to new dynamic compilation techniques that use system resource data, such as cache misses, collected by the VM to continuously reoptimize portions of the byte code.

      And even the overhead of garbage collection may soon be lowered dramatically due to research at the University of Massachusetts http://www.cs.umass.edu/~emery/pubs/f034-hertz.pdf

      I'm not going to say that Java is faster than C (or in this case Perl, a language specifically designed for parsing regular expressions), but the speed gap between the two is constantly closing to the point where it doesn't really matter that much anymore.

  2. Regular Expression? by silicon-pyro · · Score: 2, Funny

    Me: I'll have a Grande Cafe au Lait please.

    Starbucks Employee: That'll be an hour's wages please.

    Me: Thanks! /me hands over cash, takes careful first sip.

    That's when you get to see my java regular expression.

    Generally it will be me wincing in pain because I just burned my tongue. Sometimes, if it's cooled enough, you'll hear a quiet "MmmMmmm" in the style of Family Guy's Herbert.

  3. Re:Recursion? by SIGALRM · · Score: 5, Interesting
    Regular expressions aren't really meant for recursive solutions, but if we have recursive regular expressions, we can define our balanced-paren expression like this: first match an opening paren; then match a series of things that can be non-parens or another balanced-paren group; then a closing paren. Turned into Perl code, this becomes:

    $paren = qr/\(([^()]+|(??{ $paren }))*\)/x;
    When this is run on some text like
    (lambda (x) (append x '(hacker)))
    the following happens: we see our opening paren, so all is well. Then we see some things which are not parens (lambda ) and all is still well. Now we see (, which definitely is a paren. Our first alternative fails, we try the second alternative. Now it's finally time to interpolate what's inside the double-secret operator, which just happens to be $paren. And what does $paren tell us to match? First, an open paren - ooh, we seem to have one of those handy. Then some things which are not parens, such as x, and then we can finish this part of the match by matching a close paren. This polishes off the sub-expression, so we can go back to looking for more things that aren't parens, and so on.
    --
    Sigs cause cancer.
  4. Not many companies allow 3rd party libraries? by LadyLucky · · Score: 2, Funny

    Are you serious? What kind of company would do that? It's madness!

    --
    dominionrd.blogspot.com - Restaurants on
    1. Re:Not many companies allow 3rd party libraries? by Canthros · · Score: 2, Informative

      It does, however, simplify the legal mess involved.

      --
      Canthros
  5. My main complaint by kbielefe · · Score: 4, Informative

    My main complaint about java regexps is that all the backslashes have to be quoted with a backslash, making them completely unreadable compared to a language that supports regular expressions natively, like perl (no, a standard library is not technically native support). "\d" becomes "\\d" and so forth. Does anyone know a simple way around this? We just started using java regexp's at work, so the extra backslashes don't bother most people, but they are extremely annoying to those of us with a lot of perl experience.
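
    A minimal sketch of the doubling described above (the class and method names are made up for illustration). It does not remove the problem; Pattern.quote only helps when the text to match is meant literally:

    ```java
    import java.util.regex.Pattern;

    public class BackslashSketch {
        // The regex \d+ needs its backslash doubled inside a Java string literal.
        static boolean allDigits(String s) {
            return s.matches("\\d+");
        }

        // To match one literal backslash the regex is \\, written "\\\\" in Java source.
        static boolean hasBackslash(String s) {
            return Pattern.compile("\\\\").matcher(s).find();
        }

        // Pattern.quote treats its whole argument as literal text,
        // so no regex-level escaping is needed for it.
        static boolean hasLiteral(String s, String literal) {
            return Pattern.compile(Pattern.quote(literal)).matcher(s).find();
        }

        public static void main(String[] args) {
            System.out.println(allDigits("2006"));
            System.out.println(hasBackslash("c:\\foo"));
            System.out.println(hasLiteral("a.b", "."));
        }
    }
    ```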

    P.S. How many slashdotters thought they'd be rolling in their graves by the time they heard an example of where perl is more readable than java?

    --
    This space intentionally left blank.
    1. Re:My main complaint by Kesch · · Score: 4, Funny
      P.S. How many slashdotters thought they'd be rolling in their graves by the time they heard an example of where perl is more readable than java?


      I'm still amazed to find 'readable' and 'regular expressions' in the same context.
      --
      If this signature is witty enough, maybe somebody will like me.
    2. Re:My main complaint by Pxtl · · Score: 3, Interesting

      Well, does Java have a facility similar to C#'s @strings? In C#, a string prefixed with @ is literal, much like Python's """ strings - no escape characters. Very handy for regular expressions.

      In general, C#'s regular expression package is very nice, except for the whole "groups" and "captures" thing.

    3. Re:My main complaint by happyfrogcow · · Score: 2, Insightful

      two slashes "\\" is nothing. the real PITA begins when you need to do "\\\\"

      effing java.

    4. Re:My main complaint by masklinn · · Score: 3, Informative

      Actually, Python's literal strings are NOT """.

      """ is for multiline strings (' and " only accept one-line strings or backslash line-continuation escapes); literal Python strings are raw strings, created by prefixing any string (be it ', " or """) with the "r" character (as in r"this is a raw string", but "this is not").

      --
      "The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
    5. Re:My main complaint by _xeno_ · · Score: 4, Informative

      Backslashes in a .properties file have to be escaped with (guess what?) a backslash.

      So it, unfortunately, solves nothing.

      If you don't mind XML, you can use the XML properties format, but you're still adding a lot of extra code just so you don't have to deal with escape characters. There's, unfortunately, no good solution in Java. (There are no raw strings in Java.)
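
      A small sketch of what happens to backslashes when a regex is read from .properties text (the key name "regex" and the class name are assumptions; the text is parsed from an in-memory string rather than a real file):

      ```java
      import java.io.IOException;
      import java.io.StringReader;
      import java.util.Properties;

      public class PropertiesEscapeSketch {
          // Parse text in .properties syntax and return the value stored under "regex".
          static String regexFrom(String fileContent) {
              Properties props = new Properties();
              try {
                  props.load(new StringReader(fileContent));
              } catch (IOException e) {
                  throw new IllegalStateException(e); // not expected for an in-memory reader
              }
              return props.getProperty("regex");
          }

          public static void main(String[] args) {
              // File text: regex=\\d+   (backslash doubled in the file)
              System.out.println(regexFrom("regex=\\\\d+"));
              // File text: regex=\d+    (the single backslash is silently dropped)
              System.out.println(regexFrom("regex=\\d+"));
          }
      }
      ```

      So the file needs the same doubling as Java source, and forgetting it fails quietly rather than loudly.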

      --
      You are in a maze of twisty little relative jumps, all alike.
    6. Re:My main complaint by Deef · · Score: 2, Interesting

      I sometimes do this:

      Pattern foo = Pattern.compile("c:/foo/bar".replace('/','\\'));

      or just put the above in a library method that does it automatically:

      Pattern foo = PatternUtils.compile("c:/foo/bar");

      which is handy if other replacements are made by that library method also:

      Pattern foo = PatternUtils.compile("({number}):{number}:({identifier})-{number}");
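
      The helper itself is not shown here, so the following is only a guess at what such a PatternUtils.compile could look like; the expansions chosen for {number} and {identifier} are assumptions:

      ```java
      import java.util.regex.Pattern;

      public class PatternUtils {
          // Hypothetical helper: '/' stands in for '\', and the {number} and
          // {identifier} placeholders expand to fixed sub-patterns before compiling.
          static Pattern compile(String shorthand) {
              String regex = shorthand
                      .replace("{number}", "/d+")
                      .replace("{identifier}", "[A-Za-z_]/w*")
                      .replace('/', '\\');
              return Pattern.compile(regex);
          }

          public static void main(String[] args) {
              Pattern p = compile("({number}):{number}:({identifier})-{number}");
              System.out.println(p.pattern());
              System.out.println(p.matcher("12:34:foo_1-56").matches());
          }
      }
      ```

      The obvious cost of this shorthand is that a literal '/' can no longer appear in the pattern.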

    7. Re:My main complaint by Chris+Pimlott · · Score: 2, Informative
      If you use Jakarta Commons-Configuration, there's basically no extra code to use XML configuration files.

      For example, the regex defined here:
      <foo>
          <bar>
              <regex>...</regex>
          </bar>
      </foo>
      becomes simply "foo.bar.regex", just like a standard properties file.
  6. Microsoft and regex by truthsearch · · Score: 3, Interesting

    Slightly off-topic, but...

    Back when my only experience was development on Windows I was very frustrated with the lack of good string handling in Microsoft languages (VB, T-SQL). If you didn't find a third-party library you had to write a lot of expensive code to do fancy string searches. Try writing recursion in VB6 without bringing your computer to a screeching halt.

    Then when I switched to linux and open source I was shocked to learn that something as useful as regex had already been around for many years. Most of the Windows developers I knew never even heard of it. It was tricky to learn but has paid off many times over in utility.

    Every developer is better off for knowing it. Even if they never use regex the thought process in understanding it is quite interesting and educational.

  7. What? by avalys · · Score: 2, Interesting

    Sure, there were libraries like ORO that would provide regex support, but it wasn't built in and not many companies allow the use of 3rd party libraries
    Who's boneheaded enough to do this? I want to know so I can avoid buying anything from them, because their products are going to be overpriced by at least 50% due to the wasted effort.

    I can understand restricting third-party libraries to those of a certain license, like BSD or LGPL, but a blanket ban without any exceptions for something as essential as regular expressions? That's just stupid.

    One of the biggest advantages of Java is the enormous number of high-quality third-party libraries available.

    Is this just something the submitter dreamed up to fill space, or do companies actually do this?

    --
    This space intentionally left blank.
    1. Re:What? by kalirion · · Score: 2, Funny

      Pssst, Mr. Secrecy, your blog is showing.

  8. Re:Recursion? by addaon · · Score: 3, Informative

    Of course, things like those presented are not regular expressions, no matter how loose perl might be with the term.

    --

    I've had this sig for three days.
  9. Re:Recursion? by kfg · · Score: 2, Funny

    Be kind to your parens, though they don't deserve it. . .

    KFG

  10. Re:Recursion? by Reverend528 · · Score: 2, Informative
    I tried to do a bit of recursion in regexes once, like ((\d+)\.)+, but that didn't work.

    By definition, Regular Expressions are limited to regular languages, thus can be expressed by Finite Automata. This prohibits them from supporting recursion, but generally makes them easy to optimize.
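
    One plausible reading of the quoted complaint (an assumption on my part) is that the inner group was expected to hold every number. In java.util.regex a quantified group keeps only its final iteration, so a find() loop is the usual workaround. The class and method names below are made up:

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RepeatedGroupSketch {
        // A quantified group keeps only the text from its final iteration.
        static String lastIteration(String input) {
            Matcher m = Pattern.compile("((\\d+)\\.)+").matcher(input);
            return m.matches() ? m.group(2) : null;
        }

        // To see every number, match one element at a time with find().
        static List<String> allNumbers(String input) {
            List<String> numbers = new ArrayList<>();
            Matcher m = Pattern.compile("(\\d+)\\.").matcher(input);
            while (m.find()) {
                numbers.add(m.group(1));
            }
            return numbers;
        }

        public static void main(String[] args) {
            System.out.println(lastIteration("1.22.333."));
            System.out.println(allNumbers("1.22.333."));
        }
    }
    ```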

  11. Re:Recursion? by Anonymous Coward · · Score: 4, Informative

    Regular expressions are only for regular languages. They are the simplest type of language and use a simple state machine (automaton) to do their language recognition.
    Context free languages may have recursion. They use a state machine (pushdown automaton) and a stack to recognize their languages.
    http://en.wikipedia.org/wiki/Context-free_language
    This also contains links to other families of language and info on the automaton that can recognize them.
    Welcome to Theory of Computing!
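
    As a rough sketch of the pushdown idea applied to balanced parens: with only one kind of bracket, the stack can be reduced to a depth counter. The class name is made up, and this is plain Java rather than a regex:

    ```java
    public class BalancedParensSketch {
        // Track nesting depth; a single counter stands in for the stack
        // because there is only one kind of bracket to remember.
        static boolean isBalanced(String text) {
            int depth = 0;
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    if (--depth < 0) {
                        return false; // closing paren with nothing open
                    }
                }
            }
            return depth == 0;
        }

        public static void main(String[] args) {
            System.out.println(isBalanced("(lambda (x) (append x '(hacker)))"));
            System.out.println(isBalanced("(()"));
        }
    }
    ```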

  12. Re:Wha-wha-what? by JoshDM · · Score: 2, Insightful

    ...and not many companies allow the use of 3rd party libraries.
    Who are these companies and what can possibly be their justification for such a blanket policy.

    Actually there are a number of firms that contain multitudes of red tape that disable their employees from getting anything done without the barest of tools. I have witnessed major separations of "church and state" with these larger companies. This includes the company that did not allow the developers access to the servers, resulting in a system administrator who refused to allow a Java web server more powerful than JServ because he didn't know how to properly install Apache/Tomcat/JBoss/Whatever on Linux.

    More recently, it's a concern with larger companies that want "someone to blame" and "someone to call for support." These places use "Websphere" instead of "Eclipse and Tomcat" or "Oracle JDeveloper" instead of "Borland JBuilder". Wherever there is a "free" version of something that is supported by a community effort, there is a "pay" edition of that same item (usually 1-2 versions behind the curve) hosted by a company that sells support and takes the blame.

  13. Re:Wrong way round by smallfries · · Score: 2, Informative

    I'm not sure if you got the parent's point (apologies if you did). By trivial string handling he's talking about recursive structures, and the erroneous strings he's mentioning are probably programs as input to a compiler. The 'non-trivial' strings are the class of strings that you would need a full grammar in order to parse, rather than a reg-exp. But yeah, not every time - horses for courses and all that.

    --
    Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
  14. regex coach by mgkimsal2 · · Score: 3, Informative

    I spoke about the "regex coach" tool from http://weitz.de/regex-coach/ on my podcast (shameless plug!) http://webdevradio.com/ - it's a great tool for helping visually walk through the regex creation process, especially for complex needs.

    1. Re:regex coach by sickofthisshit · · Score: 2, Informative

      This tool, by the way, was written in Common Lisp, using Edi's own library

      CL-PPCRE - portable Perl-compatible regular expressions for Common Lisp

      A library which typically outperforms Perl's own regex engine.

  15. Re:Wrong way round by smittyoneeach · · Score: 3, Insightful

    I would assert that if your input data are sufficiently irregular that you require a parser/lexical analyzer, you may have exceeded the bounds of "regular" expressions.

    --
    Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
  16. RegEx not so maintainable... by Heembo · · Score: 2, Interesting

    One of the reasons we as programmers write code is to take a very complex idea, like a software application, and write something that a human engineer can understand. The KISS principle especially applies to coders.

    As I get older, my code has gotten more and more straightforward, because I consider the maintenance cycle of code to be more than 95% of the puzzle. And these days, I have more than one security analyst who is not a senior software engineer poking around my code.

    RegEx's are not-so-readable and not-very-maintainable programming abstracts that should be avoided whenever possible. I prefer using string manipulation abstraction classes (such as my own version of StringTokenizer). They are not as fast and furious as other methods like lexical analysis, and the code is more bloated, but the code is Straight Forward And Easy To Read. There is a power in code of this nature, and my clients have thanked me more than once for not focusing on writing "cool code" but on writing "clean and simple" code. I just tried to paste in a few ugly regex samples, but slashdot blocked me, calling them "junk characters". I agree! :)

    For example, take XPATH, this is a clean and simple way to address XML objects. Sure, there is an additional level of abstraction, but you can look at an XPATH query, even from a layman's point of view, and have a clear understanding as to what it is doing.

    --
    Horns are really just a broken halo.
    1. Re:RegEx not so maintainable... by mongus · · Score: 2, Insightful

      I used to think the same thing. Back in '99 a guy I was working with would produce a regex and I had no idea what that strange looking thing did. I got a book on Perl and spent quite a bit of time wrapping my head around regular expressions. That's probably the only thing I retained from Perl because I really don't like the language. I started using the ORO package in Java to do regular expressions and switched to the standard library when it was introduced in 1.4. Java's syntax is nearly identical to Perl's.

      If you'll take the time to understand them you'll never go back to parsing strings yourself. They can make your code MUCH easier to maintain. There is a steep learning curve but they're well worth learning. Your code will be much more readable with a regular expression instead of lines and lines of code. Debugging is much easier too.

      Maybe you should give the reviewed book a shot. I can't comment on it as I've never read it but I do highly recommend learning regular expressions.

  17. Re:Java sucks by vingilot · · Score: 2, Informative

    Come on:
    "Some String".replaceAll("Java", "Bloated piece of shit")

    And FYI PatternSyntaxException is a runtime exception so no need to catch it and rethrow as a RuntimeException.

    so to write it your way:

    String theTruth(String s){
            return Pattern.compile("Java").matcher(s).replaceAll("Bloated piece of shit");
    }

  18. Re:Java sucks by computational+super · · Score: 3, Funny

    Oh, I think you're hardly being fair to Java - your example was artificially bloated. I can easily do this in one line in Java:

    Runtime.getRuntime( ).exec( "perl -e 'sub theTruth($) { shift; $_ =~ s/Java/Not so bad now/; return $_; }" );

    I think you owe Java an apology.

    --
    Proud neuron in the Slashdot hivemind since 2002.
  19. Re:Java sucks by Derkec · · Score: 4, Informative
    You don't have to throw anything there, you should just have one clear return in your method. You also probably shouldn't be compiling your pattern every time.

    Try:
    // "Java" is a valid pattern, so compiling it once in the initializer needs no try/catch.
    private static final Pattern pattern = Pattern.compile("Java");
     
    public String theTruth(String string) {
      Matcher matcher = pattern.matcher(string);
      return matcher.replaceAll("something I don't know jack shit about");
    }
    Still not as compact but at least there aren't any tildes in there. I wonder if there would be a more compact way to do it. This seems terribly heavy weight for such a simple example. Oh, wait! There is!
    public String theTruth(String string) {
      return string.replaceAll("Java", "this is really easy");
    }
    So now we compare:
    public String theTruth(String s) { return s.replaceAll("Java", "this is easy"); }
    To:
    sub theTruth($) { shift; $_ =~ s/Java/Bloated piece of shit/; return $_; }
    So the Java code ends up being a handful of characters longer and much easier to read. I'm not saying that Java is the ideal Regex language, but your example sucked.
  20. Re:Rapid Java Regex Prototyping by Abcd1234 · · Score: 2, Insightful

    Am I the only one that finds it quite easy to get regexs right just by, you know, typing them in? If a regex fails for me, 99% of the time, it's because my input data is in a different format from what I expected. But I've almost never needed any kind of "explorer" tool... that smacks of "tweak it until it works", which is never a good idea, IMHO...

  21. Re:Java sucks by cowboy76Spain · · Score: 2, Insightful

    Apart from the fact that your code is the worst that you can write when using RegEx in Java (as pointed out by another post, RTFApi doc if you want to use Java properly), it amuses me that you are complaining that Java (a language designed for using strong OO and being multiplatform) is slower than Perl (a language designed for processing regular expressions).

    You could have said also that the Fire Department sucks because they are not good at catching burglars, or that the Police Department is full of losers because they can not put down a fire. Myself, I will keep using the FD to deal with fire and the PD to deal with crimes.

    --
    Why can't /. have a rich-text editor? Editing your own HTML is so XXth century.