Java Regular Expressions
Simon P. Chappell writes "Regular expressions (regex to their friends) are an incredibly powerful addition to most programmer's personal toolkit of techniques. Programming using a language that doesn't support them can be frustrating if you need to do any amount of non-trivial string handling. Java was just such a language until the release of the 1.4.x series. Sure, there were libraries like ORO that would provide regex support, but it wasn't built in and not many companies allow the use of 3rd party libraries. With version 1.4.x, the corporate Java developer in the trench, received the power of regular expression pattern matching." Read the rest of Simon's review.
Java Regular Expressions
author: Mehran Habibi
pages: 255 (7 page index)
publisher: Apress
rating: 8/10
reviewer: Simon P. Chappell
ISBN: 1590591070
summary: A great starter for using regular expressions in Java
The book seems targeted towards those who have a solid level of Java programming skills, but who have not yet used the java.util.regex package. I see two types of Java programmers who might not have used the regex package: those who do not know about regular expressions, and those who know them but have not yet used them within Java. This book should satisfy both sets of users. The first group will benefit from the general introduction to regular expressions and the gentle introduction to using them within Java. The latter group will benefit from the more advanced material in the book.
The book is nicely structured and progresses easily through its subject matter. The first chapter is an introduction to regular expressions. While this is most obviously for the readers new to the subject, it will be useful for those more experienced, because not all regex engines are created equal and this chapter lays out the particular dialect of regular expressions used by the Java 1.4.x regex engine. The second chapter introduces the object model used by java.util.regex. This gives detailed explanations of the Pattern and Matcher objects as well as the new regular expression methods added to the standard String class.
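For readers who have not seen the object model mentioned above, here is a minimal sketch of Pattern, Matcher and the String convenience methods. It is not taken from the book; the class name RegexBasics and the sample strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexBasics {
    // Hypothetical sample pattern: one or more digits, compiled once and reused.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns the first run of digits in the input, or null if there is none. */
    static String firstNumber(String input) {
        Matcher m = DIGITS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("order 66 shipped"));               // 66
        // The String methods take the regex as a plain string on each call.
        System.out.println("2004-01-15".matches("\\d{4}-\\d{2}-\\d{2}"));  // true
        System.out.println("a1b22c".replaceAll("\\d+", "#"));              // a#b#c
        System.out.println(String.join("|", "a, b,c".split(",\\s*")));     // a|b|c
    }
}
```

The String methods are handy for one-off checks; a precompiled Pattern is the usual choice when the same expression is applied many times.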
The third chapter takes the reader into advanced regular expressions. While there is much that can be done using just the Pattern and Matcher objects, the path to the full power of regex travels through an understanding of groups (and subgroups) and qualifiers. Regex groups are hard to explain until you've seen them in action, whereupon you may find yourself wondering how you ever managed without them. Mr. Habibi does an excellent job, both explaining them and introducing us to the unusual noncapturing subgroups. (I'd never heard of these before.) Qualifiers are the other side of the same coin. While it's one thing to define a group and whether it's expected and to be captured, it's equally important to be able to describe the expected occurrence of those groups using qualifiers.
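A small sketch may make groups, noncapturing subgroups and qualifiers (more often called quantifiers) concrete. The phone-number format and the class name GroupDemo are assumptions chosen for illustration, not an example from the book.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // (?:1-)?  noncapturing subgroup, made optional by the ? qualifier
    // (\d{3})  capturing group 1: exactly three digits
    // (\d{4})  capturing group 2: exactly four digits
    private static final Pattern PHONE = Pattern.compile("(?:1-)?(\\d{3})-(\\d{4})");

    /** Returns {exchange, line} for input like "1-555-1234" or "555-1234", else null. */
    static String[] parts(String input) {
        Matcher m = PHONE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] p = parts("1-555-1234");
        System.out.println(p[0] + " / " + p[1]); // 555 / 1234
    }
}
```

Because the "1-" prefix sits in a noncapturing subgroup, it does not shift the group numbers: group 1 is the exchange whether or not the prefix is present.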
Chapter four tackles the interesting challenges of using regex in an object-oriented language. Mr. Habibi describes the general principles of regex use as similar to those used with SQL through the JDBC interface. These principles are the optimising of connections, batching reads and writes, storing patterns externally, Just In Time compilation of patterns and remembering that not every piece of String handling code needs to be written as a regex. All very useful advice.
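Two of those principles, storing patterns externally and compiling them just in time, can be sketched roughly as follows. This is my own reading of the advice (compile on first use, then cache), not the book's code; PatternStore and the "zip" key are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

final class PatternStore {
    private final Properties source = new Properties();
    private final Map<String, Pattern> compiled = new ConcurrentHashMap<>();

    /** Takes properties-format text; a real application would read it from a file. */
    PatternStore(String propertiesText) {
        try {
            source.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compiles the named pattern the first time it is requested, then reuses it. */
    Pattern get(String name) {
        return compiled.computeIfAbsent(name, key -> {
            String regex = source.getProperty(key);
            if (regex == null) {
                throw new IllegalArgumentException("no pattern named " + key);
            }
            return Pattern.compile(regex);
        });
    }

    public static void main(String[] args) {
        // Properties files treat backslash as an escape too, so this sample uses [0-9].
        PatternStore store = new PatternStore("zip=[0-9]{5}(-[0-9]{4})?\n");
        System.out.println(store.get("zip").matcher("84101-1234").matches()); // true
    }
}
```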
Chapter five is the big examples chapter. All of the examples are intended to be practical: the kind of thing you might have to address at the day job. With examples covering Zip codes, telephone numbers, dates, searching text files and even validating an EDI document, he seems to have delivered on that intention. There are further examples in Appendix C, if the aforementioned patterns aren't enough.
The writing and progression of material are good. The examples are very well thought out and explained. Many of the examples are built from first principles. Mr. Habibi seems to want to not only teach you how to use regular expressions, but also how to design them. He does this by working up from an understanding of the data until he has a working regex.
While it doesn't make any promises about being an encyclopedia of regex patterns, this book does contain enough of the normal business patterns to be a useful initial reference work, before turning to the Internet to search for patterns.
If you want an encyclopedic reference work on regex, then buy Jeffrey Friedl's Mastering Regular Expressions, which is published by O'Reilly. This is not that book, preferring to stick with the practical usage of regex.
This is a great starter book for developers who are new to using regular expressions in Java.
You can purchase Java Regular Expressions from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
However, like many things in computer science, speed gains come at a price. In this case, the regular expression language supported is not quite as rich as the JDK implementation.
Sigs cause cancer.
Me: I'll have a Grande Cafe au Lait please.
/me hands over cash, takes careful first sip.
Starbucks Employee: That'll be an hour's wages please.
Me: Thanks!
That's when you get to see my java regular expression.
Generally it will be me wincing in pain because I just burned my tongue. Sometimes, if it's cooled enough, you'll hear a quiet "MmmMmmm" in the style of Family Guy's Herbert.
I tried to do a bit of recursion in regexes once, like ((\d+)\.)+, but that didn't work. It's too bad, because I don't think there's another way to dynamically match data in regexes. Other than this, they've served me very well all these years.
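If the problem was that a repeated group only hands back one value (my guess at what the poster ran into), a sketch of the behaviour and one common workaround follows. In java.util.regex a group inside a repetition keeps only its last iteration, so the usual alternative is to loop with find() on the inner piece. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroups {
    private static final Pattern DOTTED = Pattern.compile("((\\d+)\\.)+");
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    /** Group 2 sits inside the repetition, so only its final iteration is kept. */
    static String lastIterationOnly(String input) {
        Matcher m = DOTTED.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    /** Collects every number by calling find() repeatedly on the inner pattern. */
    static List<String> allNumbers(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lastIterationOnly("1.22.333.")); // 333
        System.out.println(allNumbers("1.22.333."));        // [1, 22, 333]
    }
}
```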
Send email from the afterlife! Write your e-will at Dead Man's Switch.
s/Java/Bloated\ piece\ of\ shit/
Regular expressions (regex to their friends) are an incredibly powerful addition to most programmer's personal toolkit of techniques. Programming using a language that doesn't support them can be frustrating if you need to do any amount of non-trivial string handling.
Er, no. It is only for trivial string handling that the regex approach is useful.
For non-trivial string handling (particularly if you feel like giving the authors of erroneous strings helpful error messages!!) I'll write a proper lexical analyser and a proper parser every time.
Although I know Slashdot gets kickbacks from B & N, it seems inconsiderate of them to make their readership pay too much when Amazon has the book cheaper
Of course, if you're using a language that doesn't have built-in regular expressions, you might still have good regular expression libraries available to you. Boost::Regex is a great choice for C++, for instance.
Are you serious? What kind of company would do that? It's madness!
dominionrd.blogspot.com - Restaurants on
Linux is STILL for dirty faggots.
My main complaint about java regexps is that all the backslashes have to be quoted with a backslash, making them completely unreadable compared to a language that supports regular expressions natively, like perl (no, a standard library is not technically native support). "\d" becomes "\\d" and so forth. Does anyone know a simple way around this? We just started using java regexp's at work, so the extra backslashes don't bother most people, but they are extremely annoying to those of us with a lot of perl experience.
P.S. How many slashdotters thought they'd be rolling in their graves by the time they heard an example of where perl is more readable than java?
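As far as I know there is no way to switch off escape processing inside a Java string literal, so the doubling stays for patterns written in source code. The sketch below only shows what the doubling amounts to and how Pattern.quote helps with literal text; the Escaping class and its sample strings are hypothetical. Patterns that arrive at run time (user input, a plain text file) are not Java literals and need no doubling.

```java
import java.util.regex.Pattern;

final class Escaping {
    // The compiler consumes one level of backslashes: the regex \d{3} is written "\\d{3}".
    static final Pattern THREE_DIGITS = Pattern.compile("\\d{3}");

    // Pattern.quote wraps literal text so none of its characters act as regex syntax.
    // The Java literal "a.b\\c" is the five-character string a.b\c.
    static final Pattern LITERAL = Pattern.compile(Pattern.quote("a.b\\c"));

    public static void main(String[] args) {
        System.out.println(THREE_DIGITS.pattern());                 // \d{3}
        System.out.println(THREE_DIGITS.matcher("123").matches());  // true
        System.out.println(LITERAL.matcher("a.b\\c").matches());    // true
        System.out.println(LITERAL.matcher("axb\\c").matches());    // false
    }
}
```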
This space intentionally left blank.
The missing Regular Expressions is what kept me off Java and on Perl for a looong while. I started using ORO and since their introduction into Java itself I almost completely switched over. I really do hope Perl 6 will be released and lives up to its expectations.
Having said that I really don't see why you have to devote a complete book on regex. A small tutorial does just fine.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
Slightly off-topic, but...
Back when my only experience was development on Windows I was very frustrated with the lack of good string handling in Microsoft languages (VB, T-SQL). If you didn't find a third-party library you had to write a lot of expensive code to do fancy string searches. Try writing recursion in VB6 without bringing your computer to a screeching halt.
Then when I switched to linux and open source I was shocked to learn that something as useful as regex had already been around for many years. Most of the Windows developers I knew never even heard of it. It was tricky to learn but has paid off many times over in utility.
Every developer is better off for knowing it. Even if they never use regex the thought process in understanding it is quite interesting and educational.
Developers: We can use your help.
C is an incredibly powerful addition to most programmer's personal toolkit of techniques.
..oh ..we are talking about CS students who discover the joys of the likes of Java on their long path from..
...to...
...
...???
;-)
10 print "hello world"
20 goto 10
struct filter {
int (*open) (void *);
int (*close) (void *);
};
Nevermind then... come back in 10 years... (if you're still a programmer by then)
Sure, there were libraries like ORO that would provide regex support, but it wasn't built in and not many companies allow the use of 3rd party libraries
Who's boneheaded enough to do this? I want to know so I can avoid buying anything from them, because their products are going to be overpriced by at least 50% due to the wasted effort.
I can understand restricting third-party libraries to those of a certain license, like BSD or LGPL, but a blanket ban without any exceptions for something as essential as regular expressions? That's just stupid.
One of the biggest advantages of Java is the enormous number of high-quality third-party libraries available.
Is this just something the submitter dreamed up to fill space, or do companies actually do this?
This space intentionally left blank.
Who are these companies and what can possibly be their justification for such a blanket policy? I can understand for some ultra-high security/uptime systems with incredibly strict standards and processes who would need to put third party code through an extensive and expensive audit. But for the rest of us? No jUnit? log4j? Is Boost allowed? Good lord, I can't imagine programming in such a world.
I hope I never work for one of these firms.
Taft
I believe fear is the primary culprit here. Many places I've worked for/with only allow internally developed library use... And I'm sure half of it is swiped, stolen, or 'inspired' by popular, free, open source, 3rd party libraries.
Java...Regular expressions? Error!!!
Regular expressions belong in a real programming language like Perl where they
seamlessly blend in with the arcane chaos that looks like when the Dyslexic Liberation Front
blew up the alphabet spaghetti factory.
Why would Java "programmers" sully their nice looking suburban code with ascii vomit?
It's just not natural.
I'm at one right now (hence why I'm posting as an AC), and my previous employer was like that as well (except we were allowed minimal use of Struts on one project). It's typical "not invented here" reasoning, usually from "software architects" convinced their own home-grown platform/library/framework is better than anything else out there.
In my experience, it leads to systems with too long of a ramp-up time for new hires to start working on and delays to tweak the library for every new thing the developers are trying to accomplish. But it doesn't matter that a simple project took months to accomplish, as long as there's a perfect (in their eyes) foundation they can sneak out the back door when they finally leave.
Somebody hasn't worked for "many companies." _Every_ company I've worked for allowed 3rd party libraries. (Sure, there are processes to make sure you don't do something stupid like ship a GPL library with a closed-source product, but that's just common sense.)
I spoke about the "regex coach" tool from http://weitz.de/regex-coach/ on my podcast (shameless plug!) http://webdevradio.com/ - it's a great tool for helping visually walk through the regex creation process, especially for complex needs.
creation science book
Save yourself $14.80 by buying the book here: Java Regular Expressions. And if you use the "secret" A9.com discount, you can save an extra 1.57%! That's a total savings of $15.20, or 38.58%!
One of the reasons we as programmers write code is to take a very complex idea, like a software application, and write something that a human engineer can understand. The KISS principle especially applies to coders.
:)
As I get older, my code has gotten more and more straightforward, because I consider the maintenance cycle of code to be more than 95% of the puzzle. And these days, I have more than one security analyst who is not a senior software engineer poking around my code.
RegEx's are not-so-readable and not-very-maintainable programming abstracts that should be avoided whenever possible. I prefer using string manipulation abstraction classes (such as my own version of StringTokenizer). They are not as fast and furious as other methods like lexical analysis, and the code is more bloated, but the code is Straight Forward And Easy To Read. There is a power in code of this nature, and my clients have thanked me more than once for not focusing on writing "cool code" but on writing "clean and simple" code. I just tried to paste in a few ugly regex samples, but Slashdot blocked me, calling them "junk characters". I agree!
For example, take XPATH, this is a clean and simple way to address XML objects. Sure, there is an additional level of abstraction, but you can look at an XPATH query, even from a layman's point of view, and have a clear understanding as to what it is doing.
Horns are really just a broken halo.
ok, java sucks, (it compiles to VM code, not native code). this is the same reason why C# sucks C++ is by far my fav language. also, out of all the new programming languages, the only one that is any good is D, see why at http://digitalmars.com/d/comparison.html
Really, Java is not meant to be a string processing utility. It is honestly too slow with too much overhead for this type of functionality. Regular expressions were meant to be used in the occasional light occurrence of string processing in Java. If you are really needing some string processing, like over a large dataset, stick with something like Python which is based on C++. It is fast with some very cool tools, such as regex, dictionary use, etc. Even if you need a light GUI, you could always interlace some Python with TK.
Development notes at http://devscribbles.blogspot.com
I'm glad that it's there, and I suppose it was useful during my prototype phase, but a little profiling revealed that my app was spending half its time parsing input. Dumping out the input to String and sometimes char[] and doing the parsing myself in hand tooled code almost completely erased the speed hit I was taking on load.
Start Running Better Polls
Any company that doesn't allow, nay, embrace third party jarballs is missing 98% of the point of Java. The language is so-so, the built in libraries are nice, but not infinite - but the ability to load componentized, versioned, packaged third-party tools is priceless.
If I were to ask everyone to start programming in assembly language, I suspect that I would be laughed at. Yet with regular expressions that is exactly what we are doing. If you take a look at the history of regular expressions, you will find staring right back at you the guts of compiler theory with state machines, finite state automata, etc. Instead of asking for regular expressions, programmers should be asking for higher level pattern matching facilities. Something as simple as finding the balanced parentheses in the string: (a+b)/((c-d)+e) using a regular expression is difficult. Yet there have been languages that have advanced string matching capabilities around since the 60's (start looking at Snobol -- which is still alive -- and some of its descendants).
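For the balanced-parentheses case, the usual non-regex approach is a depth counter. The sketch below only checks balance (it does not extract the groups), and the Parens class is a made-up name.

```java
final class Parens {
    /** Returns true if every ')' closes an earlier '(' and nothing is left open. */
    static boolean balanced(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // a ')' appeared with nothing open
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(balanced("(a+b)/((c-d)+e)")); // true
        System.out.println(balanced("(a+b))/((c-d)+e")); // false
    }
}
```

The counter is exactly the piece of memory that a classical regular expression lacks, which is why arbitrary nesting is awkward to express as a pattern.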
And I find it much easier to follow.
Rediculous: A word indicating the writer is ridiculously ignorant.
That depends whether searching for content in a string is "trivial" or not. More likely, it comes from only encountering complex problems in one of the two subsets and only trivial problems in the other. There are non-trivial problems in both subsets.
...
:-(
The subset of problems you use a regex for are those where there are non-trivial patterns in the text that you wish to extract. The subset of problems you use a parser/lexer for are those where there is some formal model that describes the syntax the input is expected to have.
These two problem sets do NOT often overlap. If you're using the wrong tool for the wrong problem, you're in for a world of hurt. You do NOT want to parse XML/HTML/etc. with regexes (you can do a few things, but you open yourself up to a world of well-deserved pain when you realize the true evils of nesting and how they affect regexes).
Similarly, there's no way in hell you want to search unstructured text with a parser/lexer. Yes, *unstructured* data. Programmers actually deal with that from time to time. It's when we use regexes. You know, when searching for *patterns*
I've written both. I've used both. They're both great problem solving approaches, but using the wrong one invites pain. Sure, maybe you can get by with a half-assed system that has bugs your users will never find (e.g. they won't nest anything too deeply for the XML regex to find), but it's still a bad idea.
So please, please use the right tool for the job. With my luck, I'll get stuck maintaining your code if you don't
I recently wrote a small app based on "Filter Builder" by ActiveState. It's called Pattern Sandbox and has helped me rapidly prototype regexes for both Java and Perl (because the Java dialect is very similar to Perl's). I made Pattern Sandbox because it was so annoying to write a regex, compile, get to that part of the code/interface, and then finally try it just to find that it does not work correctly so I have to repeat this process until I get it right. If you are using Java regexes on a regular basis, Pattern Sandbox or similar tools are indispensable. Try it out and feel free to give me some feedback. I hope this is not too much of a plug, but I thought it to be very appropriate.
It is just that you should not use a fork to hammer a nail.
Balancing parentheses was just the first example my teacher told the class when explaining that regular expressions were not suited for everything and that sometimes you had to use grammars.
Why can't
Great things about the Java 1.4+ regex support, from my perspective, include that (1) it's nearly as full-featured as Perl's regexes (and thus far better than Javascript's); and (2) it's usable in web browsers and via embedded applets.
Those were both key to helping me create Regex Powertoy, an interactive visual regex tester, much like others mentioned in this discussion -- but fully implemented in a browser. It's in JavaScript and DHTML, with a Java applet for the full-featured and step-controlled regex matching -- requires FF1.5+/IE6+ & Java 1.5+.
Check it out, break it (it's still got some rough edges under heavy input), let me know how it could be improved.
Gah, and to think I passed that class :P I just hadn't realised that all that theory about automata and K* and whatnot applied to the real world!
Send email from the afterlife! Write your e-will at Dead Man's Switch.
private final Pattern methodPattern = Pattern.compile("^(.*) .* HTTP/.*$");
private final Pattern versionPattern = Pattern.compile("^.* .* HTTP/(.*)$");
private final Pattern resourcePattern = Pattern.compile("^.* (.*) HTTP/.*$");
Happy days.
There was some weirdness with GCJ not behaving like Sun's Java, but that seems to have gone away with the last update to GCJ I did.
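The three patterns in this post could also be folded into one pattern with three groups. This is an alternative sketch, not the poster's code: I have swapped .* for \S+ so that each field stops at the first space, and RequestLine is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RequestLine {
    // group 1: method, group 2: resource, group 3: HTTP version
    private static final Pattern REQUEST =
            Pattern.compile("^(\\S+) (\\S+) HTTP/(\\S+)$");

    /** Returns {method, resource, version}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = REQUEST.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("GET /index.html HTTP/1.1");
        System.out.println(String.join(", ", parts)); // GET, /index.html, 1.1
    }
}
```

One match per request line instead of three also means the line is scanned once.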
C-x C-s C-x k
If you only program in Java, and you have yet to use regexes, then I could see why you might possibly want this book. But how is it that much better than a general purpose regex book (of which there are several). I would think it would be more useful to have a book that covers regexes as a computing concept and then talks about the differences/limitations of different implementations (grep, sed, Java, JavaScript, Perl, etc.) Is Java still a big enough buzzword to sell books?
If you can read this sig, you're too close.
Let's light this book on fire? What else can Java do half right that's already been perfected.
You need to create two sets of FIFOs, one for you to talk to your child, and one for it to talk back.
You fork, then dup2 the child's STDIN to the "far end" of the former pipe,
then you dup2 the child's STDOUT onto the "far end" of the latter pipe.
Finally, you exec() in your child.
You hold onto the two near ends and use them as separate Input/Output streams for control.
You're going to need to:
1) Catch SIGPIPE for when the spawned process closes its reading end of the pipe.
2) Catch SIGCHLD so you know when the process exited.
3) Set your near OutputStream to autoflush mode.
On top of all this, your remote program has to be able to work in an unbuffered mode. Most command line programs don't. They are designed to work with files, and STDIN/STDOUT that are already "in the right mode", having inherited them from a program that had them attached to a TTY.
That is probably the issue you are having.
Some programs like 'cat' have a -u option which basically sets autoflush on their end so that you receive data to read as soon as it's available, and not when the fifo decides to flush.
You can stick that into the beginning of the pipeline and it should encourage the others to flow if they don't have an explicit unbuffered mode themselves.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
...or like enums or any other "magic strings" that you need to make your code actually DO SOMETHING besides act as a framework passing data around...
1) POSIX classes are your friends
2) Build large regexes out of small regexes
3) Compile and name your regexes
4) Hide regex matching details inside of class methods when appropriate
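The four tips above can be sketched together in a few lines. The date format and the IsoDates class are assumptions for illustration; the pattern checks shape only, not whether the day exists in that month.

```java
import java.util.regex.Pattern;

final class IsoDates {
    // Tips 1 and 2: small named pieces, using the POSIX class \p{Digit}.
    private static final String YEAR = "(\\p{Digit}{4})";
    private static final String MONTH = "(0[1-9]|1[0-2])";
    private static final String DAY = "(0[1-9]|[12]\\p{Digit}|3[01])";

    // Tip 3: compiled once and given a descriptive name.
    private static final Pattern ISO_DATE =
            Pattern.compile(YEAR + "-" + MONTH + "-" + DAY);

    // Tip 4: callers see a method, never the regex.
    static boolean isIsoDate(String s) {
        return ISO_DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIsoDate("2006-12-31")); // true
        System.out.println(isIsoDate("2006-13-01")); // false
    }
}
```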
I mean, what would you do if you needed a recursive descent parser? Or do we do everything via XML now?
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
...is a parser. Invented about the same time. But those are typically based on transformation rules and regular expressions to tokenize your input.
You could always build your own regular expression compiler. It's not unheard of. But I submit that the "language" is small enough that it's not worth it.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
You might enjoy the novel way regular expressions are implemented in Scsh, the Scheme Shell.
http://www.scsh.net/
And now to celebrate this new-found ability to manipulate strings easily:
s/trench,/trench/;
Ah, I knew that would make me feel better.
not many companies allow the use of 3rd party libraries
I assume the review author hasn't worked for many companies then. I have yet to find any company that doesn't use third party packages. Logging, XML parsing and unit testing are just the first three things that spring to mind when I consider what might require a third party package. As for the "DLL hell" that someone alleges in a post to this thread, it's virtually non-existent. You ship the third party packages with your application (as a single JAR or WAR file), and rely on the accepted good practice that people don't set a default CLASSPATH these days.
Man, that's why I don't use Java. I mean - you need a whole book to learn how to use regular expressions in Java? In Perl =~ s/hard/easy/ ;-)
Zen tips: Pay attention. Don't take it personally. Believe nothing.
Aside from Boost being horrid bloatware , what exactly is wrong with the standard POSIX regexp functions? Look up regcomp() , regexec() etc which have been part of the standard C API for years.
Every Java application you will ever see has a lib directory and in it are the jar (library) files it needs. The script or shortcut you use to start the app will ensure they are on the classpath.
No set up, no messing, no conflicts with anything.
Most commercial apps (esp. on Linux) will come with their own JRE too.
"Regular expressions (regex to their friends) are an incredibly powerful addition to most programmer's personal toolkit of techniques"
Can you cite a source?
Final 2006 "Proof of Global Warming" US Hurricane Count -> 0
You can use Jet or other Native compiler for Java. It can help obfuscate the code as well. Native is go for Java.
The C based approach is necessary because it's a "unix thing" and the issues you have with external process + (x language) are OS-dependent, not language dependent.
I don't know what the equivalent to "dup2" is in java. Ultimately it's the system call you want your language to use to make the rubber meet the road. I'm sure there's a POSIX class or something you can leverage.
(Example: In perl you'd use open with the ">=" prefix. But that lulls you into a false sense of portability. I prefer to "use POSIX qw(dup2)" and just dup2 directly.)
And I noticed that "cat -u" is useless on linux after submitting the post. Instead, check out "Expect" and the utility programs that come with it; specifically "unbuffer". It takes its arguments and runs them with the stdout flushed for you. Unfortunately you have to use it in each stage of your pipeline. So like:
unbuffer tail -f /some/fifo | unbuffer od -t x1a | less
I thought maybe you only had to do the first one to "prime the pump", but I was wrong. The only one you don't have to do is the last one.
And in your case, since you are the final reader (and you already autoflush your writing pipe), you don't need unbuffer since you are already doing it, so to speak.
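On the Java side there is no direct dup2 call in the standard library that I know of; ProcessBuilder sets up the pipes for you and hands back the near ends as streams. A minimal sketch follows, assuming a Unix-like system with `cat` on the PATH (chosen because it echoes its input promptly); ChildPipes and roundTrip are made-up names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ChildPipes {
    /** Sends one line to a child `cat` process and returns the line it echoes back. */
    static String roundTrip(String line) {
        try {
            Process child = new ProcessBuilder("cat").start();
            try (BufferedWriter toChild = new BufferedWriter(new OutputStreamWriter(
                         child.getOutputStream(), StandardCharsets.UTF_8));
                 BufferedReader fromChild = new BufferedReader(new InputStreamReader(
                         child.getInputStream(), StandardCharsets.UTF_8))) {
                toChild.write(line);
                toChild.newLine();
                toChild.flush();   // without this the line can sit in our own buffer
                String reply = fromChild.readLine();
                toChild.close();   // end of input on the child's stdin lets cat exit
                child.waitFor();
                return reply;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

The explicit flush covers only the parent's side. If the child buffers its own output when it is not attached to a terminal, the readLine call will still block, which is the situation the unbuffer advice above addresses.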
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON