Parsing Algorithms and Resources?
Derek Williams asks: "I'm a senior majoring in computer engineering & computer science and I've been programming for about 7 years, mainly in C and Java. While I've had quite a few courses that delve into some of the deeper topics of programming (e.g. Object Oriented Design), I find that the majority of programs I write, both for work and elsewhere, involve parsing. Although I have no problem tackling these sorts of programs, I was wondering if there was some branch of computer science dedicated to the study of parsing. What books and websites out there are of interest to someone looking to learn more about parsing and algorithms relating to it?"
...they are having a party.
Parsing Techniques - A Practical Guide
Flexible Parsing
Workshop on The Evaluation of Parsing Systems
Robust Parsing
Parsing Resources
Probably the last one on that list would be the most useful starting place...
...that couldn't even get first post.
Having a party.
Your parsing adventure has barely begun.
Until you've parsed the Linux kernel in a single regex, you haven't lived.
I have been pwned because my
Compilers and parsing seem to go hand-in-hand. With that in mind, here's a link to Jack Crenshaw's How to Build a Compiler
you should at least look to the results of the search. One of the links is about compression algorithms and other two are about natural language parsing.
The following two are the only ones relevant:
Stepan Kasal
In short, unless you have a very good reason (i.e. you know what you are doing) to write your own parsers, you should be using a parser generator (use google with the right keywords and you will find plenty of resources) or regular expressions (for the simple stuff, again: google is your best friend).
Jilles
Try some books on Compiler Design, especially the first chapters about analysis. They should provide lots of information about parsing. One good book is Modern Compiler Design .
Start with Learning Perl, proceed with Programming Perl and finish off with Mastering Regular Expressions. Your parsing needs will be filled forever.
You've got a couple choices -- finding yourself a good regular expression library seems like a good start ;-) If you're looking to do something a little more interesting than just lexical analysis, check out the red dragon book (better known as Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman. I used it in my compiler course and I can tell you that they hit all the various parsing techniques (recursive descent, LA, LALR, SLR, etc.) very well, along with some other stuff. They concentrate on Lex/Yacc as tools -- you may prefer to check out ANTLR -- Terrence Parr's parser generator. It can be targeted at a bunch of languages and can also produce tree walkers for when it comes time to use your parsed data.
For reading about parsing (and regular expressions), the book of O'Reilly "Sed and Awk" is a good start.
l
:)
Based (and extended), you'll find a lot of information in "Programming Perl".
And if you're writing in C, again the O'reilly book "lex and awk".
http://py-howto.sourceforge.net/regex/regex.htm
http://sitescooper.org/tao_regexps.html
Hey, I can't help it, that I find the books by that publisher exactly what I'm looking for in my programming needs
Genius doesn't work on an assembly line basis. You can't simply say, "Today I will be brilliant."
The best tools to build parsers and manipulate parsed syntax trees are functional languages, such as OCaml (with its streams parsers, camlp4, ocamllex/ocamlyacc, etc.), or SML, Haskell, etc.
Of course, if you were a LISPer, you'd know that although you have lots of well-known tools to build new parsers, such as Meta or Zebu, the best thing to do about a parser is not to write it, but rather to reuse the builtin extensible universal parser, READ, and its extensible universal unparser, WRITE.
If you spend most of your time writing parsers, you're not just using the wrong tools, you're also using the wrong approach.
Just my .2 mg of e-gold worth...
-- Faré @ TUNES.org
Reflection & Cybernet
Well...the Dragon book for starters, as mentioned earlier. That's probably the ur-source for most of the theory behind the magic. Makes my head hurt, though.
Terence Parr's book, Practical Computer Language Recognition and Translation (out of print). His doctoral dissertation is a useful thing too (try the Purdue University library).
comp.compilers is another useful resource. It's archived at http://compilers.iecc.com.
Alan Holub's Compiler Design in C is a classic.
The ACM's SIGPLAN ("Special Interest Group On Programming Languages") and it's journal SIGPLAN Notices of the ACM are all fine resources. So is ACM Transactions on Programming Languages and Systems.
Don't forget the IEEE as well.
Not to mention Abelman and Sussman: Structure and Interpretation of Computer Programs.
The garbage collection page is a good source for information on memory management and garbage collection.
Your university's library is another good resource.
Well. That should keep you out of trouble.
N. --
Finite automata & Natural Language. Study of Finite State Machines progressing to more complicated grammars and languages. At PSU it's a relatively difficult undergrad elective.
Very informative and much more interesting than most of my other courses. Doesn't teach you lex/yacc (not that hands-on), but you will be better grounded in regular expressions, languages, parsing, etc.
--Mike--
I'm surprised that no one has mentioned lex and yacc. Guys, perl != generic parser. If you're looking to delve into the science of scanners and parsers, from tokenizing to LR1 grammars, look at lex and yacc. Their java equivalents, JLex and CUP, are just as good for that language (and a whole lot easier to use, I might add). OReilly has a small book on using them, and it would probably be a good introduction to the science of parsing - not to mention the foundation of compilers.
Much of parsing has to do with regular expressions, which, in, turn, deal with NFAs/DFAs. Most parsing, in the end, comes to these finite automata and has been on most schools' CS curriculi(sp?) for many many years.
While it's kind of scary that you're a senior with 7 years of experience and don't know this, it should be easy to catch up by reading the beginning of a compilers book. Older compiler books tend to concentrate on parsing a lot, since parsing used to be "most" of the work involved in compiling. Actually, all the theory on how to conserve the last bit of memory using some kind of wacky table is pretty useless these days, but it's there.
Though perl's built-in regexp support makes its syntax more brief for matching regular grammars, many grammars are not regular (The grammars of most programming languages are context free or worse!). For those grammars, a parser generator like bison can make life a bit easier.
Then, manipulating parse trees in a language like C or perl is pretty painful once you've used datatypes and pattern matching in a language like SML, O'Caml, or Haskell. If you're going to be doing a lot of work on the parsed data, and its structure is not trivial, give these a shot.
Finally, about the only thing that's excited me about parsing for years is the idea of parsing combinators: writing parsers that look like the grammar IN the language itself. They can allow the parsing of arbitrary grammars, and don't require resorting to external tools. There's a good beginning explanation in Paulson's "ML for the Working Programmer", though I can't find a tutorial online... maybe it's time to write one!
Of course there's a branch of CS that studies parsing. It's called the Theory of Computation. Any good introductory Theory of Comp. book will give you an introduction to parsing. Any decent book on compiler construction will give you the gory details about implementing parsers. What do they teach CS students these days, if a senior doesn't know this?
Your needs for parsing regular (or regular-ish) grammars in perl will be filled. Unfortunately, regular expressions are not always the right tool for the job (see the gargantuan e-mail address parser in the back of the MRE book for an example) because many grammars are not regular. Try parsing (say) C with perl! Also, perl has poor facilities for manipulating such parse trees, especially compared to a language with datatypes and pattern matching like SML or O'caml.
Perl makes parsing flat data brief (though regular expression libraries in other high level languages make it just as easy, if a little more verbose), but it is not by any stretch the be-all and end-all of parsing.
The Dragon Book is the standard reference on this subject.
However, I've found that Constructing Language Processors for Little Languages by Randy M. Kaplan helped me more in constructing small one-off parsers for various purposes. The Dragon Book provides a good grounding in theory, but I've found the utilitarian approach in Little Languages more accessible.
I have found this book very good. I am a Java programmer and I wanted to understand parsing from the bottom up, before using something like ANTLR. This books gives a great foundation with code to explain it all.
mp3's are only for those with bad memories
We had 2 courses at school, the first being abstract machines, languages, and computations (regular expressions, grammars, automata, turing machines). The second was application of the first - programming languages and their processors.
See "Introduction to Languages and the Theory of Computation" by Martin.
The front end of every language compiler or interpreter is a parser. Start there. Lex, Yacc and ANTLR are good places to start. Perl was orginally designed to parsing files and most languages have parsing libraries in them but their value can vary greatly.
then C++ shouldn't scare you too much. Check out the Spirit parser library. In a nutshell, it's kind of like Lex and Yacc, but without the need for seperate tools, ie it's all done with standard C++. On the theory side, while everybody is mentioning compiler books, I find those to be a little rushed. Get a good book on the theory of computation (sorry, don't remember the name of the one I like :( ). I found their explanations of automata, regular expressions, grammers, etc etc much more elucidating than the compiler books. I think that this is because the comp theory books treat them for their own sake instead of in passing. And, of course, if you're looking for algorithms, Knuth has plenty :)
The following is not meant as a flame.
This is an odd question coming from a cs/ce senior. What, may I ask, have you been studying, when you should have been in a compiler course?
I've been out of school for a while, but it is inconceivable that compilers would now be an optional part of any CS curriculum.
- Programming Language Research - Links maintained by a CMU student.
- Compilers.Net
- Lambda the Ultimate - I found this from Meerkat. While somewhat more esoteric than straight up parsing talk, I'm seeing it spawn alot of programming language discussion across blogs.
Just to be redundant though, none of these sites are a replacement for a good low level book, like the Aho, Sethi, Ulman text. Of course, if you are looking to really bake your noodle, go straight to computational linguistics, starting with Chomsky.*Smirk*
While most of the responses have to do with parsing computer languages, there are also more general parsers -- particularly ones aimed at natural languages. Web search will also find these, but look at Bill Woods' augmented transition networks and the Kay parsing algorithm.
While there's lots of theory, I've been living for years off the same small set of tools:
For C/C++, there's YACC/LEX or BISON/FLEX.
For Java, there's CUP/JFLEX.
They are for the undergrads at San Diego State! I am planning on taking the class because I can't imagine not taking the subject. However, they require systems programming which glosses over compilers along with linkers, loaders, etc.
Okay, lots of you guys are bashing "C" here.
I say you're wrong. "C" is the ideal tool
for smashing through some data and finding
what you want. 90% of the time, cranking up
vi/emacs and running code through gcc/cc and
pumping your data and futzing with the data
in a "gets()" or custom "get_token()" routine
is just what you want. Yeah, yeah, PERL's fine
for some stuff. Regardles, "C" is the more powerful tool and run's fast too.
I can't speak for other folks, but at GMU, compilers isn't required -- we're required to take CS330 - Formal Methods and Models. I don't know about the rest of y'all, but I had a hard time staying awake through my automata/logic/language type classes.
Some have mentioned the dragon book (Aho, Ullman et all) and the tiger book (Appel's "Modern Compiler Implementation"). Both excellent references.
If you're at all mathematically minded, you should look into a good book on formal languages, such as Harrison's "Introduction to Formal Language Theory", or R. Nigel Horspool's "LR Parsing". (Caution: Don't buy these. Look for them in your local university library.)
Once you have the basics, you should look into the work of Mark Hopkins. The guy is a genius, and understands more of the theory of parsing than anyone I've ever seen, and has invented some techniques which far surpass the "state of the art" of lex and yacc.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
-Kevin