Slashdot Mirror


Parsing Algorithms and Resources?

Derek Williams asks: "I'm a senior majoring in computer engineering & computer science and I've been programming for about 7 years, mainly in C and Java. While I've had quite a few courses that delve into some of the deeper topics of programming (e.g. Object Oriented Design), I find that the majority of programs I write, both for work and elsewhere, involve parsing. Although I have no problem tackling these sorts of programs, I was wondering if there was some branch of computer science dedicated to the study of parsing. What books and websites out there are of interest to someone looking to learn more about parsing and algorithms relating to it?"

52 comments

  1. All of them... by Anonymous Coward · · Score: 0

    ...they are having a party.

  2. A quick search on Google for parsing returns... by arb · · Score: 4, Informative
  3. You are a karma whore... by Anonymous Coward · · Score: 0

    ...that couldn't even get first post.

    Having a party.

  4. Learn Perl by ObviousGuy · · Score: 1

    Your parsing adventure has barely begun.

    Until you've parsed the Linux kernel in a single regex, you haven't lived.

    --
    I have been pwned because my /. password was too easy to guess.
    1. Re:Learn Perl by Anonymous Coward · · Score: 0

      dotasterisk

    2. Re:Learn Perl by Anonymous Coward · · Score: 0


      You are totally lame.
      If perl were the height of computer programming, I'd have killed myself years ago...

  5. Compilers and Parsing... by hackwrench · · Score: 3, Interesting

    Compilers and parsing seem to go hand-in-hand. With that in mind, here's a link to Jack Crenshaw's How to Build a Compiler

  6. A quick search on Google by arb by kasal · · Score: 3, Funny
    Hallo all and arb,
    you should at least look to the results of the search. One of the links is about compression algorithms and other two are about natural language parsing.

    The following two are the only ones relevant:

    Stepan Kasal

    1. Re:A quick search on Google by arb by arb · · Score: 1

      The point was a simple Google search can answer this, and most other Ask Slashdot questions that have been appearing lately. I'll admit I didn't read any of the sites I listed, because I have no interest in parsing...

  7. duh! welcome to the 21st century by jilles · · Score: 2

    In short, unless you have a very good reason (i.e. you know what you are doing) to write your own parsers, you should be using a parser generator (use google with the right keywords and you will find plenty of resources) or regular expressions (for the simple stuff, again: google is your best friend).

    --

    Jilles
  8. Read books on Compiler Design by soegoe · · Score: 2, Interesting

    Try some books on Compiler Design, especially the first chapters about analysis. They should provide lots of information about parsing. One good book is Modern Compiler Design .

    1. Re:Read books on Compiler Design by speedy1161 · · Score: 2, Interesting

      This book is heavy on the theory and such, a bit light on examples, but since they primarily use (f)lex and (b)yacc, finding examples if you get stuck isn't too hard. They dedicate a lot of pages to parsing and parser theory.

  9. Book recommendations by Evil+Attraction · · Score: 3, Informative

    Start with Learning Perl, proceed with Programming Perl and finish off with Mastering Regular Expressions. Your parsing needs will be filled forever.

  10. Red Dragon Book by blackcoot · · Score: 5, Informative

    You've got a couple choices -- finding yourself a good regular expression library seems like a good start ;-) If you're looking to do something a little more interesting than just lexical analysis, check out the red dragon book (better known as Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman. I used it in my compiler course and I can tell you that they hit all the various parsing techniques (recursive descent, LA, LALR, SLR, etc.) very well, along with some other stuff. They concentrate on Lex/Yacc as tools -- you may prefer to check out ANTLR -- Terrence Parr's parser generator. It can be targeted at a bunch of languages and can also produce tree walkers for when it comes time to use your parsed data.

    1. Re:Red Dragon Book by klui · · Score: 1

      I'd have to agree that the dragon book is required reading if one wants to know about parser theory. However, it is a very dense book and without proper guidance, one can get very confused. Much of the examples are applicable on a theoretical level since the tools mentioned have limitations like all variables used by yacc is global to the program and is not reentrant. It's better to get the theory from dragon and then apply it to more modern tools than lex and yacc. But I haven't gone through the other links others have mentioned but I would recommend dragon as a last resort. If you want to come up to speed quickly, dragon will not help you do that.

    2. Re:Red Dragon Book by blackcoot · · Score: 1

      bison has been made to be reentrant (using the hairy skeleton) -- but this is not something for the faint of heart. just a technical point ;-) as for ANTLR, i believe that it is by design all nice and encapsulated & reentrant and stuff. altho, i really do have to wonder -- who in their right mind would want a multi-threaded parser? parsing is an inherently sequential process, so unless you're having to parse multiple streams simultaneously, i don't know why this would be of any use to you. even then, perhaps you should rethink your design -- like doing batch processing of your input and then dispatching a thread to work on the already parsed input.

    3. Re:Red Dragon Book by PD · · Score: 2

      Imagine a program, where there is an embedded scripting language. Imagine that the scripts can be attached to various objects to tie the system together. Imagine that these scripts are interpreted by a parser, and all have to run at once. Imagine that you have just found a use for a reentrant parser.

    4. Re:Red Dragon Book by blackcoot · · Score: 1

      You have me up to the last one. Parsers are not inperpreters. If you mean that you have a bunch of scripts to be interpreted concurrently, the interpreter, not the parser, needs to be reentrant. If I've understood you correctly, this problem is solved fairly easily (without requiring reentrant code parser side): the parser has an input queue containing bundles of raw source code. Parser does its trick, then spits out the resulting parse trees to an output queue where a bunch of worker threads are waiting to handle the interpreting. Considering that parsing is an inherently IO bound problem, I seriously doubt that you'll see enough of a performance difference to warrant the misery of trying to deal with the nasty mess that bison spits out when you ask it to use its reentrant skeleton. Of course, I may well be misunderstanding you.

    5. Re:Red Dragon Book by PD · · Score: 2

      Yes, you're misunderstanding me. A nice simple way of writing an interpreter is to use a parser like Bison. Write Bison rules to recognize all the statements in your language. When you've got a looping construct, just reset your input stream to the top of the loop and continue parsing. Each rule executes code that does something.

      Take a look at the ubiquitous calculator example and imagine that expanded to include all the regular language features that you'd expect to see.

    6. Re:Red Dragon Book by Mr+Z · · Score: 1

      A better example would be to consider something multithreaded, like a web-browser, loading frames. Any multithreaded program, for that matter, that might have multiple threads interested in parsing, needs a multithreaded parser. That is, unless you enjoy lock contention.

      --Joe
  11. sed and regular expressions by den_erpel · · Score: 2, Informative

    For reading about parsing (and regular expressions), the book of O'Reilly "Sed and Awk" is a good start.
    Based (and extended), you'll find a lot of information in "Programming Perl".
    And if you're writing in C, again the O'reilly book "lex and awk".

    http://py-howto.sourceforge.net/regex/regex.html
    http://sitescooper.org/tao_regexps.html

    Hey, I can't help it, that I find the books by that publisher exactly what I'm looking for in my programming needs :)

    --
    Genius doesn't work on an assembly line basis. You can't simply say, "Today I will be brilliant."
    1. Re:sed and regular expressions by den_erpel · · Score: 1

      oeps, in C that has to be "lex and yacc" of course :)

      --
      Genius doesn't work on an assembly line basis. You can't simply say, "Today I will be brilliant."
    2. Re:sed and regular expressions by J'raxis · · Score: 1

      Oreilly also has a Mastering Regular Expressions book.

    3. Re:sed and regular expressions by rhost89 · · Score: 1

      sed -e "s/C/lex and yacc/"

      --
      I will bend your mind with my spoon
  12. How to deal with parsing by Far� · · Score: 3, Informative
    There is a vast literature on parsing, and a lot of (mostly useless) parsing theory is a large part of the cursus in many CS departments. Check CiteSeer to get a glimpse of how much literature there is.

    The best tools to build parsers and manipulate parsed syntax trees are functional languages, such as OCaml (with its streams parsers, camlp4, ocamllex/ocamlyacc, etc.), or SML, Haskell, etc.

    Of course, if you were a LISPer, you'd know that although you have lots of well-known tools to build new parsers, such as Meta or Zebu, the best thing to do about a parser is not to write it, but rather to reuse the builtin extensible universal parser, READ, and its extensible universal unparser, WRITE.

    If you spend most of your time writing parsers, you're not just using the wrong tools, you're also using the wrong approach.

    Just my .2 mg of e-gold worth...

    --

    -- Faré @ TUNES.org
    Reflection & Cybernet

  13. compiler design books and resources. by ncarey · · Score: 3, Informative

    Well...the Dragon book for starters, as mentioned earlier. That's probably the ur-source for most of the theory behind the magic. Makes my head hurt, though.

    Terence Parr's book, Practical Computer Language Recognition and Translation (out of print). His doctoral dissertation is a useful thing too (try the Purdue University library).

    comp.compilers is another useful resource. It's archived at http://compilers.iecc.com.

    Alan Holub's Compiler Design in C is a classic.

    The ACM's SIGPLAN ("Special Interest Group On Programming Languages") and it's journal SIGPLAN Notices of the ACM are all fine resources. So is ACM Transactions on Programming Languages and Systems.

    Don't forget the IEEE as well.

    Not to mention Abelman and Sussman: Structure and Interpretation of Computer Programs.

    The garbage collection page is a good source for information on memory management and garbage collection.

    Your university's library is another good resource.

    Well. That should keep you out of trouble.

    --
    N. --
  14. One course that might be relevant by Covener · · Score: 1

    Finite automata & Natural Language. Study of Finite State Machines progressing to more complicated grammars and languages. At PSU it's a relatively difficult undergrad elective.

    Very informative and much more interesting than most of my other courses. Doesn't teach you lex/yacc (not that hands-on), but you will be better grounded in regular expressions, languages, parsing, etc.

  15. Reverse parsing by ka9dgx · · Score: 3, Interesting
    Things get really interesting when the parser is set up to preserve context so that it can be run backwards, spitting out source from the symbol tables and code segment tokens. You need to preserve whitespace and declaration order, but it's definitely feasible for a language that doesn't use tons of Macros (sorry C/C++ programmers) like Pascal, Basic, etc.

    --Mike--

    1. Re:Reverse parsing by TulioSerpio · · Score: 1

      VisualWorks (Smalltalk) do this, you can tell the compiler to reformat your source, for instance, this is done compiling and decompiling an object method, I think.

      --

      I'm from Argentina: Tango, Asado, Mate, Gaucho, Maradona, YPF

    2. Re:Reverse parsing by blackcoot · · Score: 1

      Having learned the hard way (again, in my compiler class), this is unfeasible for a truly compiled language for a couple simple reasons:

      1) Which symbols get stored in your symbol table? Most people don't export the entire symbol table -- just the global symbols (which usually means only global variables shared across translation units and function declarations). And even then, you aren't guaranteed to get the global symbols -- if you've linked statically or run strip over your executable to make it smaller, you're out of luck.

      2) Loops?
      Assembly like languages don't have loops, they have jumps and logical tests. So how exactly are you planning on matching loop structures, given that they all look essentially identical in assembly?

      There are other, equally hairy problems to be dealt with. If you're talking about an interpreted language, you may have more hope.

      The best you can really hope for is a disassembler and dust off your assembly skillz.

  16. Lex and Yacc by Aniquel · · Score: 2, Informative

    I'm surprised that no one has mentioned lex and yacc. Guys, perl != generic parser. If you're looking to delve into the science of scanners and parsers, from tokenizing to LR1 grammars, look at lex and yacc. Their java equivalents, JLex and CUP, are just as good for that language (and a whole lot easier to use, I might add). OReilly has a small book on using them, and it would probably be a good introduction to the science of parsing - not to mention the foundation of compilers.

    1. Re:Lex and Yacc by ncarey · · Score: 1

      Nobody has mentions lex/yacc, flex/bison or other lexical analyzer/compiler generators because that's not what he asked for. He wanted resources about parsing and the theory behind it. Of course, it begs the question: how can you get through a university CS cirriculum and not learn this stuff? Or even how to use the university library's catalog and find the information. It's not as if its difficult to find this stuff.

      --
      N. --
  17. compilers by inepom01 · · Score: 1

    Much of parsing has to do with regular expressions, which, in, turn, deal with NFAs/DFAs. Most parsing, in the end, comes to these finite automata and has been on most schools' CS curriculi(sp?) for many many years.

  18. Compiler Books, data types, combinator parsing by Tom7 · · Score: 2

    While it's kind of scary that you're a senior with 7 years of experience and don't know this, it should be easy to catch up by reading the beginning of a compilers book. Older compiler books tend to concentrate on parsing a lot, since parsing used to be "most" of the work involved in compiling. Actually, all the theory on how to conserve the last bit of memory using some kind of wacky table is pretty useless these days, but it's there.

    Though perl's built-in regexp support makes its syntax more brief for matching regular grammars, many grammars are not regular (The grammars of most programming languages are context free or worse!). For those grammars, a parser generator like bison can make life a bit easier.

    Then, manipulating parse trees in a language like C or perl is pretty painful once you've used datatypes and pattern matching in a language like SML, O'Caml, or Haskell. If you're going to be doing a lot of work on the parsed data, and its structure is not trivial, give these a shot.

    Finally, about the only thing that's excited me about parsing for years is the idea of parsing combinators: writing parsers that look like the grammar IN the language itself. They can allow the parsing of arbitrary grammars, and don't require resorting to external tools. There's a good beginning explanation in Paulson's "ML for the Working Programmer", though I can't find a tutorial online... maybe it's time to write one!

  19. Theory of Computation by richardalan · · Score: 1

    Of course there's a branch of CS that studies parsing. It's called the Theory of Computation. Any good introductory Theory of Comp. book will give you an introduction to parsing. Any decent book on compiler construction will give you the gory details about implementing parsers. What do they teach CS students these days, if a senior doesn't know this?

    1. Re:Theory of Computation by adamy · · Score: 1

      Wow, somebody stole my Rant!

      Language Theory and Computer Theory go Hand in Hand. RegEx is just the start. Push Down Automata, and Turing Machines are where my Comp Theory course finished up, with the next semester leading to Compilers.

      But that stuff is old hat. Neural Networks and pattern recognition in 3D is much funner.

      --
      Open Source Identity Management: FreeIPA.org
  20. Not... by Tom7 · · Score: 2


    Your needs for parsing regular (or regular-ish) grammars in perl will be filled. Unfortunately, regular expressions are not always the right tool for the job (see the gargantuan e-mail address parser in the back of the MRE book for an example) because many grammars are not regular. Try parsing (say) C with perl! Also, perl has poor facilities for manipulating such parse trees, especially compared to a language with datatypes and pattern matching like SML or O'caml.

    Perl makes parsing flat data brief (though regular expression libraries in other high level languages make it just as easy, if a little more verbose), but it is not by any stretch the be-all and end-all of parsing.

  21. Constructing Little Languages by rickwood · · Score: 1

    The Dragon Book is the standard reference on this subject.

    However, I've found that Constructing Language Processors for Little Languages by Randy M. Kaplan helped me more in constructing small one-off parsers for various purposes. The Dragon Book provides a good grounding in theory, but I've found the utilitarian approach in Little Languages more accessible.

  22. Java Book on Parsers by The+Slashdolt · · Score: 2

    I have found this book very good. I am a Java programmer and I wanted to understand parsing from the bottom up, before using something like ANTLR. This books gives a great foundation with code to explain it all.

    --
    mp3's are only for those with bad memories
  23. JavaCC by dexter+riley · · Score: 1
    Check out JavaCC at WebGain's site:
    "Java Compiler Compiler (JavaCC) is the most popular parser generator for use with Java applications. A parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar. In addition to the parser generator itself, JavaCC provides other standard capabilities related to parser generation such as tree building (via a tool called JJTree included with JavaCC), actions, debugging, etc. "
    It's very cool, and there are scores of grammars already defined.
  24. State machines by corrosiv · · Score: 1


    We had 2 courses at school, the first being abstract machines, languages, and computations (regular expressions, grammars, automata, turing machines). The second was application of the first - programming languages and their processors.

    See "Introduction to Languages and the Theory of Computation" by Martin.

  25. Languages and compiles... by tstiehm · · Score: 1

    The front end of every language compiler or interpreter is a parser. Start there. Lex, Yacc and ANTLR are good places to start. Perl was orginally designed to parsing files and most languages have parsing libraries in them but their value can vary greatly.

  26. If you can stomach Java by LoveMe2Times · · Score: 1

    then C++ shouldn't scare you too much. Check out the Spirit parser library. In a nutshell, it's kind of like Lex and Yacc, but without the need for seperate tools, ie it's all done with standard C++. On the theory side, while everybody is mentioning compiler books, I find those to be a little rushed. Get a good book on the theory of computation (sorry, don't remember the name of the one I like :( ). I found their explanations of automata, regular expressions, grammers, etc etc much more elucidating than the compiler books. I think that this is because the comp theory books treat them for their own sake instead of in passing. And, of course, if you're looking for algorithms, Knuth has plenty :)

  27. Branch of computer science dedicated to the study by geophile · · Score: 2

    The following is not meant as a flame.

    This is an odd question coming from a cs/ce senior. What, may I ask, have you been studying, when you should have been in a compiler course?

    I've been out of school for a while, but it is inconceivable that compilers would now be an optional part of any CS curriculum.

  28. Links from my bookmark list... by Capt.Smirk · · Score: 1
    I've browsed other replies, and I think they've missed the following: Just to be redundant though, none of these sites are a replacement for a good low level book, like the Aho, Sethi, Ulman text. Of course, if you are looking to really bake your noodle, go straight to computational linguistics, starting with Chomsky.

    *Smirk*

  29. more than LALR parsers by Anonymous Coward · · Score: 0

    While most of the responses have to do with parsing computer languages, there are also more general parsers -- particularly ones aimed at natural languages. Web search will also find these, but look at Bill Woods' augmented transition networks and the Kay parsing algorithm.

  30. A few tools do it all... by DrCode · · Score: 2

    While there's lots of theory, I've been living for years off the same small set of tools:

    For C/C++, there's YACC/LEX or BISON/FLEX.

    For Java, there's CUP/JFLEX.

  31. Re:Branch of computer science dedicated to the stu by Anonymous Coward · · Score: 0

    They are for the undergrads at San Diego State! I am planning on taking the class because I can't imagine not taking the subject. However, they require systems programming which glosses over compilers along with linkers, loaders, etc.

  32. Rebuttal from a "C" programmer. by Anonymous Coward · · Score: 0

    Okay, lots of you guys are bashing "C" here.
    I say you're wrong. "C" is the ideal tool
    for smashing through some data and finding
    what you want. 90% of the time, cranking up
    vi/emacs and running code through gcc/cc and
    pumping your data and futzing with the data
    in a "gets()" or custom "get_token()" routine
    is just what you want. Yeah, yeah, PERL's fine
    for some stuff. Regardles, "C" is the more powerful tool and run's fast too.

  33. Re:Branch of computer science dedicated to the stu by Anonymous Coward · · Score: 0

    I can't speak for other folks, but at GMU, compilers isn't required -- we're required to take CS330 - Formal Methods and Models. I don't know about the rest of y'all, but I had a hard time staying awake through my automata/logic/language type classes.

  34. Other books by Pseudonym · · Score: 2

    Some have mentioned the dragon book (Aho, Ullman et all) and the tiger book (Appel's "Modern Compiler Implementation"). Both excellent references.

    If you're at all mathematically minded, you should look into a good book on formal languages, such as Harrison's "Introduction to Formal Language Theory", or R. Nigel Horspool's "LR Parsing". (Caution: Don't buy these. Look for them in your local university library.)

    Once you have the basics, you should look into the work of Mark Hopkins. The guy is a genius, and understands more of the theory of parsing than anyone I've ever seen, and has invented some techniques which far surpass the "state of the art" of lex and yacc.

    --
    sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
  35. dumb question by khuber · · Score: 2
    Try using google.com. sheesh

    -Kevin