GCC Gets PCH Support And New Parser
Screaming Lunatic writes "GCC will finally get precompiled header support which should help with faster compile times. GCC will also be fitted with a new recursive descent parser that fixes more than 100 bugs in GCC. I'm not sure how they decomposed C++ into a context free grammar so that it could be parsed using recursive descent."
Which link is the story?
I've had enough abrasive sigs. Kittens are cute and fuzzy.
since the link for the parser is dated 2000 it's a bit confusing as to why this is news. However the 3.4 changelog (yes, 2 versions away for both features) meantions both.
http://gcc.gnu.org/gcc-3.4/changes.html
Terrence Parr, the author of the antlr compiler construction tool says that most languages can be parsed with LL-k grammars provided you have a deep enough look ahead (that means 'k' is big enough).
... however this is a guess(stemming from java parsers which are often LL-1)
Basicly: you aer NOT context free but context sensitive, of course.
Terrence showed that in practice the hughe drawbacks of a look ahead of k does not appear often.
I would think the typical look ahead for C++ is 1 in over 85% of the cases and 2 for the rest and in rare cases 4
angel'o'sphere
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
One of things sorely missing in C++ is an easy way to analyze source code. Would that new parser simplify this? Can I use it outside of gcc to implement a refactoring tool (for example).
Anyway, the links provide little to no information. Perhaps someone knows more.
Why, why, why, why? Why can't the header file simply be compiled at the first inclusion and cached somewhere? I know I am bitching about a single step here, but can anyone explain to me the rationale behind this?
Mac OS X talks uses precompiled headers, I thought GCC was already using them.
context free grammar
This is good for slashdot, which is a grammar-free context!
ummm... from the article:
This work will happen in early 2001.
am i missing something?
Reinard
Microsoft Visual C++ had precompiled headers at least ten years ago. Y'know... when DOS was all the rage. :)
Is this simply something that nobody cared about and finally got around to doing it - or was it a tricky feature to implement given that GCC already worked quite well enough without it?
(OT) Just wait until you see C++0x. It will (probably) support variable definitions like
and figure out a type for iter by looking at the result type from map<>::begin().From the gcc.gnu.org homepage news:
January 10, 2003
Geoffrey Keating of Apple Computer, Inc., with support from Red Hat, Inc., has contributed a precompiled header implementation that can dramatically speed up compilation of some projects.
December 27, 2002
Mark Mitchell of CodeSourcery has contributed a new, hand-crafted recursive-descent C++ parser sponsored by the Los Alamos National Laboratory. The new parser is more standard conforming and fixes many bugs (about 100 in our bug database alone) from the old YACC-derived parser.
... in my experience, good use of forward declarations (to avoid unrequired chains of #include), combined with simply putting less in each .c file is a lot more effective than adding the complication of precompiled headers into your build process.
Burn karma to make meta-karma?
Great... that will make using C++ templates and stuff a bit nicer...
(Of course, SML and O'Caml (and related languages) have had much more sophisticated type inference for 30 years!)
Just to clarify: A language does not need to be context-free in order to be parsed by a recursive descent parser, because you can augment the recursive functions with extra arguments that provide, well, context. For instance:
::= x | let [dec] in [exp] end | n | print [exp]
::= val x = [exp]
[exp]
[dec]
(where x is the set of variables and n is the set of integer constants)
This language is context-free, but the following restriction isn't: We say that strings are only in this language if variables aren't used before being declared. Legal:
let
val x = 3
in
print x
end
Illegal:
let
val x = 3
in
print y
end
This language isn't context-free (in the usual sense) but can be parsed easily by a recursive function. That function simply takes with it a list of all the declared variables. (In fact, you can pull this same sort of hack with lex/yac by having the lexer make a call into your code, which keeps a symbol table of variables it has seen as it runs.)
(If I understand the problem with C and C++ correctly, the difficulty parsing has to do with recognizing a token as a type name or an identifier, so I think this is relevant.)
A predictive bottom-up LR parser is more powerful than top-down LL. It's in the dragon book. Additionally, a top-down, recursive descent parser (using a stack or recursive calls) has huge, huge memory requirements (non-deterministic). SLR(1)/LALR(1)/LALR(k) parsers are usually table driven, and are essentially finite state machines (FSMs). Shift-Reduce parsers (such as those generated by lex/yacc/flex/bison) use a symbol stack that doesn't grow huge with one-or-more or zero-or-more conditionals, because it reduces. Bottom up parsers tend to push all the previous crap on to the stack, all the way up to the root node. Index operations are several orders of magnitude faster than procedure calls. Sounds like GCC is having a case of feature creep. I'll stick to 2.95 TYVM.
The biggest trick the devil pulled was letting lawyers become politicians so they can write the laws.
In terms of grammars alone, I believe that is somewhat correct but we're talking about a compiler here. LL parsing is often helpful because it can create and use inherited attributes. A top-down parser can perform some of the semantic work that an LR bottom-up parser cannot.
I also don't think that the recursive call stack should be of much concern because GCC will probably do something like that anyway (though maybe not as fine grained) in the next compiler pass. As said before, it might actually reduce the amount of work.
What do you mean by 'huge memory requirements (non-deterministic)'? Yes, an LL parser will maintain a deeper stack but the memory usage is by no means unbounded.
I don't know how you can classify changing GCC's internal structure as 'feature-creep'. LL grammars are usually considered easier to read/understand and despite what some wannabe-macho programmers may think, readability/clarity is good. It is also easier to write meaningful error messages because an LL parser kind of models the way we naturally think.
Now I'm not saying that LL parsers are better. The workings of LR grammars are extremely interesting (Knuth is a god). However, don't knock it for no reason at all (stick to 2.95? gimme a break). I'm sure the GCC guys know more than both of us about compiler design and have good reasons for their design decisions.
You don't need GCC to implement it, nor do you need to wait on C++0x. Every container declares appropriate nested typedefs. Using your example:
The "figure out a type for you" was done when some_map was declared.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
You don't really want that kind of thing done in the parser, because your refactorer (or whatever) would then have to handle the possibility of incorrect code. The parser is handling syntax (mostly), remember; semantic correctness is checked later.
You want GCC to parse the code, check the code, and then do something other than generate assembly. To some extent that's being done already (there are command-line options to dump various representations of the source, e.g., -fdump-translation-unit).
Also, the back-end (code generation) is seperate from the front end (language handling), so if you were to implement a back-end whose "assembly language" output was actually, say, XML, then you would have a C-to-XML, C++-to-XML, Java-to-XML, FORTRAN-to-XML, ObjectiveC-to-XML, and whatever-else-I'm-missing-to-XML converter. Dunno why you'd want such a thing, but you could probably build some kind of program database out of it (which is what IDEs use to do things like function name completion when typing code).
All of which is independent of the front-end parser.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
The version of GCC that ships with OS X has been heavily hacked on to make it work better for that operating system. One of the things Apple added was a simple implementation of PCH.
Other companies have done the same thing, many times. One of the customized versions of GCC that Cygnus did for a customer had a (simple) PCH in it, too.
The version being checked in to GCC is the result of all these different implementations coming together.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
Index operations are several orders of magnitude faster than procedure calls.
Actually a table based solution like bison is slower than implementing the parser directly in code (like recursive descent).
But giving up on LALR parsing is not necessary. You can use recursive ascent. See this article that claims a speedup of 2.5 to 6.5 compared to table based LALR(1).
It's in the dragon book
Don't believe everything you read in books. Especially old books.
I have never heard of the term recursive ascent.
Basically the work mentioned in that technical report says that if you convert a table driven LALR parser into a directly executable function then you can get a 2.5 to 6.5 speed up for the parser.
A little background: The tables in bison and yacc generated parsers encode information about what to do at some particular point in the parse.For example the tables could say state 20 should shift token 1 or 2 but if its token 4 then a reduce should take place. Basically the tables in a LALR parser are state machines.
What we showed was that if you replace the tables with executable code the resulting parser will be faster. And the amount of code that is generated is not too big. So the above example could be encoded as a switch statement with case labels for 1,2 and 4.
Of course there are lot more details that need to be taken care of, but those are mentioned in the paper.
I feel very safe concluding that a directly executable parser would always be faster (for any real world language) than a table based parser. I would similarly expect that the recursive descent parser added to gcc would be faster than the old table driven one.
One final note: In case anyone wants the source code, its not ready. I've been wanting to clean it up and release it to the world for the past 7 years now, but I've not yet managed to find the time!!
Is there any word about the inclusion of Objective-C++ support for GCC within the near future? Would this new parser allow for the easier or quicker inclusion of Objective-C++ support?
While considered ugly by some, Objective-C++ would allow for the easier creation of Objective-C bindings for C++ libraries under platforms supporting the GCC Objective-C compiler. I would sure look forward to Objective-C bindings to libraries such as wxWindows and FOX.
Precompiled headers aren't much use with C, but for any C++ program that uses iostreams or STL it is godsend.
I have no idea whether those libraries could be rewritten to have small headers, it would probably require a working version of the "export" keyword at least, but I don't know if that is enough.
All parsers are state machines in one way or another. Well, my point is that the advantage of LALR/SLR predictive parsers is that they don't have to waste time doing multiple token look-aheads { LL(1), LL(2)... LL(k) }. Certainly C is approximately LL(5). In LALR(1), it just shifts until it finds something that it can reduce. Additionally, error states can be introduced, and have associated actions, e.g., "missing semicolon", or whatever.
Wait a minute, function calls tend to incur penalties on almost every OS & platform combination. Unless of course, there's alot of inline code and globals, then you can probably cheat and not setup any stack frames (except pushing the return address).
But any program you can implement with recursive function calls, you can implement with one function and a stack. Your new task: implement a LALR(1) C parser w/ one function and a heap. A "real" solution, i.e. GCC, would probably involve a stack and FSM code generated by a tool that could do state minimization, the whole reason for using tables to being with.
The biggest trick the devil pulled was letting lawyers become politicians so they can write the laws.