Slashdot Mirror


GCC Gets PCH Support And New Parser

Screaming Lunatic writes "GCC will finally get precompiled header support which should help with faster compile times. GCC will also be fitted with a new recursive descent parser that fixes more than 100 bugs in GCC. I'm not sure how they decomposed C++ into a context free grammar so that it could be parsed using recursive descent."

20 of 83 comments

  1. the changelog for 3.4 does a better job by yugami · · Score: 3, Informative

    Since the link for the parser is dated 2000, it's a bit confusing as to why this is news. However, the 3.4 changelog (yes, two versions away for both features) mentions both.
    http://gcc.gnu.org/gcc-3.4/changes.html

  2. ANTLR? by angel'o'sphere · · Score: 5, Informative

    Terence Parr, the author of the ANTLR compiler construction tool, says that most languages can be parsed with LL(k) grammars provided you have a deep enough lookahead (that is, 'k' is big enough).

    Basically: you are NOT context free but context sensitive, of course.

    Terence showed that in practice the huge drawbacks of a lookahead of k do not appear often.

    I would think the typical lookahead for C++ is 1 in over 85% of the cases, 2 for the rest, and in rare cases 4 ... however, this is a guess (stemming from Java parsers, which are often LL(1)).

    angel'o'sphere

    --
    Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
    1. Re:ANTLR? by bhurt · · Score: 3, Insightful

      The problem is that C++ requires basically an infinite k to parse context free. For example, consider what the compiler does when it sees a set of tokens like:
      a < b, c > d;

      Now, is this:
      a) a strange but legal expression parsed like:
      (a < b), (c > d);
      or b) a declaration of a variable d of type a<b, c>?

      Both are legal interpretations. The only way you can tell is by looking at the type of a before parsing the rest of the expression. And even this may not help. IIRC, the issue may be undecidable, as template names are in a different "namespace" than variable names (i.e. I can have both a template type and a variable with the same name at the same time).

      Note that d being a more complex expression than just a name also allows you to decide. But we're up to a k=7 on the simple case. Make b and c arbitrarily complex expressions themselves, and you can make k need to be arbitrarily large. Or heck, consider:
      a < b, c > d, e, f;
      (three new variables d, e, and f, or four different expressions combined with the comma operator?).

      And it doesn't matter how infrequent this code construct is. The compiler has to deal with this situation, and situations like it. And deal correctly. C++ simply is not LALR(1) parsable, and likely not really parsable at all. It's a fundamental flaw in the language.

      And this does cost. YACC was developed because it's easier to write and maintain parsers in YACC than hand-rolled parsers for any nontrivial but LALR(1) parsable language. Time and effort will go to the development and maintenance of the compilers. For commercial products, this means higher prices. For free products like GCC, this means fewer new features and fewer bugfixes in other parts of the code.

      Brian
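
A minimal sketch of the ambiguity discussed above, assuming the statement in question has the shape `a < b, c > d;`. All names here are illustrative. The same token sequence compiles as a declaration when `a` names a class template and as a comma expression when `a` names a variable, so the parser needs name lookup, not just more lookahead, to choose between the two readings.

```cpp
// Reading 1: `a` is a class template, so the statement declares `d`.
namespace as_declaration {
    template <int X, int Y>
    struct a {
        int sum() const { return X + Y; }
    };
    const int b = 1, c = 2;   // constant expressions usable as template arguments

    inline int demo() {
        a < b, c > d;         // declaration: variable d of type a<1, 2>
        return d.sum();
    }
}

// Reading 2: `a` is a variable, so the statement is a comma expression.
namespace as_expression {
    inline int demo() {
        int a = 1, b = 2, c = 4, d = 3;
        a < b, c > d;         // expression: (a < b), (c > d); the result is discarded
        return (a < b, c > d) ? 1 : 0;   // value of the right operand, c > d
    }
}
```

Calling `as_declaration::demo()` exercises the declaration reading and `as_expression::demo()` the expression reading; a compiler may warn that the bare expression statement has no effect.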

    2. Re:ANTLR? by Pseudonym · · Score: 4, Informative

      The quote from Terence Parr is misleading.

      While most languages are LL(k) for some k, most grammars are not, and require some massaging to get into a form which LL parsers will accept. The massaged version is invariably not the "most preferred" way to specify the language. In many cases, a compiler compiler (such as ANTLR) can do the massaging automatically, but this is often not the case.

      Both ISO and ARM C++'s grammars, in particular, are inherently ambiguous and require semantic disambiguation no matter what you do.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
  3. Why the extra step? by glenstar · · Score: 3, Interesting
    To create a precompiled header file, simply compile it as you would any other file, if necessary using the -x option to make the driver treat it as a C or C++ header file. You will probably want to use a tool like make to keep the precompiled header up-to-date when the headers it contains change.

    Why, why, why, why? Why can't the header file simply be compiled at the first inclusion and cached somewhere? I know I am bitching about a single step here, but can anyone explain to me the rationale behind this?

    1. Re:Why the extra step? by PD · · Score: 5, Informative

      Because that sort of thing can get screwed up easily and cause all sorts of problems. I'm thinking of how Borland's precompiled headers sometimes goofed up, or my horrible experiences with Sun's cached templates on their C++ compiler. I'd rather explicitly tell the compiler exactly what I want done in terms of precompilation than to let it guess and screw up on its own.

    2. Re:Why the extra step? by PD · · Score: 5, Insightful

      No, object files are built one at a time with a makefile. The correlation would be if we just typed "gcc" in a directory and the compiler built every cpp file into an object. What if I didn't want a file compiled? What if that file was supposed to be copied into a directory after it was built with another tool? In that case, gcc would be doing the wrong thing by building every .o automatically.

      A makefile lets me control the building of each and every .o file myself, allowing for all sorts of things that I might want to do.

      Precompiled headers should work the same way, or they won't be as flexible as the .h files.

    3. Re:Why the extra step? by j7953 · · Score: 5, Interesting
      Why, why, why, why? Why can't the header file simply be compiled at the first inclusion and cached somewhere?

      But that's just what make will do. Why rebuild the same functionality within a different tool? Basically, the reason is (probably, I'm not a GCC developer) the UNIX philosophy of having small tools doing their job. GCC is a compiler and nothing else, make is a tool that decides what needs to be compiled.

      If you want automation, you can always use an IDE (or some other tool) that includes a make equivalent or that creates appropriate makefiles for you.

      --
      Sig (appended to the end of comments I post, 54 chars)
  4. what a coincidence by Anonymous Coward · · Score: 5, Funny

    context free grammar

    This is good for slashdot, which is a grammar-free context!

  5. Standard C++ Easier by Euphonious+Coward · · Score: 5, Interesting
    ISO Standard 14882 C++ is easier to parse than ARM C++. The biggest difference is that the committee eliminated "implicit int" declarations, which eliminated a lot of ambiguities. Requiring typename in templates helped too.

    (OT) Just wait until you see C++0x. It will (probably) support variable definitions like

    auto iter = some_map.begin();
    and figure out a type for iter by looking at the result type from map<>::begin().
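
For reference, `auto` was standardized with this meaning in C++11. The following is a minimal sketch, assuming a hypothetical `Map` alias and a non-empty map. It also shows that the deduced type follows whichever `begin()` overload is selected for the object: `iterator` for a non-const map, `const_iterator` for a const one.

```cpp
#include <map>
#include <string>
#include <type_traits>

using Map = std::map<std::string, int>;   // hypothetical map type for illustration

// Non-const map: the non-const begin() is chosen, so iter is Map::iterator.
// Precondition: some_map is not empty.
inline int first_value(Map& some_map) {
    auto iter = some_map.begin();
    static_assert(std::is_same<decltype(iter), Map::iterator>::value,
                  "auto deduces the return type of the non-const overload");
    return iter->second;
}

// Const map: the const begin() is chosen, so iter is Map::const_iterator.
// Precondition: some_map is not empty.
inline int first_value_const(const Map& some_map) {
    auto iter = some_map.begin();
    static_assert(std::is_same<decltype(iter), Map::const_iterator>::value,
                  "auto deduces the return type of the const overload");
    return iter->second;
}
```

The `static_assert` lines are checked at compile time, so the file only builds if the deduced types are the ones stated in the comments.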
    1. Re:Standard C++ Easier by Anonymous+Brave+Guy · · Score: 3, Informative

      You can resolve the ambiguity using the same rules that decide which version you'll be calling (which you have to work out anyway, of course) and take the return type of that version. As long as you've got context -- which you have in the example case -- it's not a problem. In general, you'd need to specify which overload you wanted. It's kind of like explicit template instantiation (and equally horrible if you have to do it).

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  6. Because implementations just got contributed by sixseve · · Score: 5, Informative

    From the gcc.gnu.org homepage news:

    January 10, 2003

    Geoffrey Keating of Apple Computer, Inc., with support from Red Hat, Inc., has contributed a precompiled header implementation that can dramatically speed up compilation of some projects.

    December 27, 2002

    Mark Mitchell of CodeSourcery has contributed a new, hand-crafted recursive-descent C++ parser sponsored by the Los Alamos National Laboratory. The new parser is more standard conforming and fixes many bugs (about 100 in our bug database alone) from the old YACC-derived parser.

  7. Well done GCC, but.... by peterpi · · Score: 4, Insightful

    ... in my experience, good use of forward declarations (to avoid unrequired chains of #include), combined with simply putting less in each .c file is a lot more effective than adding the complication of precompiled headers into your build process.
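
A minimal sketch of the forward-declaration technique mentioned above, collapsed into a single file so that it compiles on its own. The `Engine` and `Logger` classes and the file names in the comments are hypothetical; the point is only that a header which stores a pointer to a class does not need to include that class's header.

```cpp
#include <string>
#include <vector>

// ---- would live in engine.h ----
class Logger;                       // forward declaration instead of #include "logger.h"

class Engine {
public:
    explicit Engine(Logger* logger) : logger_(logger) {}
    void start();                   // defined in engine.cpp, where Logger is complete
private:
    Logger* logger_;                // a pointer to an incomplete type is allowed
};

// ---- would live in logger.h ----
class Logger {
public:
    void write(const std::string& msg) { lines_.push_back(msg); }
    const std::vector<std::string>& lines() const { return lines_; }
private:
    std::vector<std::string> lines_;
};

// ---- would live in engine.cpp, the only place that includes logger.h ----
void Engine::start() {
    logger_->write("engine started");
}
```

With this layout, a change to `logger.h` forces a recompile of `engine.cpp` but not of every file that merely includes `engine.h`.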

    1. Re:Well done GCC, but.... by sohp · · Score: 4, Insightful

      That's exactly the sort of rule of thumb that John S. Lakos talks about in his terrific book, Large-Scale C++ Software Design (ISBN: 0201633620). Basically, pre-compiled headers are for developers who are too lazy or inexperienced to manage inter-module dependencies efficiently.

    2. Re:Well done GCC, but.... by m8pple · · Score: 3, Interesting
      Whilst that's true to a certain extent, pre-compiled headers still definitely have their use just for cutting down the time taken for the compiler to find, load and parse all the system headers, crt headers, stl headers, boost headers etc.

      You could argue that including only the very specific headers you need for each source file is the best way to go, but I think it is a reasonable trade-off to include all these static system/library headers in a precompiled header, then to re-reference the specific headers in the user source code to indicate the dependency explicitly. I totally agree with the stuff in the book about people binding modules together too tightly through inter-dependencies in user headers though (although I'm not convinced by everything he talks about :)

      I usually chuck almost all the stl and a lot of boost into my pre-compiled headers when setting up a build, which cuts the full rebuild time at least in half, usually more. I'm always reluctant to do a full rebuild of the same module under g++ as I know it'll take a long time (comparatively). Admittedly the larger each source file is, the less the benefit, but I tend towards lots of small to medium sized source files anyway.

      I'm not sure how the gcc version will do it, but the msvc (boo, hiss etc.) version actually takes a memory dump of the parsed code tree when pre-compiling, then just copies this back into memory for successive compilations, so the speed increase is dramatic. Hopefully the gcc version does something similar (or better). Be nice if it had something similar to ccache built in as well.

    3. Re:Well done GCC, but.... by Lumpish+Scholar · · Score: 5, Interesting
      ... in my experience, good use of forward declarations (to avoid unrequired chains of #include), combined with simply putting less in each .c file is a lot more effective than adding the complication of precompiled headers into your build process.
      My experience is just the opposite.

      Putting less into each .c file (so that changing a .c file requires less to be recompiled) is only useful if most of the code you need to compile is in .c files. Unfortunately, even with forward declarations, every .c file is likely to have thousands (or tens of thousands) of lines of source from all the .h files that are (recursively) included; that's where the bulk of the compiled code is. Unless each of the smaller .c files can include significantly fewer .h files than the larger .c files could (which, in my experience, they can't), then doubling the number of .c files roughly doubles the amount of source code (.c files plus all the .h files per .c file) needed to compile a product.

      I haven't had a lot of luck with precompiled headers, either. (Context: a project with a hundred source files spread across a dozen directories, totalling about fifty thousand lines of source.)

      Best solution I know of for C++: Use as many forward declarations as you can, periodically trim your include directives, and have relatively large .c files. Each includes a lot of .h source, but this reduces the total bulk of what comes out of the preprocessor.

      I know of C++ systems that take a CPU week to build because of these issues!

      Note that Java doesn't have this problem, or the problem of teaching your makefile about header file dependencies. (Not important enough to get all projects to switch from C or C++, but among the reasons that some projects should.)
      --
      Stupid job ads, weird spam, occasional insight at
  8. Re:gcc already has precompiled headers? by Harik · · Score: 3, Informative
    Anonymous Coward Moaned
    Mac OS X uses precompiled headers; I thought GCC was already using them.
    ... without reading the article in question.
    Geoffrey Keating of Apple Computer, Inc., with support from Red Hat, Inc., has contributed a precompiled header implementation that can dramatically speed up compilation of some projects.
  9. Recursive Descent / Context Freeness by Tom7 · · Score: 5, Informative

    Just to clarify: A language does not need to be context-free in order to be parsed by a recursive descent parser, because you can augment the recursive functions with extra arguments that provide, well, context. For instance:

    [exp] ::= x | let [dec] in [exp] end | n | print [exp]

    [dec] ::= val x = [exp]

    (where x is the set of variables and n is the set of integer constants)

    This language is context-free, but the following restriction isn't: We say that strings are only in this language if variables aren't used before being declared. Legal:

    let
    val x = 3
    in
    print x
    end

    Illegal:

    let
    val x = 3
    in
    print y
    end

    This language isn't context-free (in the usual sense) but can be parsed easily by a recursive function. That function simply takes with it a list of all the declared variables. (In fact, you can pull this same sort of hack with lex/yacc by having the lexer make a call into your code, which keeps a symbol table of variables it has seen as it runs.)

    (If I understand the problem with C and C++ correctly, the difficulty parsing has to do with recognizing a token as a type name or an identifier, so I think this is relevant.)
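
The idea above can be sketched as a small recursive-descent recognizer. This is an illustration under stated assumptions, not how GCC does it: tokens are separated by whitespace, and the names `accepts`, `parse_exp`, `parse_dec` and `Scope` are made up for the example. The set of declared variables is passed down as an extra argument, which is the "context".

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;
using Scope  = std::set<std::string>;    // variables declared so far: the "context"

bool is_number(const std::string& t) {
    if (t.empty()) return false;
    for (unsigned char ch : t) if (!std::isdigit(ch)) return false;
    return true;
}

bool is_identifier(const std::string& t) {
    if (t == "let" || t == "val" || t == "in" || t == "end" || t == "print") return false;
    if (t.empty() || !std::isalpha(static_cast<unsigned char>(t[0]))) return false;
    for (unsigned char ch : t) if (!std::isalnum(ch)) return false;
    return true;
}

// Consume the token `want` if it is next; otherwise leave pos unchanged.
bool expect(const Tokens& ts, std::size_t& pos, const std::string& want) {
    if (pos < ts.size() && ts[pos] == want) { ++pos; return true; }
    return false;
}

bool parse_exp(const Tokens& ts, std::size_t& pos, Scope scope);

// [dec] ::= val x = [exp]    On success, x is added to the caller's scope.
bool parse_dec(const Tokens& ts, std::size_t& pos, Scope& scope) {
    if (!expect(ts, pos, "val")) return false;
    if (pos >= ts.size() || !is_identifier(ts[pos])) return false;
    const std::string name = ts[pos++];
    if (!expect(ts, pos, "=")) return false;
    if (!parse_exp(ts, pos, scope)) return false;   // x is not visible in its own initializer
    scope.insert(name);
    return true;
}

// [exp] ::= x | let [dec] in [exp] end | n | print [exp]
// scope is taken by value, so declarations made inside a let vanish after its end.
bool parse_exp(const Tokens& ts, std::size_t& pos, Scope scope) {
    if (pos >= ts.size()) return false;
    const std::string& t = ts[pos];
    if (t == "let") {
        ++pos;
        return parse_dec(ts, pos, scope) && expect(ts, pos, "in") &&
               parse_exp(ts, pos, scope) && expect(ts, pos, "end");
    }
    if (t == "print") { ++pos; return parse_exp(ts, pos, scope); }
    if (is_number(t)) { ++pos; return true; }
    if (is_identifier(t) && scope.count(t) > 0) { ++pos; return true; }  // the context check
    return false;
}

// True if src is a complete expression that uses only declared variables.
bool accepts(const std::string& src) {
    std::istringstream in(src);
    Tokens ts;
    for (std::string tok; in >> tok; ) ts.push_back(tok);
    std::size_t pos = 0;
    return parse_exp(ts, pos, Scope{}) && pos == ts.size();
}
```

With this sketch, `accepts("let val x = 3 in print x end")` is true and `accepts("let val x = 3 in print y end")` is false, matching the legal and illegal examples above.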

  10. Re:LL(k)? I thought LALR(1) was "better." by cakoose · · Score: 3, Informative
    A predictive bottom-up LR parser is more powerful than top-down LL.

    In terms of grammars alone, I believe that is somewhat correct but we're talking about a compiler here. LL parsing is often helpful because it can create and use inherited attributes. A top-down parser can perform some of the semantic work that an LR bottom-up parser cannot.

    I also don't think that the recursive call stack should be of much concern because GCC will probably do something like that anyway (though maybe not as fine grained) in the next compiler pass. As said before, it might actually reduce the amount of work.

    What do you mean by 'huge memory requirements (non-deterministic)'? Yes, an LL parser will maintain a deeper stack but the memory usage is by no means unbounded.

    I don't know how you can classify changing GCC's internal structure as 'feature-creep'. LL grammars are usually considered easier to read/understand and despite what some wannabe-macho programmers may think, readability/clarity is good. It is also easier to write meaningful error messages because an LL parser kind of models the way we naturally think.

    Now I'm not saying that LL parsers are better. The workings of LR grammars are extremely interesting (Knuth is a god). However, don't knock it for no reason at all (stick to 2.95? gimme a break). I'm sure the GCC guys know more than both of us about compiler design and have good reasons for their design decisions.

  11. Yes, sort of, with some help by devphil · · Score: 3, Informative


    You don't really want that kind of thing done in the parser, because your refactorer (or whatever) would then have to handle the possibility of incorrect code. The parser is handling syntax (mostly), remember; semantic correctness is checked later.

    You want GCC to parse the code, check the code, and then do something other than generate assembly. To some extent that's being done already (there are command-line options to dump various representations of the source, e.g., -fdump-translation-unit).

    Also, the back-end (code generation) is separate from the front end (language handling), so if you were to implement a back-end whose "assembly language" output was actually, say, XML, then you would have a C-to-XML, C++-to-XML, Java-to-XML, FORTRAN-to-XML, ObjectiveC-to-XML, and whatever-else-I'm-missing-to-XML converter. Dunno why you'd want such a thing, but you could probably build some kind of program database out of it (which is what IDEs use to do things like function name completion when typing code).

    All of which is independent of the front-end parser.

    --
    You cannot apply a technological solution to a sociological problem. (Edwards' Law)