Which Compiler to Extend for a Small Project?
Andreas(R) asks: "While planning the design of my small programming language, and would appreciate some lessons learned from experienced programmers which have already tried this. I was investigating whether to start from an existing compiler and extend it. The compiler will be based on yacc, or bison. The programming language will be interpreted, object oriented and have higher order programming. Perl 1 seems like a decent starting point, as it's yacc based, and 5000 lines of code. Later versions of Perl are too large to get a good understanding of the whole program in a short period of time. Perl also has the right license (GPL). Is Python out of the question for such a project, since it's not GPL? What other small languages can be used instead? How do I go about designing a small programming language in practice, using what I already know about compiler theory?"
While this seems like beaucoup fun, I'd question the need to extend an existing language by altering the compiler. Towards that end, you might want to use LISP or Scheme, as language extension is built into the language. ( See what Paul Graham has to say about the subject)
Look. The source code to Perl 1 is only 5000 lines long. The source code is open and available for anyone who wants to use it for research and investigation as well as for educational purposes. If we had to release all our source code because at one time in our lives we saw some source code that was covered by the GPL, none of us could ever get jobs programming.
The source is there. Use it as a base for your own program. Change it enough that you aren't blatantly copying it. Release it under your own license. If you like the GPL, then do that. Some of us like less restrictive licenses like BSD or the original (non-GPL compatible) Artistic License.
Definitely, go with Perl 1. From the look of it, it seems to have some pretty good foundations to build upon. Taking a look at where the language itself is now, there's much improvement needed in v1. It seems like a pretty good place to start.
Don't be restricted by licenses on small projects, much less ones that are essentially abandoned.
Also available under The Artistic License. You might like it over the GPL.
Perl might not be the best starting point if you want to learn something about desining a programming language. If you're thinking of using that, I'd say you're better off starting from scratch.
;)
Functional languages, especially those with pattern matching primitives (like ML or Haskell), are really good at this kind of program -- in fact, writing compilers and interpreters is really their shining point. I highly recommend using one of these languages rather than C. Lots of undergraduate computer science classes (like 15-212 and 15-312 at my university) write interpreters in functional languages as part of the curriculum. You could try to find some course notes... I speak from experience when I say that this beats the hell out of mucking around in C or C++, not really knowing what you're doing.
you might want to check out ANTLR.
Did you just say "I can only use GPL'd things. Python isn't GPL'd. Can I use it?" I'll assume you meant something like "I want something with a nice license. Does Python have a nice license?" instead. If that's what you meant, you should check out the Python license for yourself. Summary: it's a nice license. It's a certified Open Source license, imposes fewer restrictions than the GPL and is compatible with the GPL.
Python definitely doesn't fit into 5000 lines of source, though, so Perl1 might be a better bet. PyPy is pretty cool, if you're looking for something smallish and Pythonish.
Sorry if you're looking for something else and these comments turn out to be totally useless.
You know, i was wondering, earlier today, if anybody was ever going to make an object oriented Brainfuck. Object BF (of B F++ if you prefer) would be a delightfully horrible lanfuage. And, since it is such a compact language (though I make no promises about programs written in it) you should have no problem acquianting yourself fully with the existing compiler in about the length of a coffe break.
if you have never taken a compiler class before or written a compiler on your own, i suggest the following:
there are several "toy" grammars out there which allow you to do useful stuff (recursion, 'interesting' data structures [i.e. self referential], etc.) without wading through a lot of useless cruft (implementing huge amounts of runtime support, for example). i'd go with one of those. once you're comfortable that you can make one of these learning languages work, then try to hack one of your own.
this all said, good luck! i am by no means an expert on compiler construction (worked on a custom in-house scripting language as an intern a couple years back and had to take compiler classes to satisfy breadth requirements for my m.s. c.s.) but i do hope this is a little bit useful.
"Is Python out of the question for such a project, since it's not GPL?"
You seem to be concerned about the limitations the license puts on you. Python's license is less restrictive than the GPL - read about it (http://python.org/psf/license.html).
One of many things this means is that if you decide the Python License isn't restrictive enough for you, you can relicense the combination of Python plus your changes under the GPL, as long as you adhere to Python's license (leaving its copyright and other required information intact).
You don't really say what sort of problems you're talking about or why you want to build the language.
If you just want to build a language that will teach you about programming languages, look at old fashioned Pascal not Delphi or Kylix or even Turbo Pascal, but good old-fashioned Jensen and Wirth 1974 Pascal.
If you want to design a programming language, the best advice is to write some code in the proposed language. Remember Tony Hoare's rule, and keep it simple. Most programming languages, from Perl to Python to Java 5, suffer from being accumulations of features.
Have a look at Ruby, Modula-n, Oberon, and so forth.
If you're looking to learn lots about programming in general, think about things you want to do, and construct a lanaguage that does them. Icon is a nice example. Look at SNOBOL, if only because you'll appreciate the "five miles through the snow" stories we old farts tell.
Adapting something that is not really what you are making is just taking on a crutch. You get moving faster to start, but become dependant on something that slows you down.
If you have what it takes to write a new language, then you would be best starting from scratch. Read 3-5 codebases, make a list of the things you liked/didin't like and start out on you own. In the long run, having written the thing your self will give you the advantage. You will know intuitively how everyhting works and how extensions will fit in. That will give you a 2x advantage.
Don't be afraid to read over two other existing implementations as you go. Sharing ideas is very important.
An approach I have also taken is re-write a program in parts. You pick a major component, and replace it with what you need. This gives you testing check-points. The more often you get to a working state and test, the less time you will spend debugging overall. If you can look at perl 1 and determine you can add to it to make the parser a super-set of what you want you could start there. Then you can write something that interprets the byte-code output (the subset generated by your language) to what you want, and write your own interpreter. Then you can tackle replacing the parser and byte code generator... With flex and bison, that should be easy enough. But plan to replace the entire thing. Otherwise you will spend a lot of time reworking things that are not really what you want. If you discover a few gems along the way that you want to keep/port all the better, just don't take on any crutches.
Oh, I would recommed the STL if you are in a hurry. I don't know why functional is better, personally it just gave me a headache in school. But some people claim great results with it...
Finally, a real language must be able to compile itself. Or at least generate it's own byte-code in the case of interpreted langauges. Think about it! You could hack perl 1 to generate your byte code, and then write your parser in perl 1 and have a self compiling language.
-- http://thegirlorthecar.com funny dating game for guys
there are plenty of tiny implementations of either. if I were you I'd use a higher level parser/lexical analyzer generator than lex/flex/yacc/bison. also check out the 'Introduction to Compiler Construction With Unix' from prentice hall, sadly out of print, and 'Compilers' by Aho.
First, there are two kinds of small languages:
1. small languages like lua, io, and scheme that are small in the built-in libraries and in the total distro. These three are great places to start- both are small, OOPish, allow higher-order programming by passing classes, objects, functions and methods as objects.
2. Then there are languages that are big in some ways, but small in syntax. Some of these are easier to extend than so-called "little languages." The reason is usually that their syntax is small, in an isolated place, easy to get at, and meant to be modified. The two best examples for this are Smalltalk and Lisp. Both of these languages satisfy your other requirements and really kick ass for extention. Unlike the above languages, the so-called little-languages, most Smalltalk and Lisp dialects have big, useful libraries. Unlike a big fat language like perl or C++, having a useful library doesn't mean that the language is a huge pain in the ass to extend.
Both Lisp and Smalltalk have a number of implementations. I am a big fan of Squeak Smalltalk, though systems like Little Smalltalk or even GNU Smalltalk maybe worth checking out.
A lot of people here have bad feelings about Lisp-like languages. It's a shame, since Scheme, ISLISP (OpenLisp is a great implementation) and Common Lisp are all *very* powerful languages. You can be quite productive with them once you get over the part about whining about parens. But Lisp may very well be the best option here, there is a long history of people writing custom-syntaxes and language extensions. Look up Common Lisp macros- power almost beyond comprehension, a lot of fun to play with, and with an elegance all its own.
There are examples of people writing a C-like syntaxes for various Scheme implementations. IIRC, Gambit-C (a Scheme to C compiler) comes with one. On Cliki, there are a bunch of other alternative Scheme syntaxes listed.
To, one of the big advantages to using a language in the second category is that syntax extension/modification is done in the language itself, rather than in C. With that comes the familiarity of the language you're creating and the other benefits you gain by using a high-level language like Smalltalk or Common Lisp.
Just some thoughts...
Working toward a usable PDA environment in the spirit of Newton OS: Dynapad
As it happens, there already exists a class of languages that are strong at manipulating syntax trees, and at writing parsers. Several of them also support dynamic compilation, meaning that your language implementation can choose when to stop dicking with the syntax tree and instead compile it to native code.
One such language is Steel Bank Common Lisp. SBCL runs on a number of platforms, chiefly Linux. It is a native-code compiler -- contrary to the usual myth about Lisps, SBCL doesn't even contain a Lisp interpreter; it compiles all expressions to native machine code. If that doesn't suit, consider CMUCL, from which SBCL is derived; it includes both an interpreter and compiler.
Common Lisp is designed to be extended. You can customize the compiler by writing compiler-macros, which specify optimizations or special cases for compiled code. You can write your own parser, or extend the built-in syntax with read-macros -- so you can use a mix of built-in syntax and your own, or gradually move from the former to the latter while having a working code base at every step.
Reason One, trivial to write a parser for the core language.
Reason Two, lots of nice literature available online. Check out Shriram Krishnamurthhi's text
Programming Languages: Applications and Interpretation on his website at http://www.cs.brown.edu/~sk
Reason Three, its already conceptually an AST, so you can get involved in the more interesting work sooner
I'm not really sure what your goal here is. Are you wanting to write a compiler for the sheer joy of writing a compiler? That's a good goal, for sure, and I recommend that every programmer write a compiler or two during their careers, or at least some interpreters.
On the other hand, maybe you want to extend an existing language because you need some specific language feature. Also a good goal, but I do want to caution you to evaluate existing languages first; you may find that some language does what you want, or makes it easy to write a language library that does what you want.
Or maybe you want to write an interpreter to script a bigger program. Then I'd say that you may be better off using something that's already there.
In the first case, if you want to learn how to write a compiler, I generally recommend writing Scheme, or some other simple Lisp. Scheme has advantages in that its parse tree representation is obvious (that's why Lisp looks like it does), the structure of an interpreter and of a compiler are quite similar, and it covers the fundamentals of compilers without burdening you with a bunch of cruft. (If this is your first compiler, you may want to leave out continuations and garbage collection for now.) The very excellent book Structure and Interpretation of Computer Programming has what you need: chapter 4 is about writing interpreters (for Scheme, some modified Schemes, and Prolog), and chapter 5 is all about writing a compiler. All the code in SICP is in Scheme, but the book starts from the beginning with (+ 1 1).
SICP may be too academic for your taste, so you may prefer Paradigms of Artificial Intelligence Programming instead. It uses Common Lisp, and has a little bit more practical feel that SICP. Chapter 22 is about writing a Scheme interpreter, and chapter 23 is about writing a Scheme compiler. Unlike SICP, PAIP doesn't cover the garbage collector. PAIP uses Common Lisp, and although it has enough of an introduction to be a "refresher" for somebody already familiar with Lisp, it's not really suitable for learning the language.
It's really simple to write Lisp in Lisp. Indeed, a month ago I wrote one for a comp.lang.lisp post, just to get a silly quine to work! That's why I keep talking about Lisp and Scheme: it's easy to do.
If Scheme isn't your thing, then Pascal may be a good alternative. Don't try to get fancy with heap allocation, pointers, objects, and other new add-ons; I'd start with plain old Wirth-designed Pascal, and get fancy later if you want. Pascal is really designed for a classroom setting. I have a dim memory of Ada being used for this purpose in some books, but I can't speak very authoratively there.
Of course, the definitive book on writing compilers is the Dragon Book. But you may want to be familiar with some basic CS theory about FSMs first.
Now, that's if you want to learn about compilers for the joy of learning about compilers. But what if you just need one particular language feature for your problem, and that's why you want to write a compiler? Well, then I'd suggest you make sure you've looked at a lot of different languages first. Some languages have surprising features that may let you write a small in-language library to do what you need, instead of needing to extend the compiler.
Lisp is a good candidate for this. John Foderaro once described Lisp as "a programmable programming language". You can alter the language to suit the problem at hand, instead of having to work the other way around. At work now, I use Lisp as a bridge between two very different programming languages because I can extend it in both directions to cover what I nee
I highly recommend Lua. It's a brilliant language. The code base is clean, portable, and easy to read.
http://www.lua.org
In fact, I recommend you not create a new language at all and just use Lua. It's a deceptively simple language that allows you to extend it through certain meta-constructs to become pretty much anything you need-- from simple data description to full Object-Orientation.
I once read through the entire dragon book with the intention of creating my own language; I gave it all up when I found Lua.
Good stuff.
Use lisp. Lisp has been called the programmable programming language. Think XSLT for code. You can define your own mini languages that work your way but are themselves valid lisp and can use all the cool lisp stuff.
That said, interpreted languages are stupidly easy to do, so much so that it's hard to learn much -- they forgive every mistake until long after it's too late to fix it. (Witness Perl.) A high-performance language is a Good Thing not particularly because the programs run faster, although that's a nice side benefit, but rather because it enforces a fundamental rigor. There's no faking performance. When you face that kind of problem squarely, God speaks to you, and if you listen carefully enough you can learn deep truths.
But that's not for everybody. Most people have had any desire to learn anything, deep or otherwise, beaten out of them. By all means go for easy comfort, the economic vitality of the nation depends on people like you. Forget I said anything. Garbage collection, bytecodes, regexps, yeah!
Andreas(R):
While you haven't provided enough details to comment in length, I do have some experience with what you're planning. A couple of years back I started a programming system (XPS) which was rather audacious in scope. After two years of working on it, I realized that I too needed a "back end" compilation system. I looked at various alternatives like GCC (too complex), research compilers (low quality), open source virtual machines like Mono (immature at the time). I was quite surprised when I looked at UIUC's LLVM compiler toolkit. I thought it would be just another half-baked compiler system from a University that never got finished when the Ph.D student left. Instead, I found a well designed, working, *toolkit* for compiler construction. While LLVM still lacks some features, its core is very solid and easily extensible. I've been working with it for a year now and its been quite a pleasure. Check it out at http://llvm.cs.uiuc.edu/
Just don't do it. Really. Better minds have tried, mostly unsuccessfully. Yours will be a failure too. You're not smarter than McCarthy or Steele or Sussman or Dahl or Nigaard or Kay or...
Honestly, its easier to write a recursive descent parser by hand for a programming language than you think, and interpreters are ridiculously easy unless you're worried about making it fast, which is way overrated too. It mattered with 640KB of RAM at 20MHz, but these days, its just stupid to care unless you notice its insanely slow.
First off, if you've not found this link: http://compilers.iecc.com/crenshaw/, then I recommend you start with it. While its about writing a compiler, it really help make parsing much clearer.
Scheme is a good language to check out if you want to start with another design(a scheme interpreter can be written in a few hours, even in C, if you're slick, even if you're not, it would be short project to get 90%).
Some other reference material: Parsing Techniques(free online). Also: Modern Compiler Design by the same guys and well worth the investment. Concepts, Techniques, and Models of Programming Languages, teaches kernal theory of language design, and may open your mind to some other techniques you may not be aware of.
Checking out the archives on Lambda The Ultimate would be wise too. Also, if you're in Boston on December the 4th, you might check out the Lightweight Languages Workshop at MIT.
Forget about using a Virtual Machine. That is nice for speed, but it requires a lot of work. Beter make an interpretter that interprets the abstract program tree. I once started doing this for a JavaScript interpretter, but I never came to finish it.
Maybe you should have a look at Parrot? The CVS includes a lot of working and non-working language-implementations which you could have a look at..
I also recommend you to have a look at Lua which is a minimalistic yet beautiful language..
My other account has a 3-digit UID.
Choose something that requires some extra work to get compatible with some other language.
Don't even bother too much to write an initial compiler, and go straight to the horrible task of making something compatible.
While you are making an own language, dealing with compability issues will lead you to more pitfalls than a straight clean 1st order approach of some language than.
If you do as the parent says and "use it as a base" and "change enough so you aren't blatantly copying it" then you are still copying it and restricted in distrution by copyright law.
The GPL may grant you other privileges if you abide by its conditions. (Though AFAIK most PERL is licensed under the PERL Artisitic License which is more permissive than the GPL.)
Sam
blog.sam.liddicott.com
All Turing Complete languages (which BF is) are equivalent. That means that any OO Turing Complete language can be implemented in a non-OO Turing Complete language. Case in point: C++ compiles to assembly/machine code which is certainly not OO.
How to implement an OO language on a Turing Machine: Do something equivalent to how OO is implemented in C, ie. store "function pointers" (tape addresses) along with the data and use indirection to call those every time a virtual method is needed.
HAND.
Write a compiler for parrot.
Lots of people are doing that right now, and you could learn a lot from the included compilers for different languages, all in varying stages of completion.
Stumbling in the dark
I hear slavering of jaws
Eaten by a grue.
A cool site for people interested in languages and their design is: http://lambda-the-ultimate.org/
I agree completely that only looking for GPL code is a mistaken approach, as the GPL is more restrictive than other licenses. You can almost always include BSD code in GPL, but not vice versa. In any case, the GPL is probably not an ideal license if you want your language to be very widely used.
Thirdly, if you want to look at some nice source code and an interesting way of doing things, have a look at Tcl. It's C sources (modulo the regexp package which came from somewhere else) are some of the nicest I have read. Beautifully commented, and very clear to read. Also, Tcl is a smart approach for a language designed to be extensible.
The article's author really needs to state his purpose, though... just for fun? To 'get famous'? For work? As a homework assignment?;-)
http://www.welton.it/davidw/
Consider starting with a simple FORTH interpreter. If you can't live without OO, there are several OO flavours available for it.
Oh, I would recommed the STL if you are in a hurry. I don't know why functional is better, personally it just gave me a headache in school.
Among the many reasons that functional languages are superior for this kind of task is algebraic data types and pattern matching. Since C++ and Java don't even have sum types, the only way to simulate this is with a load of objects and "instanceof" stuff. For example, here's a simple calculator interpreter in ML:
datatype exp =
Int of int
| Plus of exp * exp
| Times of exp * exp
(* if, then, else *)
| Ifzero of exp * exp * exp
(* evaluates an expression, returning an int *)
fun eval (Int i) = i
| eval (Plus (e1, e2)) = eval e1 + eval e2
| eval (Times (e1, e2)) = eval e1 + eval e2
| eval (Ifzero (b, t, f)) = if eval b = 0
then eval t
else eval f
That's it! These sorts of recursive passes over syntax trees is exactly what goes on for the bulk of a compiler or interpreter. Just think about what a mess this would be with the STL.
http://jirc.hick.org/sleep/
Perl like language written in java. About 10K lines of code in its current version.
The original version I wrote in roughly a weekend (2 years ago) after taking a programming languages course with Lisp is about 3K lines of code:
http://jirc.hick.org/download/sleep040602.tar.gz
.. use the only REAL hardcore programming language remaining - machine code
you can do the whole thing in a couple of hundred lines of code.
Plus, you can apply some great optimizations like, tail recursion elimination. support for closures is just like support for objects.
great little language to cut your teeth.
If you know all about compiler theory, then apply it. Use the current best-kept secret in compiler research -- attribute grammars. Then, your language will be modular, extensible, and precise.
FORTH is trivial to implement (in a few hundred lines rather than a few thousand) and can be compiled or interpreted. It is interactive, the parser is completely minimal (all tokens are seperated by spaces with few exceptions) and the compiler/interpreter/system can be extremely compact. The code also runs relatively quickly. FORTH was fairly popular in the days of 8-bit micros and 16-bit minis for these reasons, and is still used in microcontrollers and workstation firmware.
Stick Men
... but AFAICT it doesn't really counter anything I said about TC language equivalence. Granted, the languages might actually be supersets of (classical) TMs, but that's nitpicking... they're still computationally equivalent since their only non-TM mechanism is I/O and they would both posses equivalent I/O mechanisms.
HAND.
... don't enter into it (to paraphrase a famous skit w/John Cleese). :)
The GP was suggesting that OO was somehow not implementable on a TM and I felt compelled to correct him/her/it.
IIRC, it has also been shown that algorithms always stay within their "class" (P, NP, P-space, etc.) when transformed from a TC language to another TC language. So your observation about complexity is only half true in that an O(n) algorithm may turn into an O(n^10), but it will never turn into an exponential-time algorithm when transformed. Of course, the difference is huge in practical terms, but at the same time the algorithm doesn't really become more 'difficult' from a computational perspective.
HAND.
Only people who understand old things very well should attempt to create new things. Otherwise we all end up with a mess of old bad things, and new worse things. The original poster clearly has no clue, as is readily apparent from his inane questions. Are we clear yet?
It mattered with 640KB of RAM at 20MHz
How much CPU power is in inexpensive handheld devices again? I program for one that has 384 KB of RAM at 16.8 MHz. Wouldn't overclocking a handheld device drain the battery faster?
Parrot's in-memory image is huge. The architecture is needlessly complex. Do yourself a favor and look at Lua or Mono instead. Parrot is a solution in search of a problem.
just start to code, abstract it, write code to write code and the language will evolve by itself
People suggesting Lisp/Scheme: Sure, these languages are extensible, but any extension of Lisp/Scheme you can create with macros or whatever will still look like Lisp/Scheme. If the point of this exercise is to design one's own language, one might not want to base the syntax on Lisp's.
People suggesting using various parser generating tools (yacc, bison, antlr,...): A parser is not a compiler. I guess it's good to use the right tool for each part of the job, but parsing is really the least of his problems.
If the point is designing a language for the fun of it, don't bother writing a compiler, write an interpreter. As others have pointed out, writing an interpreter in a sane high-level language like ML (or Scheme or, I suppose, even Java) is really dead easy if you're not worried about performance. It will probably distract you less from the language design than a compiled implementation would, and be easier to modify as you go along. Plus it will be portable, maybe. If your language becomes wildly popular (or if you feel that performance is the only thing standing between your language and wild popularity) you can write a better interepreter, or a JIT, or a compiler, later on when you're more sure of what you're doing.
If the point is writing a compiler for the fun of it, why not do it all yourself from scratch (except maybe for the parser, for which you can use any of the excellent tools recommended by slashdotters if you don't want to do it by hand)? Compilers aren't really that hard (if you know "compiler theory"), especially if you compile to assembly or C and use an assembler or C compiler to go the rest of the way. Compiling to Parrot assembly/bytecode or JVM or .NET bytecode is also a possibility.
Since you say your language has higher-order features, you will want to read up on implementation techniques for functional languages if you haven't already. For instance, check out the 90-minute Scheme-to-C compiler here.
Best of luck,
CJV
If its not assembler its not programming. enough said! Go create your scripting language elsewhere.
We will not have an all volunteer army!....We WILL have an all volunteer army!....Let me rephrase! - George W. Bush
"Does anyone feel a draft? And where is that high pitched whistling sound coming from?" - xxxxxx (upon hearing the previous quote)