Optimizing Stack Based Architectures?
An anonymous reader queries: "I'm currently writing a stack oriented interpreter (*ahem* Managed Code Environment) with complimentary compiler (that will be under MIT license), and was wondering if there have been any advances in stack architecture optimization? Some intense Googling turned up this paper, but it seems a bit dated, and focuses mainly on managing local variables, which is inapplicable to me because my interpreter directly supports local vars. Any thoughts or useful links on the topic would be appreciated."
A german company even developed a microprocessor specifically designed for running forth.
Unfortunately there was not much internet in those days and no HTML, but I think that is a good timespan to look at in your research.
Look up Forth in MIT's library archives, it is more likely to throw up stuff than Google.
And if you thought that was boring you obviously havn't read my Journal ;-)
Funny, but it's true, when the call is recursive. The tail call optimization causes a recursive function to execute in constant space, which is a pretty significant optimization.
You can even make this work if you've got a bunch of functions that call each other recursively with a tail call.
For more tips, I'd recommend reading the source code to another stack machine. Outside of some very hard to read dissertations on a dusty shelf, I doubt that the lore of stack machine optimization can be found anywhere else. This is just one of those cases where "did you Gooogle the thing?" just doesn't apply.
This is America, damnit. Speak Spanish!
I looked into stack machine optimization some time ago, and all the material seemed to fall into the category called "stack folding" which basicly amounts to combining loads with ALU ops rather than pushing onto the stack and then performing the op, just do the op as the data won't be used again on the stack. VM optimization as opposed to HW stack machine optimization is a slightly diferent ballgame as you don't have the direct bottleneck issues of the hardware. The benifits of stack machines are that they have one bottleneck and that can be optimized, but this also means that that bottleneck is always there you are really limited in pipelining as to keep the stack valid you need to wait for each operation to complete in sequence. The reason to VM a stack arch is the simplicity, but the stack becomes the bottleneck so it isn't the best solution for speed. I think that Stacks make a lot of sense for HW but I'm unsure if there is a benifit in VM SW, seems to me that a VM RISC arch would remove all the stack maintainence issues and all else being equal being that you are running your VM on a machine that probably has multiple registers that the code using the underlying arch would be faster. The VM wants to be the greatest common denominator of all target arch's which isn't going to be simple but without direct stack support on the uP you will have a fair amount of overhead.
You are asking the wrong question.
Don't get me wrong, stack machines are certainly the simplest and arguably the most elegant & compact format, and certainly nice to use if you final target is mostly JIT/native code.
If you are interpreting, your primary concern is _speed_. Speed in an interpreter on modern architectures (especially the P4!) is determined by almost constant branch mispredictions for every interpreted instruction (assuming a for(;;) switch(..) {..}; central interpreting loop, which sadly is still the fastest portable way of doing things). So your goal is to have the minimum amount of instructions generated for any particular language construct.
Generating code for a register machine can be just as easy as for a stack machine if you have the right infrastructure (assuming you don't work with a bounded set of registers, which you have no need to in an interpreter).
Register machines do much better here than stack machines, I would estimate about 1.5 times less instructions overal. Don't worry so much about the amount of work inside an instruction, aim to do as much as possible at once without branching as your VM design.
It may be even be worth it to integrate struct field and array indirection as part of your opcode, as once you switch to a register machine, "getfield" type instructions will become your most frequent ones. So having say an "add" instruction that can directly address struct fields for its 3 arguments is going to be a big win (i.e. compile a.b = c.d + e.f into a single instruction). By having a pointer in your stack frame that points to itself, you can even do this for single variables in the same instruction.
Once you start interpreting an OO-like or procedural language on this register-based machine you'll find that you have to create a stack anyway in order to save all the registers between function calls. Stack based machine or register based machine - in the end it's a complete wash.