Optimizing Stack Based Architectures?

← Back to Stories (view on slashdot.org)

Optimizing Stack Based Architectures?

Posted by Cliff on Wednesday May 12, 2004 @03:22AM from the saving-bytes-and-increasing-speed dept.

An anonymous reader queries: "I'm currently writing a stack oriented interpreter (*ahem* Managed Code Environment) with complimentary compiler (that will be under MIT license), and was wondering if there have been any advances in stack architecture optimization? Some intense Googling turned up this paper, but it seems a bit dated, and focuses mainly on managing local variables, which is inapplicable to me because my interpreter directly supports local vars. Any thoughts or useful links on the topic would be appreciated."

31 comments

Min score:

Reason:

Sort:

Hi. by Anonymous Coward · 2004-05-12 03:31 · Score: 3, Informative

Tail-call optimization turns a call followed by an exit into a jump.
1. Re:Hi. by Uma+Thurman · 2004-05-12 03:41 · Score: 5, Interesting
  
  Funny, but it's true, when the call is recursive. The tail call optimization causes a recursive function to execute in constant space, which is a pretty significant optimization.
  
  You can even make this work if you've got a bunch of functions that call each other recursively with a tail call.
  
  For more tips, I'd recommend reading the source code to another stack machine. Outside of some very hard to read dissertations on a dusty shelf, I doubt that the lore of stack machine optimization can be found anywhere else. This is just one of those cases where "did you Gooogle the thing?" just doesn't apply.
  
  --
  This is America, damnit. Speak Spanish!
2. Re:Hi. by beerman2k · 2004-05-12 04:42 · Score: 1
  
  Umm... it also works when the call isn't recursive.
3. Re:Hi. by RAMMS+EIN · 2004-05-12 06:48 · Score: 1
  
  True, but recursion is the typical case in which you win a lot by it.
  
  --
  Please correct me if I got my facts wrong.
4. Re:Hi. by Daleks · 2004-05-12 12:49 · Score: 3, Informative
  
  Funny, but it's true, when the call is recursive. The tail call optimization causes a recursive function to execute in constant space, which is a pretty significant optimization.
  
  The algorithm will only operate in constant space correctly if the recursive call doesn't pass a variable by value on the stack that indirectly references another value currently on the stack. Meaning, don't pass a pointer to a stack allocated variable to a tail-call optimized recursive function that will overwrite the current functions stack activation frame. This (and pointer aliasing) are why C can't as a general rule tail-call optimize. A language such as Java can do it, but tail-call optimization sets the stack pointer to the original stack pointer of the preceeding call. Thus giving the current function to the last function's stack address space, which can be a security problem.
5. Re:Hi. by ralejs · 2004-05-14 03:44 · Score: 1
  
  A language such as Java can do it, but tail-call optimization sets the stack pointer to the original stack pointer of the preceeding call. Thus giving the current function to the last function's stack address space, which can be a security problem.
  
  Indeed it is problematic to mix stack inspection as in Java with tail recursion (Here I'm refering to languages like Scheme which promise to give you tail recursion.) But they can actually work together. See the paper "A Tail-Recursive Machine with Stack Inspection" on the following page:
  http://www.ccs.neu.edu/scheme/pubs/cf-toplas04.pdf
GNU Vmgen by Carl · 2004-05-12 03:31 · Score: 4, Informative

If you haven't looked at it then I would recommend GNU Vmgen, a virtual machine interpreter generator. http://savannah.gnu.org/projects/vmgen/

It comes with a nice manual that is an interesting read even if you are writing your interpreter by hand: http://www.complang.tuwien.ac.at/anton/vmgen/html- docs/
Forth archives by MrIrwin · 2004-05-12 03:33 · Score: 3, Interesting

The document may be dated but back in the early 80's a lot of versions of Forth were produced.
A german company even developed a microprocessor specifically designed for running forth.
Unfortunately there was not much internet in those days and no HTML, but I think that is a good timespan to look at in your research.
Look up Forth in MIT's library archives, it is more likely to throw up stuff than Google.

--
And if you thought that was boring you obviously havn't read my Journal ;-)
citeseer by blueAcid · 2004-05-12 04:11 · Score: 2, Informative

I have no first hand knowledge in the subject, but perhaps this link to citeseer will help.
-blue
Complimentary Compiler by mithras+the+prophet · 2004-05-12 04:41 · Score: 5, Funny

A complimentary compiler, huh? I agree, after a few frustrating hours with GCC and its nasty little invectives, that would be a great technology to develop...
[mithras ~/src] alias cc="complimentary_compiler" [mithras ~/src] cc myprog.c myprog.c:7: In function `main': myprog.c:7: compliment: Wow, great job with those open and close braces! myprog.c:7: compliment: They really look wonderful, perfectly indented. myprog.c:83: In function `calculate_stuff': myprog.c:83: compliment: Ingenious coding here. myprog.c:83: compliment: Your use of the unused high bits of this array is pretty sweet, nice work. myprog.c:125: In function `foo': myprog.c:125: error: This line looks like it will be very powerful, myprog.c:125: error: but I can't quite figure it out. Perhaps you forgot a semicolon? myprog.c:125: error: Don't worry about it, I do that all the time. myprog.c:125: error: You've already been so productive today, you deserve a break!

--
four nine eighteen twenty-7 thirty-nine forty-7 fiftyeight sixty-nine seventy-9 eighty-8 one-hundred-and-nine one-twenty
Stack Folding by pagercam2 · 2004-05-12 04:52 · Score: 3, Interesting

I looked into stack machine optimization some time ago, and all the material seemed to fall into the category called "stack folding" which basicly amounts to combining loads with ALU ops rather than pushing onto the stack and then performing the op, just do the op as the data won't be used again on the stack. VM optimization as opposed to HW stack machine optimization is a slightly diferent ballgame as you don't have the direct bottleneck issues of the hardware. The benifits of stack machines are that they have one bottleneck and that can be optimized, but this also means that that bottleneck is always there you are really limited in pipelining as to keep the stack valid you need to wait for each operation to complete in sequence. The reason to VM a stack arch is the simplicity, but the stack becomes the bottleneck so it isn't the best solution for speed. I think that Stacks make a lot of sense for HW but I'm unsure if there is a benifit in VM SW, seems to me that a VM RISC arch would remove all the stack maintainence issues and all else being equal being that you are running your VM on a machine that probably has multiple registers that the code using the underlying arch would be faster. The VM wants to be the greatest common denominator of all target arch's which isn't going to be simple but without direct stack support on the uP you will have a fair amount of overhead.
1. Re:Stack Folding by AnimalCoward · 2004-05-14 02:32 · Score: 1
  
  Dude, can you get rid of the run-on sentences? You use commas like periods, which makes your thought process very hard to follow.
use a register machine by Aardappel · 2004-05-12 07:02 · Score: 5, Interesting

You are asking the wrong question.

Don't get me wrong, stack machines are certainly the simplest and arguably the most elegant & compact format, and certainly nice to use if you final target is mostly JIT/native code.

If you are interpreting, your primary concern is _speed_. Speed in an interpreter on modern architectures (especially the P4!) is determined by almost constant branch mispredictions for every interpreted instruction (assuming a for(;;) switch(..) {..}; central interpreting loop, which sadly is still the fastest portable way of doing things). So your goal is to have the minimum amount of instructions generated for any particular language construct.

Generating code for a register machine can be just as easy as for a stack machine if you have the right infrastructure (assuming you don't work with a bounded set of registers, which you have no need to in an interpreter).

Register machines do much better here than stack machines, I would estimate about 1.5 times less instructions overal. Don't worry so much about the amount of work inside an instruction, aim to do as much as possible at once without branching as your VM design.

It may be even be worth it to integrate struct field and array indirection as part of your opcode, as once you switch to a register machine, "getfield" type instructions will become your most frequent ones. So having say an "add" instruction that can directly address struct fields for its 3 arguments is going to be a big win (i.e. compile a.b = c.d + e.f into a single instruction). By having a pointer in your stack frame that points to itself, you can even do this for single variables in the same instruction.
Barking up the wrong tree by Anonymous Coward · 2004-05-12 08:33 · Score: 1, Informative

You need to read through comp.compilers.
It's not just about the VM by ltratt · 2004-05-12 10:07 · Score: 1, Informative

In the paper referenced, the claim is that 1 in 4 instructions in a Forth VM is a DUP. That instantly makes me think that either the instruction set is horribly designed, or the compiler was generating massively inefficient code. Possibly both. The possible lesson to be learned from this is that, regardless of whether you have a stack or register based VM, other design decisions are likely to have at least as great an impact on performance.

There's also the possibility to use a hybrid machine that's mostly stack based, but if - for some reason - you find yourself DUP'ing or SWAP'ing a lot, having a few registers knocking around may solve that problem very nicely, without turning the whole thing into a register based machine. i.e. you can bolt it on in a mostly backwards compatible, low effort, fashion if you find you have the need.
1. Re:It's not just about the VM by mparker762 · 2004-05-12 11:28 · Score: 2, Informative
  
  Not necessarily -- MOV's are extremely common in code generated for register machines -- data movement is simply a very common activity.
2. Re:It's not just about the VM by ltratt · 2004-05-13 05:42 · Score: 2, Insightful
  
  MOV's are typically very cheap, stack operations typically aren't. At the risk of generalizing, I'll assert that generally stack based VMs are built with a particular target language in mind, and that register machines aren't. With a register machine you're more likely to be playing by someone elses rules.
  
  IME if you develop and analyse your instruction set carefully, you can often reduce the amount of 'redundant' instructions necessary in a stack based solution. Let me give you a trivial example. In something I'm working on, I removed a couple of pages of instructions (huge amounts of DUP'ing and SWAP'ing amongst them) that were generated to analyse and assign a functions arguments in a dynamic language, and replaced them with a single instruction. Yes, I've bloated the instruction set by doing so. Yes, I increased performance an awful lot, and also made the compiler a *lot* easier to understand at the expense of a very small addition to the VM. Because of the frequency that particular op is called, and the saving involved, that sort of design choice is often where you can make the useful big savings.
searching papers: portal.acm.org by hubertf · 2004-05-12 11:32 · Score: 2, Insightful

Try giving your query at http://portal.acm.org/, they return quite a bunch of articles, dunno how many of them are relevant. Download of article text may cost, though...

- Hubert
Et tu Brute? by Anonymous Coward · 2004-05-12 12:00 · Score: 0

Hahaha. Guess what? Yet another AC is writing a bytecode interpreter/compiler. Wheee.... I feel so unique!

Something tells me I might know you, and I think you're just trying to get my secrets. I'm not going to tell you that I used a ton of inlining, tail call optimization and an opcode that combines load/operator/store to improve the speed of my interpreter by about 10x. Oh wait -- you tricked me! Drat! Well you're not going to get me to tell about how I'm getting ready to do JIT compilation to improve that by another 10x. Aaaaaaargh. Not again. *sigh* Outwitted by an anonymous coward. I feel so ashamed. *sob*

p.s. If I submitted my interpreter to the language shootout page today, it would debut at #10 -- without a JIT compiler (see list below). My goal is to debut at #5-6 with a JIT compiler. Wanna race? :)

1. C (gcc) [compiled]
2. Ocaml (ocaml) [compiled]
3. SML (mlton) [compiled]
4. C++ (g++) [compiled]
5. SML (smlnj) [compiled]
6. Common Lisp (cmucl) [compiled]
7. Scheme (bigloo) [compiled]
8. Ocaml (ocamlb) [interpreted]
9. Java (java) [compiled]
10. Pike (pike) [interpreted]
11. Forth (gforth) [interpreted]
12. Lua (lua) [interpreted]
13. Python (python) [interpreted]
14. Perl (perl) [interpreted]
15. Ruby (ruby) [interpreted]
1. Re:Et tu Brute? by beef3k · 2004-05-13 05:35 · Score: 2, Informative
  
  You seem to have left out that Python has JIT these days...
2. Re:Et tu Brute? by k4_pacific · 2004-05-13 13:46 · Score: 1
  
  I just created a clone combining the DNA of Ken Thompson and Julius Caesar. It's first words were:
  
  When in doubt, et tu Brute force!
  
  wakka wakka wakka! Is this thing on? Yes, I'll be here all week ladies and gentlemen!
  
  --
  Unknown host pong.
3. Re:Et tu Brute? by Anonymous Coward · 2004-05-13 15:14 · Score: 0
  
  Yes, I've heard of psyco. But it's not on the shootout website, so I didn't count it. ;)
  
  And since my interpreter is already several times faster than the python interpreter, I have no doubt that beating its JIT will be a walk in the park.
Sure - if you never call functions by Anonymous Coward · 2004-05-12 14:48 · Score: 1, Interesting

Once you start interpreting an OO-like or procedural language on this register-based machine you'll find that you have to create a stack anyway in order to save all the registers between function calls. Stack based machine or register based machine - in the end it's a complete wash.
MIT License? by Anonymous Coward · 2004-05-12 15:48 · Score: 0

The MIT license sucks. Hard.

Such a license came from a time when software patents were just a glint in a lawyer's eye. You're writing an optimizing stack interpreter. This is something which could easily have patentable features in it. Even if you're not planning to patent anything, your licensees need to be protected from your (theoretical) future tyranny.

Thus you need to include a patent release in your license. Something along the lines of "the author of the software grants you a non-exclusive, non-revokable worldwide license to use any patents fully or partly owned or controlled by the author, but only to the extent that such use is necessary to operate the Software".
Me too! by pfafrich · 2004-05-13 06:15 · Score: 1

For what its worth I've been writing an interpreter for mathematical equations in Java. The principal problem here is handeling different types of vectors and matricies (i.e. 2,3 & 4 dimensional vectors, 2 by 2 matricies etc). For each datatype I've used a different stack, so three different stacks for the three different vector types, and 9 stacks for matricies. Theres also corresponding heeps for each data type which are allocated during the compilation stage, eliminating the need for any object creation during evaluation.

--
There are four sorts of people in the world: fools, lunatics, idiots and morons. - Umberto Eco, Foucaut's pendulum.
Don't use stack by Anonymous Coward · 2004-05-13 06:32 · Score: 0

Use virtual register file, like Tao/Amiga VM. Much nicer, but probably patented in the US of Idiocy.
Re:searching papers: citeseer.ist.psu.edu by fprog · 2004-05-13 17:14 · Score: 1, Informative

You should try citeseer or one of its mirror too.
Most paper are free for download.
IEEE is another website or ACM queue.

http://citeseer.ist.psu.edu/cis?cs=1&q=stack+based +virtual+machine
Look at Lua 4 vs. Lua 5 VM by Anonymous Coward · 2004-05-14 03:51 · Score: 3, Informative

Lua switched from a stack based VM to a (semi-unbounded) register based VM for Lua 5. The speed improvement is quite noticeable (>30% if you do not account for time spent in C libraries). And there is still room for improvement if you are willing to drop portability and/or debugging/tracing.

The source code is very concise and reads nicely. The VM interpreter core loop is < 400 lines in src/lvm.c. Read this along with src/lopcodes.h.

Please read and understand both Lua 4 and Lua 5 VM core BEFORE designing your own VM. Go and download the code at www.lua.org/ftp/, it's less than 200K each.

BTW: It has tail calls, too.
Stack Machines - Burroughs / Unisys A Series by astrojetsonjr · 2004-05-15 00:06 · Score: 1

One of the early stack machines was the Burroughs computers from the 60's (B5000/B6000/B7000/A Series). The machines were pretty fast and had some interesting things they did in hardware. In the 80's Unisys had developed an A Series emulator that ran on a PC.
From what I know today, the A Series emulator runs on big Intel clusters with very good performance.
You might want to extend your Google search to include the Burroughs/Unisys stack machines. A trip over to comp.sys.unisys with your question may get a response from people inside that have done stack based emulators for years and years.
Why optimize? by Anonymous Coward · 2004-05-21 00:47 · Score: 0

If you want to use an interpreted stack-based architecture you're already 1.5 orders of magnitude slower than machine code. What's the point of spending a lot of time making that 1.4 orders of magnitude?