Next Generation Stack Computing
mymanfryday writes "It seems that stack computers might be the next big thing. Expert Eric
Laforest talks
about stack computers and why they are better than register-based
computers. Apparently NASA uses stack computers in some of their probes. He
also claims that a kernel would only be a few kilobytes large! I wonder if
Windows will be supported on a stack computer in the future?"
Interestingly enough the Microsoft Intermediate Language (MSIL) that .NET apps are compiled to before being JITed into machine code is actually built around a stack based system as well... No doubt porting the .NET Framework over to such a system would be quite easy... and give much in the way of performance boosts (especially on startup).
Of course... that would still depend on a version of Windows for it to run on.
Help Brendan pay off his student loans
Since the dawn of time, the x86 FPU has been organized as a stack, which has been recognized as a mistake by modern computer architects. For one thing, it is hard to get a stack architecture to take advantage of multiple functional units. Only recently, with the development of SSE, 64 bit modes and other additions have we been able to move away from the stack on the x86.
But why do you need out-of-order execution? Well, misses to memory are very expensive these days - it can easily take from 200 to 400 cycles to service a load that misses all the way to main memory. This can have a significant effect on performance. What out-of-order execution does is to allow independent instructions that are younger than the load to execute in parallel with it. Quite often these parallely-executed instruction will generate other misses to main memory, overlapping their latencies. So - latency of loads that miss is still very high, but at the very least the processor is not idle while servicing them (for a good read see "MLP Yes! ILP no!" by Andy Glew)
Itanium and Sparc compensate for the fact that they don't do stuff out-of-order by putting sh*tloads of L2/3 cache on-chip. The cost of a miss is still very high, but it happens much less often. The manufacturing cost of a chip is also much higher.
Note that what NASA is sending into space is "old" tech. The reason - well, cosmic rays are much stronger in outer space, and the smaller the gate, the easier it is for them to flip its state.
P.S. I'm a computer architect.
The Raven
I'm surprised no one's mentioned the transputer.
I always equivocate. Well, almost always.
You're Just Wrong(tm) about that, actually. See BOOST. No, not the masochistic c++ template library (ANYTHING written in C++ is masochistic), Berkeley's Out of Order Stack Thingy. http://www.cs.berkeley.edu/~satrajit/cs252/BOOST.p df
Probably mostly just an accident of history register machines went superscalar first and "won" (mostly, because maybe since stack machines were more efficient, the need for superscalarity didn't hit so early...),. But, in short: stack machines, with similar design overheads to register machines, can extract at least as much concurrency as register machines, maybe more.
Knuth's MMIX architecture uses registers, but the registers themselves are on a register stack. Perhaps this architecture provides the best of both worlds.
What a fool believes, he sees, no wise man has the power to reason away.
Agreed. I work with this guy; he's an idiot.
The 6809 was not only easy and fun to program, 6809 programs tended to benchmark out significantly faster than programs for comperable CPUs like the Z80, 6800 and 8080. If the industry ever decides to scrap the x86 mess -- which they won't -- going back to the 6809 for a starting point might not be a bad idea at all. I once did a plot of measured times for a benchmark where timings were available for a bunch of CPUs (Sieve of Eratosthenes). When you plotted out clockspeed vs word width, all the CPUs from the 8080 to the Cray something or other fell out into an untidy straight line, except for the 6809. There were, as I recall, three different results published for SOE on the 6809 and all three were an order of magnitude faster than they had any reasonable expectation of being based on the hardware's apparent capabilities.
You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
Uh, I guess I'm too daft to get a Ph. D, but it sure seems to me like optimizing on the instruction level with a stack machine is solving the wrong problem.
With a stack machine, running one instruction stream in parallel is very hard, while very easy on a register-based one. But the flip side of this is that on a stack machine running multiple instruction streams in parallel is incredibly easy while *Very* difficult on a register based CPU.
For instance, take "add 1 to each element of this 30-length array" and the optimization to unroll the loop by three:
The stack version can use parallel streams:
push array # "stack[2]"
push 30 # "stack[1]"
push 1 # "stack[0]" of stream #1
push 2 # "stack[0]" of stream #2
push 3 # "stack[0]" of stream #3
push 3 # number of parallel streams to run
fork
loop:
add 1 to mem at (stack[2] + stack[0])
stack[0] += 3
if stack[0] < stack[1] goto loop
join
You'll have to use your imagination to expand the loop body into what it would look like in stack-instructions, but basically the fork pops the number of parallel stacks to run and then the join waits for each parallel stack to complete. Of course in a real implementation you would also push a number of stack elements to copy, etc. Since instruction decoders for stack machines are so simple your cpu can have literally hundreds of them on a die and each one still doing useful work.
The register-based machine will unroll the loop:
set r1 to 30
set r2 to 0
set r3 to array
loop:
set r4 to r3[r2]
set r5 to r3[r2+1]
set r6 to r3[r2+2]
add 1, r4 store in r4
add 1, r5 store in r5
add 1, r6 store in r6
store r4 to r3[r2]
store r5 to r3[r2+1]
store r6 to r3[r2+2]
add 3 to r2
compare r2 to r1, jump to loop
Now try to run that in parallel and you get a couple memory fetches/write overlayed, but mostly it is pretty slow. Just one hiccup in the pipeline and all of the parallelism stops. Now to mention the code to catch the remainder of the loop if not an even multiple.
How come noone has mentioned the language Joy?
I've looked into it a couple times, and it seems pretty neat. In a word, functional concatenation.
Plus, as we all know, functional languages are so much more fun than procedural.
-------
Incite and flee.
An excerpt from a bit longer essay I wrote:
Seastead this.
"Out of order processing has been done in stack machines for years now."
As far as I know, there are no implemented second-gen stack computers that support that feature.
(There have been a few theoretical ones.)
Which ones are you talking about?
none
I did a computer architecture course a number of years ago. One day, we came to the consensus that the X86 architecture was an example of every computer architecture in existence. You want load store: look at all those MOV AX, xxxx instructions. You want register RISC, look at all those registers AX, BX, CX, DX, SI, DI, SP, BP. You want stack based: look at the FPU. You want vector parallel processing, look at those MMX/SSE instructions. You want symmetric multi-processing, look at those dual cores.
...
The course went quickly downhill after this observation. No one could figure out how incorporating every processor architecture into one product was a good thing
If stack-based computing is so great, powerful, and cheap, why aren't IBM PPC, AMD Athlon, Intel Core pick-a-number, and Sun Sparc dueling it out for the best stack-based chip. Why aren't the next-gen game consoles all using it, since Microsoft and Sony at least (Wii is just a faster GC) went to new architectures. Don't tell me no one has ever heard of the concept before. The Burroughs 5500 dates back to the late 1960's. I think there's more here than is being told.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
If a stack machine is that much simpler, couldn't you either have:
- A vast amount of cores for many unrelated threads
- Or: Multiple pipelines and explicit division of instructions into the pipelines?
The second refers to an instruction coding similar to VLIW such that you parallelise the code on multiple stacks but it still shares an instruction/data cache and allows for parallelism without heavy multi-threading at the high-level (and instead having parallelism as a compiler optimization at the low-level).Correct, except 2nd-gen stack machines have a dedicated stack to hold those return addresses, so they never get to memory. Makes for very fast calls and returns. Experiments by Prof. Koopman have shown that for all practical purposes, a return address stack of 16 elements is deep enough. There is such a 16-deep stack (hidden from the programmer) on the Pentium 4 (and the Alpha AXP too I think): http://blogs.msdn.com/oldnewthing/archive/2004/12/ 16/317157.aspx
none
This has been discussed ad nauseam in the computer architecture community, and I repeat: it's not a good idea!
"My opinions are my own, and I've got *lots* of them!"