Next Generation Stack Computing

.NET Compatibility by DaHat · 2006-08-10 07:26 · Score: 4, Interesting

Interestingly enough the Microsoft Intermediate Language (MSIL) that .NET apps are compiled to before being JITed into machine code is actually built around a stack based system as well... No doubt porting the .NET Framework over to such a system would be quite easy... and give much in the way of performance boosts (especially on startup).

Of course... that would still depend on a version of Windows for it to run on.

--
Help Brendan pay off his student loans

X86 FPU's finally losing their stackness by GGardner · 2006-08-10 07:31 · Score: 4, Interesting

Since the dawn of time, the x86 FPU has been organized as a stack, which has been recognized as a mistake by modern computer architects. For one thing, it is hard to get a stack architecture to take advantage of multiple functional units. Only recently, with the development of SSE, 64 bit modes and other additions have we been able to move away from the stack on the x86.

Re:X86 FPU's finally losing their stackness by Anonymous Coward · 2006-08-10 07:50 · Score: 2, Interesting

we been able to move away from the stack on the x86
As someone who has written several Forth compilers for the x86 I'd like to point out that the design of the stacks on the x86 is very inefficient. The main stack is just that: a stack tied to one particular register. The FP stack was just a joke; a dead weasel could have designed something better. Anyway, I do like using Forth even under the x86 model - it's nice to know that my entire programming system fits into the on-die cache!

No, they're not better by vlad_petric · 2006-08-10 08:01 · Score: 3, Interesting

I didn't even try to torrent-download the thing, but I can tell you why stack machines aren't better than register-based ones. The main reason is that it's much much much harder to do renaming of a stack than of"regular" registers. Without renaming you can't do out-of-order execution ... Currently, there are two "mainstream" architectures that include stacked register files: Itanium and Sparc. Neither of them have out-of-order implementations.

But why do you need out-of-order execution? Well, misses to memory are very expensive these days - it can easily take from 200 to 400 cycles to service a load that misses all the way to main memory. This can have a significant effect on performance. What out-of-order execution does is to allow independent instructions that are younger than the load to execute in parallel with it. Quite often these parallely-executed instruction will generate other misses to main memory, overlapping their latencies. So - latency of loads that miss is still very high, but at the very least the processor is not idle while servicing them (for a good read see "MLP Yes! ILP no!" by Andy Glew)

Itanium and Sparc compensate for the fact that they don't do stuff out-of-order by putting sh*tloads of L2/3 cache on-chip. The cost of a miss is still very high, but it happens much less often. The manufacturing cost of a chip is also much higher.

Note that what NASA is sending into space is "old" tech. The reason - well, cosmic rays are much stronger in outer space, and the smaller the gate, the easier it is for them to flip its state.

P.S. I'm a computer architect.

--

The Raven

Re: transputer wikipedia link by Kevster · 2006-08-10 08:09 · Score: 2, Interesting

I'm surprised no one's mentioned the transputer.

--
I always equivocate. Well, almost always.

Re:NASA by Medievalist · 2006-08-10 08:18 · Score: 2, Interesting

I thought NASA was using components the rest of the world treated as obsolete due their proven durability and reliability in the radiation of space.

Essentially correct. It is so costly and time-consuming to get a new component certified for use that it's usually less work to find a clever way to use old components. Then ten months after launch production ceases on the old part, and you have to have special ones built at a hundred times the cost (military option) or scavenge them on eBay (scientific option).

Re:Who likes them? by Anonymous Coward · 2006-08-10 08:39 · Score: 2, Interesting

You're Just Wrong(tm) about that, actually. See BOOST. No, not the masochistic c++ template library (ANYTHING written in C++ is masochistic), Berkeley's Out of Order Stack Thingy. http://www.cs.berkeley.edu/~satrajit/cs252/BOOST.p df

Probably mostly just an accident of history register machines went superscalar first and "won" (mostly, because maybe since stack machines were more efficient, the need for superscalarity didn't hit so early...),. But, in short: stack machines, with similar design overheads to register machines, can extract at least as much concurrency as register machines, maybe more.

MMIX uses a register stack by bunratty · 2006-08-10 09:06 · Score: 2, Interesting

Knuth's MMIX architecture uses registers, but the registers themselves are on a register stack. Perhaps this architecture provides the best of both worlds.

--
What a fool believes, he sees, no wise man has the power to reason away.

Re:Stop Hurting My Eyes by Anonymous Coward · 2006-08-10 09:11 · Score: 1, Interesting

Agreed. I work with this guy; he's an idiot.

Re:Appropriate instruction set by vtcodger · 2006-08-10 09:23 · Score: 2, Interesting

***Then again I could cite my old favorite, the 6809.***

The 6809 was not only easy and fun to program, 6809 programs tended to benchmark out significantly faster than programs for comperable CPUs like the Z80, 6800 and 8080. If the industry ever decides to scrap the x86 mess -- which they won't -- going back to the 6809 for a starting point might not be a bad idea at all. I once did a plot of measured times for a benchmark where timings were available for a bunch of CPUs (Sieve of Eratosthenes). When you plotted out clockspeed vs word width, all the CPUs from the 8080 to the Cray something or other fell out into an untidy straight line, except for the 6809. There were, as I recall, three different results published for SOE on the 6809 and all three were an order of magnitude faster than they had any reasonable expectation of being based on the hardware's apparent capabilities.

--
You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey

Re:Stack - bad for speed, good for low power by 0xABADC0DA · 2006-08-10 10:14 · Score: 2, Interesting

Uh, I guess I'm too daft to get a Ph. D, but it sure seems to me like optimizing on the instruction level with a stack machine is solving the wrong problem.

With a stack machine, running one instruction stream in parallel is very hard, while very easy on a register-based one. But the flip side of this is that on a stack machine running multiple instruction streams in parallel is incredibly easy while *Very* difficult on a register based CPU.

For instance, take "add 1 to each element of this 30-length array" and the optimization to unroll the loop by three:

The stack version can use parallel streams:

push array # "stack[2]"
push 30 # "stack[1]"
push 1 # "stack[0]" of stream #1
push 2 # "stack[0]" of stream #2
push 3 # "stack[0]" of stream #3
push 3 # number of parallel streams to run
fork
loop:
add 1 to mem at (stack[2] + stack[0])
stack[0] += 3
if stack[0] < stack[1] goto loop
join

You'll have to use your imagination to expand the loop body into what it would look like in stack-instructions, but basically the fork pops the number of parallel stacks to run and then the join waits for each parallel stack to complete. Of course in a real implementation you would also push a number of stack elements to copy, etc. Since instruction decoders for stack machines are so simple your cpu can have literally hundreds of them on a die and each one still doing useful work.

The register-based machine will unroll the loop:

set r1 to 30
set r2 to 0
set r3 to array
loop:
set r4 to r3[r2]
set r5 to r3[r2+1]
set r6 to r3[r2+2]
add 1, r4 store in r4
add 1, r5 store in r5
add 1, r6 store in r6
store r4 to r3[r2]
store r5 to r3[r2+1]
store r6 to r3[r2+2]
add 3 to r2
compare r2 to r1, jump to loop

Now try to run that in parallel and you get a couple memory fetches/write overlayed, but mostly it is pretty slow. Just one hiccup in the pipeline and all of the parallelism stops. Now to mention the code to catch the remainder of the loop if not an even multiple.

Joy! by selfdiscipline · 2006-08-10 11:09 · Score: 2, Interesting

How come noone has mentioned the language Joy?
I've looked into it a couple times, and it seems pretty neat. In a word, functional concatenation.
Plus, as we all know, functional languages are so much more fun than procedural.

--

-------
Incite and flee.

A Near Miss for Stack Computing Circa 1981 by Baldrson · 2006-08-10 11:12 · Score: 5, Interesting

Stack computing came close to changing the course of the computer industry, including setting networking forward 15 years (displacing Microsoft's stand-alone approach to software) back in 1981.

An excerpt from a bit longer essay I wrote:

In August 1980, Byte magazine published its issue on the Forth programming language
At that time, I was working with Control Data Corporation's PLATO project, pursuing a mass market version of that system using the Intelligent Student Terminal (IST). The IST's were Z80 processor terminals sporting 512*512 bit mapped displays with touch sensitive screens and 1200bps modems that went for about $1500. We were shooting for, and actually successfully tested, a system that could support almost 8,000 simultaneous users on 7600-derived Cybers (the last machine designed by Seymour Cray to be marketed by CDC --with 60 bits per word, 6 bits per character, no virtual memory, but very big and very fast) with under 1/4 second response time (all keys and touch inputs went straight to the central processor) for $40/month flat rate including terminal rental. Ray Ozzie had been working at the University of Illinois on offloading the PLATO central system to the Z80 terminal through downloaded assembly language programming, doing exotic things like "local key echo" and such functions.
I was interested in extending Ray's work to offload the mass-market version of the PLATO central system. In particular I was looking at a UCSD Pascal-based approach to download p-code versions of terminal functions -- and even more in particular the advanced scalable vector graphics commands of TUTOR (the "relative/rotatable" commands like rdraw, rat, rcircle, rcircleb, etc.) if not entire programs, to be executed offline. Pascal was an attractive choice for us at the time because CDC's new series of computers, the Cyber 180 (aka Cyber 800) was to have virtual memory, 64 bit words, 8 bit characters and be programmed in a version of the University of Minnesota Pascal called CYBIL (which stood for Cyber Implementation Language). Although this was a radically different architecture than that upon which PLATO was then running, I thought it worthwhile to investigate an architecture in which a reasonable language (you should have seen what we were used to!) could be made to operate on both the server and the terminal so that load could be dynamically redistributed. This idea of dynamic load balancing would, later, contribute to the genesis of Postscript.
Over one weekend a group of us junior programmers managed to implement a good portion of TUTOR's (PLATO's authoring language) advanced graphics commands in CYBIL. Our little hunting pack at CDC 's Arden Hills Operations was in a race against the impending visit of Dave Anderson of the University of Illinois' PLATO project who was promoting what he called "MicroTUTOR". Anderson was going to take the TUTOR programming language and implement a modified version of it for execution in the terminal -- possibly in a stand-alone mode. Many of us didn't like TUTOR, itself, much. Indeed, I had to pull teeth to get the authorization to put local variables into TUTOR -- and we were determined to select a better board from our quiver with which to surf Moore's Shockwave into the Network Revolution. CDC management wasn't convinced that such a radical departure from TUTOR would be wise, and we hoped to demonstrate that a p-code Pascal approach could accomplish what microTUTOR purported to -- and more. We quickly ported a TUTOR central sy

--
Seastead this.

Re:Stop Hurting My Eyes by Eric+LaForest · 2006-08-10 11:57 · Score: 2, Interesting

"Out of order processing has been done in stack machines for years now."

As far as I know, there are no implemented second-gen stack computers that support that feature.
(There have been a few theoretical ones.)
Which ones are you talking about?

--
none

The X86 is an example of everything! by Cassini2 · 2006-08-10 12:28 · Score: 3, Interesting

I did a computer architecture course a number of years ago. One day, we came to the consensus that the X86 architecture was an example of every computer architecture in existence. You want load store: look at all those MOV AX, xxxx instructions. You want register RISC, look at all those registers AX, BX, CX, DX, SI, DI, SP, BP. You want stack based: look at the FPU. You want vector parallel processing, look at those MMX/SSE instructions. You want symmetric multi-processing, look at those dual cores.

The course went quickly downhill after this observation. No one could figure out how incorporating every processor architecture into one product was a good thing ...

If Stack-Based Computing Is So Great... by Nom+du+Keyboard · 2006-08-10 16:13 · Score: 2, Interesting

If stack-based computing is so great, powerful, and cheap, why aren't IBM PPC, AMD Athlon, Intel Core pick-a-number, and Sun Sparc dueling it out for the best stack-based chip. Why aren't the next-gen game consoles all using it, since Microsoft and Sony at least (Wii is just a faster GC) went to new architectures. Don't tell me no one has ever heard of the concept before. The Burroughs 5500 dates back to the late 1960's. I think there's more here than is being told.

--
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."

Parallalism by Peaker · 2006-08-11 00:15 · Score: 2, Interesting

How about achieving parallelism by using multiple stacks?

If a stack machine is that much simpler, couldn't you either have:

A vast amount of cores for many unrelated threads
Or: Multiple pipelines and explicit division of instructions into the pipelines?

The second refers to an instruction coding similar to VLIW such that you parallelise the code on multiple stacks but it still shares an instruction/data cache and allows for parallelism without heavy multi-threading at the high-level (and instead having parallelism as a compiler optimization at the low-level).

Re:Twelfth of Never by Eric+LaForest · 2006-08-11 06:55 · Score: 2, Interesting

Correct, except 2nd-gen stack machines have a dedicated stack to hold those return addresses, so they never get to memory. Makes for very fast calls and returns. Experiments by Prof. Koopman have shown that for all practical purposes, a return address stack of 16 elements is deep enough. There is such a 16-deep stack (hidden from the programmer) on the Pentium 4 (and the Alpha AXP too I think): http://blogs.msdn.com/oldnewthing/archive/2004/12/ 16/317157.aspx

--
none

Re:Not a good idea by coats · 2006-08-13 00:58 · Score: 2, Interesting

Suppose you put *two* dozen of them on a chip, and suppose they are *four* times faster. You still have less than a quarter the performance of a Conroe or POWER5 (both of which are dual-core, with each core sustaining more than 200 instructions in flight at a time), and you still have to manage that parallelism "by hand". Actually, the "four times faster" won't work, either -- remember that memory is still 200 times slower than Conroe or POWER5; if memory were 800 times slower than your processor, you'd really lose your performance!

This has been discussed ad nauseam in the computer architecture community, and I repeat: it's not a good idea!

--
"My opinions are my own, and I've got *lots* of them!"

Slashdot Mirror

Next Generation Stack Computing

19 of 347 comments (clear)