Next Generation Stack Computing

Size and functionality by Angst+Badger · 2006-08-10 07:25 · Score: 4, Insightful

He also claims that a kernel would only be a few kilobytes large!

I've seen sub-1k kernels for FORTH systems before. The question is, how much functionality do you want to wrap into that kernel? More capable kernels would, of course, be correspondingly larger.

That said, stack computing and languages like FORTH have long been underrated. Depending on the application, the combination of stack computers and postfix languages can be quite powerful.

--
Proud member of the Weirdo-American community.

Re:Size and functionality by merlin_jim · 2006-08-10 07:55 · Score: 4, Insightful

Couldn't you program a stack computer just as well with a prefix functional language like Scheme?

Sure you can - and it compiles to postfix notation anyways, rather ineffeiciently I might add (get it, add????)

let's say you wanted to write a function like:
function addsubandmultiply(b, c, d, e) {
a = (b + c) * (d - e);
return a;
}

and you've got assembly level instructions such as mov, add, sub, mult, push, and pop, as well as
the very stack-centric stor and lod, allowing you to move one or more stack variables to memory and
the reverse.

A typical register based computer might compile the above as:

pop b
pop c
pop d
pop e
mov b, ax
mov c, bx
add bx
mov ax, temp_memory
mov d, ax
mov e, bx
sub bx
mov temp_memory, bx
mult bx
push a

Whereas a stack-based computer might compile as:

add
stor temp_memory
sub
lod temp_memory
mult

In a stack based computer, operations are carried out directly on your stack... it's very convenient,
since most languages compile function calls to use the stack anyways, and as you can see not having
to deal with an accumulator register makes for much terser code. Between 20 - 40% of your compiled code is spent moving data in and out of the accumulator register, since most instructions depend on
specific data being in that register - to the point that they introduced zero-cycle add/mov functionality in the P4 line - basically, if your code performs an add and then movs ax immediately
out to memory (like the above code - and possibly the most common arithmetic operation in compiled code), if the pipeline and data caches are all available, the P4 will
execute both instructions with enough time to put something else in the instruction pipeline that
cycle. It's not really a zero-cycle function - you can do something like 2.5 (add,mov,add,mov,add) a cycle if you stack them back to back to back, for instance...

Yes, Intel released a benchmark for it. No, I can't imagine why you would want to keep adding and moving the results around memory - maybe some esoteric functions like a fibbanoci generator or even a DSP algorithm of some sort might need to do it, but I don't think it'll be all that often... or that any compiler would have an optimisation to specifically output that sequence if appropriate...

--
I am disrespectful to dirt! Can you see that I am serious?!
Re:Size and functionality by Anonymous Coward · 2006-08-10 08:27 · Score: 4, Insightful

Actually x86 is inbetween a stack machine and a register based machine.
What most register machines compile the following code:
function addsubandmultiply(b, c, d, e) {
a = (b + c) * (d - e);
return a;
}
Into something like (sorry for PPC asm):
add r3, r3, r4
sub r4, r5, r6
mulw r3, r3, r4
blr #(return)

Now tell me that is not just as simple (or even simplier) as the stack based one?

Awesome by LinuxFreakus · 2006-08-10 07:27 · Score: 4, Insightful

Does this mean my old HP48GX will be considered cutting edge? I should get ready to sell it on EBay when the craze hits! All my old classmates will be forced to allow me to have the last laugh after I was on the recieving end of much ridicule for using the HP when the TI was the only thing "officially" endorsed by all the calculus textbooks. I don't know if I could ever part with it though. I still use it almost daily, the thing continues to kick ass.

Re:Twelfth of Never by jellomizer · 2006-08-10 07:35 · Score: 2, Insightful

But in reality this would be a Major Redesign of the OS, and all Apps would need to be recompiled/emulated. Registers are a core part of assembably language. Having to remake Windows would be like making Windows for the Power PC, If not more difficult.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.

There is one very widely used FORTH-type language by porkchop_d_clown · 2006-08-10 07:46 · Score: 2, Insightful

that almost every /. user encounters every day: Postscript and PDF.

--
Clear, Dark Skies

Appropriate instruction set by dpilot · 2006-08-10 07:46 · Score: 4, Insightful

Even in assembler, the mainstream hasn't been programming to the metal since Pentium I.

Beginning with Pentium II, and propagating to pretty much all of the other archictures in a short time, non of the mainstream CPUs have exposed their metal. We have an instruction set, but it's torn into primitives and scheduled for execution. We don't see the primitives, not even in assembler. AFAIK, there isn't even a way to use the true primitives, except perhaps on the Transmeta, where it was undocumented.

So in this light, since we're already fairly far from the true metal, it seems to me that it makes a lot of sense to re-evaluate the instruction set itself. Of course one could raise the Itanium argument, but I would also argue that politics were too big a part, there. Then again, one could also argue that x86 and amd64 are just so entrenched that it doesn't matter, and they do run well on today's hardware.

Then again I could cite my old favorite, the 6809. It started from the same origins and precepts as RISC, but a different attitude. RISC simply tried to optimize the most common operations, at the expense of less common ones. With the 6809, they tried to understand WHY certain things were happening, and how those things could be done better and faster. They ended up with a few more transistors, the same speed, and something approaching 3X the throughput, as compared to the 6800. More similar to the current topic, there was a paper on 'contour mapping', mapping blocks of cache into stacks and data structures. The 6809 was too old for a cache, but it seems to me that combining it's concepts with the contour mapping would be interesting indeed.

But like stack engines, it's not x86/amd64 compatible.

--
The living have better things to do than to continue hating the dead.

Re:Appropriate instruction set by stevesliva · 2006-08-10 08:47 · Score: 2, Insightful

He's saying that the latest Intel chips run micro-ops that do not have a 1-to-1 correspondence with the x86 ISA to which you refer. Git it?

--
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts

Re:Twelfth of Never by real_b0fh · 2006-08-10 07:47 · Score: 3, Insightful

actually, if windows is 'done right' all it would take is a recompile. I don't think there is a lot of assembler code in the windows source that needs to be rewritten, most code will be in C or C++.

--
"Contrary to popular belief, UNIX is user friendly. It just happens to be selective on who it makes friendship with"

Why these downright stupid comments? by Jerk+City+Troll · 2006-08-10 07:47 · Score: 3, Insightful

You “wonder if Windows will run on a stack computer?” Where do you people come up with this nonsense? This is as irrelevant as saying: "someday, car tires will not be made of rubber. I wonder if Windows will support them?" Really, there is no need to try to come up with insightful remarks or questions to tack on the end of your story submissions. Just present the article and leave it at that. Let everyone else do the thinking.

--
Join Tor today!

Stack machines - again? by Animats · 2006-08-10 07:54 · Score: 3, Insightful

Who can forget the English Electric Leo-Marconi KDF9, the British stack machine from 1960. That, and the Burroughs 5000, were where it all began.

Stack machines are simple and straightforward to build, but are hard to accelerate or optimize. Classically, there's a bottleneck at the top of the stack; everything has to go through there. With register machines, low-level concurrency is easier. There's been very little work on superscalar stack machines. This student paper from Berkeley is one of the few efforts.

It's nice that you can build a Forth machine with about 4000 gates, but who cares today? It would have made more sense in the vacuum tube era.

Not a good idea by coats · 2006-08-10 07:55 · Score: 4, Insightful

The reason modern systems are so fast is that they hide a lot of fine grained parallelism behind the scenes. It is very hard to express this kind of parallelism in a way that it can be executed on a stack machine.

How important is this parallism? Consider that modern processors have 10-30 pipeline stages, 3-6 execution units that can have an instruction executing at each stage; moreover, most of them have out-of-order execution units that handle instructions more in the order that data is available for them rather than the order they are listed in the object file (and main memory is hundreds of times slower than the processors themselves, so this is important!). Typically, such processors can have more than 100 instructions in some stage of execution (more than 250 for IBM POWER5 :-)

Consider, also, that the only pieces of anything-like-current stack hardware are Intel x87-style floating point units, that Intel is throwing away -- for good reason! -- in favor of (SSE) vector style units. In the current Intel processors, the vector unit emulates an x87 if it needs to -- but giving only a quarter of the performance.

Someone made remarks about Java and .Net interpreters: in both cases, the interpreter is simulating a purely scalar machine with no fine grained parallelism; no wonder an extensible software-stack implementation is one of the simplest to implement. Stacks are not the way that true Java compilers like gjc generate code, though!

No, stack-based hardware is not a good idea. And haven't been since some time in the eighties, when processors started to be pipelined, and processor speed started outstripping memory speed.

--
"My opinions are my own, and I've got *lots* of them!"

NASA by HTH+NE1 · 2006-08-10 07:55 · Score: 4, Insightful

Apparently NASA uses stack computers in some of their probes.

Is that supposed to be a ringing endorsement? I thought NASA was using components the rest of the world treated as obsolete due their proven durability and reliability in the radiation of space.

--
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?

Stack computers are hardly new by Junks+Jerzey · 2006-08-10 08:09 · Score: 4, Insightful

Normally this kind of stuff doesn't bug me, but this is like an article in 2006 proclaiming the benefits of object-oriented programming. Doesn't anyone know their computing history?

There were stack computers in the 1960s and 1970s. There was a resurgence of interest in the 1980s--primarily because of Forth's popularity in embedded systems--resulting in a slew of stack-based "Forth engines." Forth creator Chuck Moore has been working on a series of custom Forth CPUs for 20+ years now. His latest version has 24 cores on one chip (and was entirely designed by one person and uses MILLIWATTS of power).

Stack processors and languages have one big advantage: they minimize the overall complexity of a system. The tradeoff is that they often push some of that complexity onto the programmer. That's why Forth tends to shine for small, controlled systems (like a fuel injector or telescope controller or fire alarm), but you don't see people writing 3D video games or web browsers in Forth.

Re:Stop Hurting My Eyes by HiThere · 2006-08-10 08:23 · Score: 2, Insightful

I believe that your criticisms apply to only specific stack based architectures. That they do apply to the commonly presumed architectures I accept, but this is far from asserting them as general truths.

Actually, even asserting that register based computers solve the problems that you are describing is not a general truth. You need to specify how many registers of what type can deal with how many out of order processes. And I suspect that a stack computer with 5 or 6 stacks could deal as flexibly with that problem as is commonly required...with a bit of extra leeway. It would need to be able to implement rapid task switching based on a "high priority task stack"...and maintaining that would be a bit of a nuisance...but that particular stack could have a very short limit, say 50 items. I'll agree that this is one function that it would be better to handle in scratchpad memory, but it would be eminently possible to do it purely from a stack based approach. (Still, there's a good reason that priority queues are queues rather than stacks...and I would argue that a dequeue would be an even better approach.

Well, I'm neither a hardware engineer nor a computer system designer, so I could be wrong. OTOH, you're anonymous, which means that your arguments only deserve the weight that their own internal logic provides.

--

I think we've pushed this "anyone can grow up to be president" thing too far.

Stack - bad for speed, good for low power by Theovon · 2006-08-10 08:59 · Score: 5, Insightful

I'm a chip designer, and I am working on my Ph.D. in CS. The idea of stack machines is something I have researched a bit, and I have drawn some of my own conclusions.

The main advantage of stack machines is that all or most parameters for each instruction are implicit. Aside from stack shuffle/rotate instructions, the operands are always the top few on the stack. This makes instructions very small. The logic is also exceedingly simple (for fixed-stack designs). If you want a simple, low-power CPU, a stack machine is what you want.

Where I explored this issue, however, is in the realm of high-performance computing. The key advantage of a stack architecture is that smaller instructions take less time to fetch from memory. If your RISC instructions are 32 bits, but your stack machine instructions are 8 bits, then your instruction caches are effectively 4x larger, and your over-all cache miss penalty is greatly reduced.

The problem with stack machines is that they're damn near impossible to add instruction-level parallelism to. With a RISC machine, near-by instructions that deal with different registers (i.e. no dependencies) can be executed in parallel (whether that's multi-issue or just pipelining). With a stack machine, everything wants to read/write the top of the stack.

I came up with two things to deal with this problem, that are very much like the CISC-to-RISC translation done by modern x86 processors, so it's more of a stack ISA on a RISC architecture. One is that the stack is virtual. When you want to pop from the stack, what's happening in the front-end of the CPU is that you're just popping register numbers corresponding to a flat register file. When you want to push, you're allocating an assigned register number from the flat register file. Now, if you can get two instructions going that read different parts of the stack and write (naturally) to different locations, you can parallelize them. The second part is a healthy set of register shuffling instructions. Since you're doing all of this allocation up front, shuffling registers is as simple as renumbering things in your virtual stack. So a swap operation swaps two register numbers (rather than their contents), and a rotate operation renumbers a bunch of them, but the pending instructions being executed still dump their results in the same physical registers.

This all sounds great, but there are some problems with this:

(1) The shuffling instructions are separate instructions. With a RISC processor, you have more information all in one unit. Although you could try to fetch and execute multiple stack instructions at once, it's much more complicated to execute four stack instructions in parallel than to execute a single RISC instruction, even though they require the same amount of memory.
(2) You need a lot of shuffling instructions. Say your stack contains values A, B, C, and D, and you want to sum them. Without shuffling, you'd add A and B, yielding E, then add E and C, yielding F, then add F and D. Three add instructions. If your adder(s) is/are pipelined, you'd like to add A+B and C+D in parallel or overlapping, THEN wait around for their results and do the third add. The problem is that to do that, you'd need to add A+B, then rotate C to the top then D to the top, then add, then add again. The first case was 3 instructions; the second case is 5 instructions. Depending on your architecture, the extra shuffle instructions may take so long to process that you might as well just have waited. No speed gain at all.
(3) The extra shuffing instructions take up space. Optimizers are hard to write. Although it's conceivable that one could optimize for this architecture so as to avoid as many shuffling instructions as possible, you still end up taking up quite a lot of space with them, potentially offsetting much of the space savings that you got from switching from RISC to stack.

So, there you have it. Somewhat OT, because surely NASA's primary goal has got to be low-power, but also somewhat on-topic because stack architectures aren't the holy grail. Just ideal for some limited applications.

Re:Oh? by Anonymous Coward · 2006-08-10 12:16 · Score: 1, Insightful

Bzzzzt! No actual machine can ever be Turing complete, because theoretical Turing machines are capable of calculations which require an unbounded amount of space.

Bzzzzt! No actual machine may be Turing complete because all actual machines are bounded (even if the difference rarely matters in practice), but a machine architecture can certainly be Turing complete, because they're also theoretical, abstract constructs. (To make a connection with your statements, a machine architecture is equivalent to a programming language, although that's a rather limited way of looking at Turing completeness as well.)

Proof: Simply specify any of the available explicit Turing machine descriptions as your machine architecture. You may only be able to build a reduced version of it, but your machine architecture is still Turing complete. And that's what we're talking about, machine architectures, not implementations.

Slashdot Mirror

Next Generation Stack Computing

17 of 347 comments (clear)