Actually, there is much more parallelism (more than 4 ops/cycle) available in many of these applications,
but you correctly observe that many ancillary effects (branch mispredictions, cache misses, etc.)
chip away at the achieved parallelism. The TRIPS ISA and microarchitecture (which is, as you correctly
point out, a variant of an OOO "superscalar" processor) have numerous features to try to mitigate many
of these effects: up to 64 outstanding cache misses from the 1,024-entry window, aggressive predication
to eliminate many branches, a memory dependence predictor, and direct ALU-ALU communication for
making data dependences more efficient.
The most important difference is in the ISA, which allows the compiler to express dataflow graphs
directly to the hardware, which will work best (compared to conventional designs) in ultra-small technologies where
the wires are quite slow. To get a similar dependence graph in a RISC or CISC ISA, a superscalar processor
must reconstruct it on the fly, instruction by instruction, using register renaming and issue window tag
broadcasting.
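To make the contrast concrete, here is a toy sketch of what "expressing the dataflow graph directly" could look like. The encoding is purely hypothetical (the names `Inst`, `targets`, and `run_block` are made up for illustration and this is not the real TRIPS instruction format): each producer names the consumers of its result, so values are forwarded point to point and nothing has to be renamed or broadcast to find out who depends on whom.

```python
from dataclasses import dataclass, field
import operator

# Hypothetical toy encoding, NOT the real TRIPS format: each instruction
# lists the (consumer index, operand slot) pairs that receive its result.
@dataclass
class Inst:
    op: str                      # "const", "add", or "mul"
    targets: list                # [(consumer_index, operand_slot), ...]
    value: int = 0               # used only by "const"
    operands: dict = field(default_factory=dict)  # slot -> arrived value

OPS = {"add": operator.add, "mul": operator.mul}

def run_block(block):
    """Fire each instruction once both of its operands have arrived."""
    results = {}
    ready = [i for i, inst in enumerate(block) if inst.op == "const"]
    while ready:
        i = ready.pop()
        inst = block[i]
        if inst.op == "const":
            out = inst.value
        else:
            out = OPS[inst.op](inst.operands[0], inst.operands[1])
        results[i] = out
        # Forward the result straight to the consumers named by the producer.
        for consumer, slot in inst.targets:
            block[consumer].operands[slot] = out
            if len(block[consumer].operands) == 2:
                ready.append(consumer)
    return results

# (a + b) * (a + 3) with a = 2, b = 5
block = [
    Inst("const", [(3, 0), (4, 0)], value=2),   # a
    Inst("const", [(3, 1)], value=5),           # b
    Inst("const", [(4, 1)], value=3),           # literal 3
    Inst("add", [(5, 0)]),                      # a + b
    Inst("add", [(5, 1)]),                      # a + 3
    Inst("mul", []),                            # product
]
results = run_block(block)
```

Note that the execution order falls out of operand arrival, not program order. A register-based ISA would encode the same computation through named registers, and the hardware would have to rediscover these producer-to-consumer edges at run time.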
Thanks for reading.
The big difference in TRIPS is that stuff flying around out in memory can be squashed easily.
The machine has aggressive branch prediction, efficient predication support in the ISA,
and data dependence prediction. So, the 1024 instructions don't need to be long vectors
streaming from memory. Squashing a mispredicted branch and restarting down the right path
takes on the order of 10-20 machine cycles.
Thanks for your comments and interest.
-DB
Just to be clear, what we are proposing is a faster scalar processor that happens to have lots of arithmetic units, which is optimized (through lots of speculation) to run single threads quickly. We do break binary compatibility, but are working on a static translation tool to convert dusty deck binaries into TRIPS binaries with (obviously) no programmer intervention.
That being said, we are incorporating modes where graphics and DSP-type workloads can take advantage of all those arithmetic units. We'd really like to merge the general processing and DSP/graphics markets, and our dataflow-like ISA allows different application types to be automatically (not by the programmer) mapped to high-frequency, highly-concurrent substrates but still be treated as a single thread.
Actually, the goal for this processor is to keep increasing single-thread performance, and not just by pumping up the clock speed, but by doing more ops per cycle. So, we want to increase performance *without* putting programmers through all the pain of parallelizing their code into fine-grained chunks.
Do you mean processor emulation, where you basically have a big switch statement based on which instruction you are emulating, jumping through an indirect branch to the execution semantics for each instruction? If so, that's an interesting problem. I think there's parallelism there, but I'll wait until I'm sure that's what we're discussing before spending the cycles thinking!
The resource utilization (on a 16-wide ALU core) is low in single-threaded mode, ranging from about 1 to 10 ops per cycle (averaging around 4). That's the same average utilization (about 25%) as the Alpha 21264, which has 4 units and averages about 1 op per cycle on the same benchmarks.
When we run data-parallel threads (FFTs, FIRs, DCTs, etc.) we show higher utilization, of course.
My hope is to raise the utilization with more tuning for the low-ILP benchmarks, but in the end, I care about performance since the transistors are cheap, so long as those transistors can also be used to exploit parallelism in other workloads when it exists.
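The utilization comparison above is just a ratio of achieved ops per cycle to available units. A minimal sketch, using only the rough averages quoted in this reply (the function name `utilization` is made up for illustration, and clock-rate differences are ignored):

```python
def utilization(ops_per_cycle, issue_width):
    """Fraction of execution units doing useful work in an average cycle."""
    return ops_per_cycle / issue_width

trips = utilization(4, 16)     # ~4 ops/cycle on a 16-wide core
alpha = utilization(1, 4)      # ~1 op/cycle on the 4-unit Alpha 21264
ops_ratio = 4 / 1              # raw ops/cycle ratio, ignoring clock rate
```

Both machines come out at the same fraction of busy units, but the wider core retires about four times as many ops per cycle, which is the number that matters when transistors are cheap.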
Thanks for an excellent question.
Two clarifications... one of the chip's modes IS designed for fast single-threaded, general purpose computing. Second, I'm biased too, but I think the ISA and the execution model are pretty revolutionary. :-)
What we're really after is building large single cores (16 ops in parallel, eventually 64 or 128). The compiler doesn't have to find them, just schedule the code and let the out-of-order window take care of it.
You can use the multiple processors if you have an application that parallelizes nicely. I'd rather have 10 100-op cores than 1000 1-op cores!
Actually, we are working a fair amount on latency of single threads. We deal with branches by predicating (removing) large numbers of them, so that ideally there are a hundred instructions or so between each branch (since when you predicate, both paths are fetched, you improve branch prediction at the expense of fetching more useless instructions). Our results show that this is a *big* win.
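For readers unfamiliar with predication, here is a small illustrative sketch of the trade-off described above. It is generic if-conversion written in Python, not TRIPS code, and the function names are made up: the branchy version needs a prediction, while the predicated version executes both arms and keeps one result.

```python
def abs_diff_branchy(a, b):
    # Control flow: the hardware must predict which way this branch goes.
    if a > b:
        return a - b
    return b - a

def abs_diff_predicated(a, b):
    # If-converted form: compute the predicate, execute BOTH arms,
    # then keep only the result whose predicate is true.
    p = int(a > b)       # predicate-defining instruction (1 or 0)
    taken = a - b        # useful only when p == 1
    not_taken = b - a    # useful only when p == 0
    # Arithmetic select: no branch, but one of the two subtractions is wasted.
    return p * taken + (1 - p) * not_taken
```

The wasted subtraction is the "useless instruction" cost; the payoff is that there is no branch left to mispredict.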
And yes, I'd love a 3ps memory too!
Your point about yields was right on... we have been discussing implementing redundant rows of arithmetic units, similar to how they do it in DRAM, so that the large area doesn't whack you in future technologies. The prototype won't have anything like that, of course...
We have three approaches to memory in TRIPS. First, by using a large (>1000 instruction) issue window, we tolerate lots of memory latency by hiding it behind other work. Second, we have a low-latency cache design called NUCA (non-uniform cache architecture). Third, we are incorporating efficient recovery from data value speculation (take a cache miss? Guess the value!).
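As a rough illustration of the third approach, the sketch below guesses a load's value instead of stalling, runs the dependent work on the guess, and redoes it only when the guess turns out wrong. The last-value predictor and all names here (`run_with_value_speculation`, `last_value`, `compute`) are assumptions chosen for simplicity; nothing in this reply says which prediction or recovery scheme TRIPS actually uses.

```python
def run_with_value_speculation(loads, memory, compute):
    """loads: list of (pc, address); compute: dependent work on the loaded value."""
    last_value = {}        # per-load-PC last-value predictor (an assumed scheme)
    outputs, replays = [], 0
    for pc, addr in loads:
        guess = last_value.get(pc, 0)      # speculate instead of stalling
        result = compute(guess)            # dependent work runs on the guess
        actual = memory[addr]              # the miss eventually returns
        if actual != guess:
            result = compute(actual)       # recovery: redo the dependent work
            replays += 1
        last_value[pc] = actual            # train the predictor
        outputs.append(result)
    return outputs, replays

memory = {0: 7, 8: 7, 16: 7, 24: 9}
loads = [(100, 0), (100, 8), (100, 16), (100, 24)]
outputs, replays = run_with_value_speculation(loads, memory, lambda v: v * 2)
```

In this toy run two of the four loads are guessed correctly, so only the other two pay the replay cost. The cheaper the replay, the more often guessing is worth it, which is why efficient recovery is the part that matters.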
Thanks for pointing out this important issue!
Thanks for the nice comment ... and no, I still don't have root access to my own machines!