Next-Gen Processor Unveiled
A bunch of readers sent us word on the prototype for a new general-purpose processor with the potential of reaching trillions of calculations per second. TRIPS (obligatory back-formation given in the article) was designed and built by a team at the University of Texas at Austin. The TRIPS chip is a demonstration of a new class of processing architectures called Explicit Data Graph Execution. Each TRIPS contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. The article claims that current high-performance processors typically are designed to sustain a maximum execution rate of four operations per cycle.
Come on now. It's a capitalist market. You can't just innovate your way to fame. Just like the list of 5 million other patents that never see the daylight.
It seems like for every "realist" claiming that Moore's law will soon hit a ceiling, I see another ZOMG Breakthrough! Lately, the question I've been asking myself is, "Will we ever surpass it?"
Isn't enough that I ruined a pony, making a gift for you?
Branches are no problem for TRIPS -- in the EDGE architecture, both execution paths resulting from a branch are computed, unlike in classic architectures where the processor blocks (8086), skips ahead a single instruction before blocking (MIPS), or chooses a path using a branch predictor and executes it, possibly only to discard all instructions issued since the branch if the predictor turns out to be wrong. EDGE architectures still lag on cache misses (or any memory hit) -- but that's fundamentally a problem with memory, not with the processor. Don't read the article, read the UT PDF.
Of course loop unwinding works fine... when you have a long loop. It does have two problems, though. 1) It only works when you have very long loops with very few dependencies between consecutive iterations. 2) Even when it does work, it makes the code footprint of the application much bigger, which means you end up putting a lot more stress on your cache pipeline, requiring bigger caches and a wider fetch engine. And all that aside, what about the vast majority of code segments where massively parallelizable loops are not being executed? Loop unwinding isn't going to help at all for those.
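To make the trade-off concrete, here is a minimal sketch of loop unwinding (more commonly called loop unrolling), written in Python purely for readability; a real compiler does this on machine code. The function names and the unroll factor of 4 are illustrative assumptions, not anything taken from TRIPS.

```python
def sum_rolled(values):
    """Plain loop: one add and one loop test per element."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled_by_4(values):
    """Same sum, manually unrolled by 4 (an arbitrary, assumed factor).

    The four accumulators do not depend on each other, so a wide
    machine could issue the four adds in the same cycle. The price is
    a loop body four times as long, plus a cleanup loop for leftovers.
    """
    n = len(values)
    a = b = c = d = 0
    i = 0
    # Unrolled body: four independent adds per iteration.
    while i + 4 <= n:
        a += values[i]
        b += values[i + 1]
        c += values[i + 2]
        d += values[i + 3]
        i += 4
    total = a + b + c + d
    # Cleanup loop for the 0 to 3 elements that did not fit.
    while i < n:
        total += values[i]
        i += 1
    return total
```

Both functions return the same result, but the unrolled one only pays off because each iteration is independent of the others. If every iteration needed the previous iteration's result, the four adds would have to run one after another and the larger code footprint would buy nothing.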
In a minute there is time For decisions and revisions which a minute will reverse. -T.S. Eliot
Oh, and before someone points this out for me, you have to imagine that the routing requirements are VASTLY improved. Imagine a grid of ALUs each connected by a single bus (simple), rather than 128 bypass busses all multiplexed into each ALU. (Chaos! Don't forget the MUX logic!) You map one instruction to one (virtual) ALU, rather than one result to a (virtual) register. Then you pipeline/march each instruction with its partial data down the grid until all the inputs come in. Instructions continually cascade in the top of the grid, and commit out the bottom. But their results are available to feed other instructions as soon as they are computed! You never have to wait for a MUX or a bus or what-have-you. Plus, you can clock the whole thing EXTREMELY fast, because you don't have these wire delays from difficult routing requirements.
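The "one instruction per ALU, results feed consumers directly" idea can be sketched as a toy dataflow interpreter. This is an assumed, simplified model for illustration only: the instruction format, the names `run_dataflow`, `PROGRAM`, and `INPUTS`, and the absence of any grid placement or timing are all my own simplifications, not the TRIPS instruction set.

```python
import operator


def run_dataflow(instructions, inputs):
    """Fire each instruction as soon as all of its operands have arrived.

    instructions: name -> (op, operand_count, [(consumer, slot), ...])
    inputs:       list of (target, slot, value) initial operands
    Returns name -> result for every instruction that fired.
    """
    operands = {name: {} for name in instructions}
    results = {}
    pending = list(inputs)
    while pending:
        target, slot, value = pending.pop()
        operands[target][slot] = value
        op, operand_count, consumers = instructions[target]
        if len(operands[target]) == operand_count:
            args = [operands[target][s] for s in range(operand_count)]
            result = op(*args)
            results[target] = result
            # Forward the result straight to the consuming instructions;
            # there is no shared register file in this model.
            for consumer, consumer_slot in consumers:
                pending.append((consumer, consumer_slot, result))
    return results


# Hypothetical program computing (2 + 3) * (10 - 4).
PROGRAM = {
    "add": (operator.add, 2, [("mul", 0)]),
    "sub": (operator.sub, 2, [("mul", 1)]),
    "mul": (operator.mul, 2, []),
}
INPUTS = [("add", 0, 2), ("add", 1, 3), ("sub", 0, 10), ("sub", 1, 4)]
```

Calling `run_dataflow(PROGRAM, INPUTS)` fires `add` and `sub` independently and fires `mul` only once both results have arrived. Each instruction names its consumers rather than a destination register, which is the property the comment above credits for the simpler routing.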
Multiway branching is ancient, and it's not used much because it's very inefficient. At least half of the instruction stream after a branch will be canceled; two branches deep it is 75%, and so on. No matter how much parallelism you throw at this, there are only marginal gains to be made (an exponential increase in the number of execution units for a linear increase in depth). It still doesn't get around data dependencies, which will be the major bottleneck if you look that far ahead in the instruction stream.
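The arithmetic behind that objection, as a small sketch. It assumes the simplest possible model: every branch is two-way, every path is followed eagerly, and exactly one path at each depth turns out to be the real one. The function name is hypothetical.

```python
def eager_execution_cost(depth):
    """Cost of following every path through `depth` nested two-way branches.

    Returns (paths, wasted_fraction): how many paths must be executed
    in parallel at that depth, and what share of them is thrown away.
    """
    paths = 2 ** depth            # doubles with every extra branch
    useful_fraction = 1 / paths   # only one path is the real one
    wasted_fraction = 1 - useful_fraction
    return paths, wasted_fraction


# Print the cost for the first few depths.
for depth in range(1, 5):
    paths, wasted = eager_execution_cost(depth)
    print(f"depth {depth}: {paths} paths, {wasted:.1%} of the work discarded")
```

Under this model, depth 1 discards 50% of the work and depth 2 discards 75%, matching the figures quoted above, while the number of parallel paths doubles with each additional level.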
Having read the articles that were easy to get to, and the abstract of the PhD student: this is buzzword bollocks. There is no innovation in what they have done. As other people have pointed out, this is a vector / datastream architecture. It's not a very good one at that. Although it has the "potential" to scale to teraflops, so does my toothbrush. On a 130 nm process they can fit 2 cores with 32-wide dispatch clocked at 500 MHz. My 7800 is fabbed on a 130 nm process with 24*4*4 = 384 operation wide vector dispatch. This prototype would hit about 16 billion ops/sec, versus 180 Gflop on the 7800. This is a long way from teraflops, and doesn't convince me that it can scale.
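For anyone who wants to check the numbers, here is the back-of-the-envelope calculation spelled out. It uses only figures quoted in this thread (2 cores, 16 operations per cycle per core, 500 MHz, 180 Gflop for the 7800) and treats them as peak rates; the function names are made up for the sketch.

```python
def peak_ops_per_second(cores, ops_per_cycle_per_core, clock_hz):
    """Peak throughput, assuming every issue slot is filled every cycle."""
    return cores * ops_per_cycle_per_core * clock_hz


def ops_per_cycle_needed(target_ops_per_second, clock_hz):
    """Issue width (across all cores) needed to hit a target at a given clock."""
    return target_ops_per_second / clock_hz


# Figures as quoted in the thread, not independently verified.
TRIPS_PEAK = peak_ops_per_second(cores=2, ops_per_cycle_per_core=16,
                                 clock_hz=500_000_000)
GPU_PEAK = 180_000_000_000  # the 180 Gflop claimed for the 7800

print(TRIPS_PEAK)                                     # 16 billion ops/sec
print(GPU_PEAK / TRIPS_PEAK)                          # GPU advantage, as a ratio
print(ops_per_cycle_needed(10 ** 12, 500_000_000))    # width for 1 tera-op/sec
```

At 500 MHz, reaching one trillion operations per second would take 2,000 operations per cycle across the chip, against the 32 of the prototype, so any path to that figure has to come from some mix of many more cores and a much faster clock.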
As the 7800 is close to a systolic model there is a limited class of programs that can be executed; but those that are in that class exhibit (near)perfect parallelism and so have zero hit from memory access costs. Actually the internal bandwidth on the 7800 is a bottleneck for some computations but I'm just going for coarse detail here.
EDGE appears to mix and match ideas from several parallel designs, every one of which suffers from hard code generation problems. I suspect that the only sample applications that hit 32 ops / cycle are media apps (or dataflow problems, as they used to be called), which normal architectures run at high speed anyway.
Interesting research, as it's always good to see people explore different designs, but it sounds overhyped and I believe that it has zero commercial appeal. Finally, as a sidenote, you are right about cache latencies being a memory defect rather than a processor one, but there are ways around it. If you are willing to limit yourself to a certain class of applications (roughly the same one that executes well on most parallel architectures such as this, or GPUs) then you can completely avoid the latency. This provides a much bigger performance hike than any other technique, as memory latency is a dominating factor in most runtimes. The only snag is that it is very hard to do, requires different fabrication technology (largely solved now), and lots of compiler advances... If you're interested then Google for intelligent RAM. It's about a decade of research now...
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
No, it isn't. The TRIPS group has done some really interesting things with compilers, for example. They've managed to have the compiler break up code into packets and schedule them on the processor array so that dependencies flow nicely across the grid. That is not an easy problem to tackle. This is very good research.
That's not the point of research. The point of research is to explore problems no one has tackled before, of course always with an eye toward future technology trends.