Next-Gen Processor Unveiled
A bunch of readers sent us word on the prototype for a new general-purpose processor with the potential of reaching trillions of calculations per second. TRIPS (obligatory back-formation given in the article) was designed and built by a team at the University of Texas at Austin. The TRIPS chip is a demonstration of a new class of processing architectures called Explicit Data Graph Execution. Each TRIPS contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. The article claims that current high-performance processors typically are designed to sustain a maximum execution rate of four operations per cycle.
But when are they likely to be ready?
A unique way to learn a language: http://languageloom.com
The article contains little more information than the blurb.
But it seems to me that we called this great new invention "vector processors" 15 years ago, and there is a reason they aren't around anymore.
"Many instructions in flight"=="huge pipeline flushes on context switches"+"huge branching penalties" anybody?
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
1. Copy some university press release to your blog
2. Make sure google ads show up at the top of the page
3. Submit blog to slashdot
4. Profit
Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. Current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.
Is it just me, or are they just euphemistically paraphrasing out-of-order execution?
Key Innovations:
Explicit Data Graph Execution (EDGE) instruction set architecture
Scalable and distributed processor core composed of replicated heterogeneous tiles
Non-uniform cache architecture and implementation
On-chip networks for operands and data traffic
Configurable on-chip memory system with capability to shift storage between cache and physical memory
Composable processors constructed by aggregating homogeneous processor tiles
Compiler algorithms and an implementation that create atomically executable blocks of code
Spatial instruction scheduling algorithms and implementation
TRIPS Hardware and Software
The EDGE architecture gets rid of relying on a single register file to communicate results between instructions. Instead, a producer-consumer ISA directly sends results to one of 128 instructions in a superblock (sort of like a basic block, but larger). In this way, hopefully more instruction-level parallelism can be extracted because superscalars can't really go beyond 4-wide (8-wide is a stretch...DEC was attempting this before Alpha was killed). Nice concept, but it doesn't solve many pressing problems in computer architecture, namely the memory wall and parallel programmability.
The link has NO information.
The PDF here: has more information about EDGE.
The basic idea is that CISC/RISC architectures rely on storing intermediate data in registers (or in main memory on old skool CISC). EDGE bypasses registers: the output of one instruction is fed directly to the input of the next. No need to do register allocation while compiling. I'm still reading the PDF, but this sounds like a really neat idea.
The only question is whether this will be so much better than existing ISAs that it eventually replaces them, even if only for specific applications like high-performance computing.
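To make the "output of one instruction is fed directly to the input of the next" idea concrete, here is a minimal Python sketch. The block encoding, the contents of `BLOCK`, and the `run_block` helper are all made up for illustration and are not the real TRIPS instruction format. It also assumes every non-constant operation takes exactly two operands.

```python
import operator

# Hypothetical encoding: each instruction names its consumers as
# (instruction index, operand slot) pairs instead of a destination register.
BLOCK = [
    {"op": "const", "value": 3, "targets": [(2, 0)]},
    {"op": "const", "value": 4, "targets": [(2, 1), (3, 1)]},
    {"op": "add", "targets": [(3, 0)]},
    {"op": "mul", "targets": []},  # nothing consumes this: the block's output
]
OPS = {"add": operator.add, "mul": operator.mul}


def run_block(block):
    """Execute a block in dataflow order and return {index: result}."""
    operands = [dict() for _ in block]  # per-instruction operand buffers
    results = {}
    # Constants need no inputs, so they can fire immediately.
    ready = [i for i, ins in enumerate(block) if ins["op"] == "const"]
    while ready:
        i = ready.pop()
        ins = block[i]
        if ins["op"] == "const":
            value = ins["value"]
        else:
            value = OPS[ins["op"]](operands[i][0], operands[i][1])
        results[i] = value
        # Forward the result straight to each consumer's operand slot.
        for consumer, slot in ins["targets"]:
            operands[consumer][slot] = value
            if len(operands[consumer]) == 2:  # both operands have arrived
                ready.append(consumer)
    return results


print(run_block(BLOCK)[3])  # (3 + 4) * 4
```

Note that no register names appear anywhere: the only "addresses" are positions of other instructions inside the block, which is the part that removes register allocation for intermediate values.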
It seems like for every "realist" claiming that Moore's law will soon hit a ceiling, I see another ZOMG Breakthrough! Lately, the question I've been asking myself is, "Will we ever surpass it?"
Isn't enough that I ruined a pony, making a gift for you?
The motivations for this technology provided in the article ignore some rather basic facts.
They point out that current multi-core architectures put a huge burden on the software developer. This is true, but their claim that this technology will relieve that burden is dubious. They mention, for example, that current processing cores can typically only perform 4 simultaneous operations per-core, and imply that this is some kind of weakness. They completely fail to mention that the vast majority of applications running on those processors don't even use the 4 available scheduling resources in each core. In other words, the number of applications that would benefit from being able to execute more than 4 simultaneous instructions in the same core is vanishingly small. This is why most current processors have stopped at 3 or 4. Not because they haven't thought of pushing it beyond that, but because it is expensive, and because it yields very little return on the investment. Very few real-world users would see any performance benefit if the current cores on the market were any wider than 3 or 4. Most of those users aren't even using the 4 that are currently available.
Certainly the ability to do 1024 operations simultaneously in a single core is impressive. But it is not an ability that magically solves any of the current bottlenecks in multi-threaded software design. Most software application developers have difficulty figuring out what to do with multiple cores. Those same developers would have just as much (if not more) difficulty figuring out what to do with the extra resources in a core that can execute 1024 simultaneous operations.
In a minute there is time For decisions and revisions which a minute will reverse. -T.S. Eliot
I think it's mostly because the backronym is contrived and silly.
sic transit gloria mundi
you are absolutely right. no one should ever do any research into
something which doesn't ultimately look like an x86.
Right, let me begin by saying that after reading ftp://ftp.cs.utexas.edu/pub/dburger/papers/IEEECOMPUTER04_trips.pdf it actually became a bit more clear about what they were talking about.
It might sound very novel if you are only accustomed to normal processors. Look at MOVE http://www.everything2.com/index.pl?node_id=1032288&lastnode_id=0 to see what transport-triggered architectures are about. They are more power efficient, etc etc.
Secondly, they talk about how execution graphs are mapped onto their processing grid. I don't think any scheduler has a problem with scheduling an execution graph (or whatever name you give it) to an architecture. Generally, it can be scheduled in-time (there is a critical path somewhere) or it is scheduled with a certain degree (generally >.9 efficient) of optimality. I don't see the gain there in efficiency.
Now here comes the shameless self-plug. If you want to gain efficiency in scheduling a node of an execution graph you have to know which node is more critical than the other. The critical nodes (the ones on the critical path) need to be scheduled to the fast/optimized processing units and the others can be scheduled to slow/efficient processing units (and they can get some communication delays without penalty). Look http://ce.et.tudelft.nl/publicationfiles/786_11_dhofstee_v1.0_18july2003_eindverslag.pdf here for my thesis.
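For anyone who wants to see the critical-node idea in code, here is a small sketch using the textbook slack computation: a node whose earliest and latest start times coincide is on the critical path. This is not taken from the thesis; `assign_units`, the node names, and the durations are hypothetical, and it needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter


def assign_units(durations, preds):
    """Label each node 'fast' (zero slack, critical) or 'slow' (has slack).

    durations: {node: execution time}
    preds:     {node: [nodes it depends on]}
    """
    order = list(TopologicalSorter(preds).static_order())

    # Forward pass: earliest time each node can start.
    earliest = {}
    for n in order:
        earliest[n] = max(
            (earliest[p] + durations[p] for p in preds.get(n, ())), default=0
        )
    total = max(earliest[n] + durations[n] for n in order)

    # Backward pass: latest start that does not lengthen the schedule.
    succs = {n: [] for n in order}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    latest = {}
    for n in reversed(order):
        latest[n] = min((latest[s] for s in succs[n]), default=total) - durations[n]

    return {n: "fast" if latest[n] == earliest[n] else "slow" for n in order}


durations = {"a": 2, "b": 1, "c": 3, "d": 1}
preds = {"a": [], "b": [], "c": ["a"], "d": ["b", "c"]}
print(assign_units(durations, preds))
```

Here `b` can start up to 4 time units late without delaying `d`, so it is the one node that could go to a slow/efficient unit and absorb some communication delay.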
nosig today
EPIC (i.e. Itanium) is still based on centralized structures like register files. To create a 16-issue EPIC processor, you'd need a ~32R/16W port register file which would be virtually impossible to build because it would be so huge and power-hungry. Also, EPIC needs heroic compiler optimizations to overcome its in-order execution, while EDGE is naturally out-of-order.
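The ~32R/16W figure falls out of assuming two source reads and one result write per issued operation. A rough sketch of that arithmetic follows. The function names are hypothetical, and the quadratic area scaling is a common rule of thumb for multiported register files, not a number from the article.

```python
def regfile_ports(issue_width, reads_per_op=2, writes_per_op=1):
    """Read and write ports needed to feed `issue_width` operations per cycle."""
    return issue_width * reads_per_op, issue_width * writes_per_op


def relative_area(issue_width, baseline_width=4):
    """Area relative to a baseline, ASSUMING area grows with (total ports)^2."""
    ports = sum(regfile_ports(issue_width))
    base = sum(regfile_ports(baseline_width))
    return (ports / base) ** 2


print(regfile_ports(16))   # read and write ports for a 16-issue machine
print(relative_area(16))   # versus a 4-issue baseline, under the assumption above
```

Under that assumption, going from 4-issue to 16-issue multiplies the register file area by about 16, which is the kind of growth the comment is objecting to.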
Here is the slashdot article from 2003 about this processor: link
The specs have been updated to 1024 from 512, but that's about it.
Another 3-5 years out?
Don't steal. The government hates competition.
I apologize if I butcher some of the details, but I highly recommend that anyone interested peruse the TRIPS website.
http://www.cs.utexas.edu/~trips/
They have several papers available that lay out the rationale for the architecture.
The designers of this architecture believed that conventional architectures were going to run into physical limitations that would prevent them from scaling further. One of the issues they foresaw was that, as feature size continued to shrink and die size continued to increase, chips would become susceptible to, and ultimately constrained by, wire delay, meaning that the time it takes to send a signal from one part of a chip to another would constrain the ultimate performance. To some extent the shift in focus to multi-core CPUs validates some of their beliefs.
To address the wire delay problem the architecture attempts to limit the length of signal paths through the CPU by having instructions send their results directly to their dependent instructions instead of using intermediate architectural registers. TRIPS is similar to VLIW in that many small instructions are grouped into larger instructions (Blocks) by the compiler. However it differs in how the operations within the block are scheduled.
TRIPS does not depend on the compiler to schedule the operations making up a block like a VLIW architecture does. Instead the TRIPS compiler maps the individual operations making up a large TRIPS instruction block to a grid of execution units. Each execution unit in the grid has several reservation stations, effectively forming a 3 dimensional execution substrate.
By having the compiler assign data-dependent instructions to execution units that are physically close to one another, the communication overhead on the chip can be reduced. The individual operations wait for their operands to arrive at their assigned execution unit. Once all of an operation's dependencies are available, the operation fires and its result is forwarded to any waiting instruction. In this way the operations making up the TRIPS block are dynamically scheduled according to the data flow of the block, and the amount of communication that has to occur across large distances is limited. Once an entire block has executed, it can be retired and its results can be written to a register or memory.
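Here is a toy illustration of why placement matters, assuming a 2D grid where a result costs one hop per tile travelled (Manhattan distance). The brute-force search in `best_placement` is only for demonstration on tiny grids; it is not the TRIPS compiler's actual algorithm, and all names here are hypothetical.

```python
from itertools import permutations


def total_hops(placement, edges):
    """Sum of Manhattan distances between each producer and its consumer."""
    return sum(
        abs(placement[p][0] - placement[c][0]) + abs(placement[p][1] - placement[c][1])
        for p, c in edges
    )


def best_placement(nodes, edges, rows, cols):
    """Try every assignment of nodes to tiles and keep the cheapest one."""
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    best = None
    for perm in permutations(tiles, len(nodes)):
        placement = dict(zip(nodes, perm))
        cost = total_hops(placement, edges)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


# A dependency chain a -> b -> c -> d placed on a 2x2 grid.
NODES = ["a", "b", "c", "d"]
EDGES = [("a", "b"), ("b", "c"), ("c", "d")]
NAIVE = {"a": (0, 0), "b": (0, 1), "c": (1, 0), "d": (1, 1)}  # row-major order

print(total_hops(NAIVE, EDGES))                # row-major placement
print(best_placement(NODES, EDGES, 2, 2)[0])   # cheapest placement found
```

The row-major placement pays a two-hop diagonal between `b` and `c`, while snaking the chain around the grid keeps every producer next to its consumer.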
At the block level a TRIPS processor can still function much like a conventional processor. Blocks can be executed out of order, speculatively, or in parallel. They have also defined TRIPS as a polymorphous architecture, meaning the configuration and execution dynamics can be changed to best leverage the available parallelism. If code is highly parallelizable, it might make sense to map bigger blocks onto the substrate. By performing these types of operations at the level of a block instead of for each individual instruction, the overhead is, in theory, drastically reduced.
There is some flexibility in how the hardware can be utilized. For some types of software with a high degree of parallelism you may want very large blocks; when there is less data-level parallelism available, it may be better to schedule multiple blocks onto the substrate simultaneously. I'm not sure how the prototype is implemented, but the designers have several papers available where they discuss how a TRIPS-style architecture can be adapted to perform well on a wide gamut of software.
Oh, and before someone points this out for me, you have to imagine that the routing requirements are VASTLY improved. Imagine a grid of ALUs each connected by a single bus (simple), rather than 128 bypass buses all multiplexed into each ALU (chaos! don't forget the MUX logic!). You map one instruction to one (virtual) ALU, rather than one result to a (virtual) register. Then you pipeline/march each instruction with its partial data down the grid until all the inputs come in. Instructions continually cascade in the top of the grid, and commit out the bottom. But their results are available to feed other instructions as soon as they are computed! You never have to wait for a MUX or a bus or what-have-you. Plus, you can clock the whole thing EXTREMELY fast, because you don't have these wire delays from difficult routing requirements.
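A back-of-the-envelope count of the wiring difference, as a sketch only. It assumes a full bypass network where every operand input of every ALU can select any ALU's result, and it models the grid as a nearest-neighbor 2D mesh, which is my assumption rather than anything stated about the actual chip. Both function names are hypothetical.

```python
def full_bypass_mux_inputs(n_alus, operands_per_alu=2):
    """Mux inputs when each ALU operand can select any ALU's result."""
    return n_alus * operands_per_alu * n_alus


def mesh_links(rows, cols):
    """Nearest-neighbor links in a rows x cols grid (ASSUMED topology)."""
    return rows * (cols - 1) + cols * (rows - 1)


print(full_bypass_mux_inputs(128))  # grows with the square of the ALU count
print(mesh_links(8, 16))            # grows roughly linearly with the ALU count
```

For 128 ALUs that is tens of thousands of mux inputs against a couple of hundred short links, which is the gap the comment is pointing at.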