Slashdot Mirror


Next-Gen Processor Unveiled

A bunch of readers sent us word on the prototype for a new general-purpose processor with the potential of reaching trillions of calculations per second. TRIPS (obligatory back-formation given in the article) was designed and built by a team at the University of Texas at Austin. The TRIPS chip is a demonstration of a new class of processing architectures called Explicit Data Graph Execution. Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. The article claims that current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.

7 of 183 comments

  1. Marketing hype? by faragon · · Score: 5, Informative

    Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. Current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.

    Is it just me, or are they just euphemistically rephrasing out-of-order execution?

    1. Re:Marketing hype? by smallfries · · Score: 5, Interesting

      Multiway branching is ancient, and it's not used much because it's very inefficient. At least half of the instruction stream after a branch will be canceled; two branches deep it is 75%, and so on. No matter how much parallelism you throw at this, there are only marginal gains to be made (an exponential increase in the number of execution units for a linear increase in depth). It still doesn't get around data dependencies, which will be the major bottleneck when looking that far ahead in the instruction stream.
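The 50% / 75% figures above follow from a simple model, sketched here under the assumption that both sides of every unresolved branch are executed and that exactly one path turns out to be the real one. The function names are made up for illustration.

```python
def paths_needed(depth: int) -> int:
    """Number of parallel paths when both sides of `depth` nested branches are followed."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 2 ** depth


def wasted_fraction(depth: int) -> float:
    """Share of speculative work that is canceled: only 1 of the 2**depth paths is kept."""
    return 1.0 - 1.0 / paths_needed(depth)


if __name__ == "__main__":
    # Linear growth in depth, exponential growth in hardware, diminishing useful share.
    for depth in range(1, 6):
        print(f"depth={depth} paths={paths_needed(depth):2d} "
              f"wasted={wasted_fraction(depth):.1%}")
```

This ignores branch prediction entirely; with a good predictor you would follow only the likely path, which is why eager execution of both sides is rarely worth the hardware.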

      Having read the articles that were easy to get to, and the abstract of the PhD student: this is buzzword bollocks. There is no innovation in what they have done. As other people have pointed out, this is a vector / datastream architecture. It's not a very good one at that. Although it has the "potential" to scale to teraflops, so does my toothbrush. On a 130 nm process they can fit 2 cores with 32-wide dispatch clocked at 500 MHz. My 7800 is fabbed on a 130 nm process with 24*4*4 = 384 operation wide vector dispatch. This prototype would hit about 16 billion ops/sec, versus 180 Gflop on the 7800. This is a long way from teraflops, and doesn't convince me that it can scale.
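The back-of-the-envelope numbers in that paragraph can be checked with a few lines. This is only a sketch of peak rates, assuming the "32-wide dispatch" means 2 cores x 16 ops per cycle as in the summary; the 180 Gflop figure is taken from the comment as-is, not derived.

```python
def peak_ops_per_second(cores: int, ops_per_cycle_per_core: int, clock_hz: int) -> int:
    """Peak issue rate: every core issues its full width on every cycle."""
    return cores * ops_per_cycle_per_core * clock_hz


# TRIPS prototype as described: 2 cores, 16 ops/cycle each, 500 MHz.
trips_peak = peak_ops_per_second(2, 16, 500_000_000)

# Dispatch width quoted for the 7800 in the comment: 24 * 4 * 4.
gpu_width = 24 * 4 * 4

# Throughput the comment claims for the 7800 (quoted, not computed here).
gpu_claimed_flops = 180_000_000_000

if __name__ == "__main__":
    print(f"TRIPS peak:  {trips_peak / 1e9:.0f} Gops/s")
    print(f"7800 width:  {gpu_width} ops")
    print(f"Ratio (claimed 7800 / TRIPS peak): {gpu_claimed_flops / trips_peak:.2f}x")
    print(f"Gap to 1 teraop/s: {1e12 / trips_peak:.1f}x")
```

Peak issue rate is an upper bound, not a sustained rate, so the comparison is only as good as the assumption that both chips stay fully fed.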

      As the 7800 is close to a systolic model there is a limited class of programs that can be executed; but those that are in that class exhibit (near)perfect parallelism and so have zero hit from memory access costs. Actually the internal bandwidth on the 7800 is a bottleneck for some computations but I'm just going for coarse detail here.

      EDGE appears to mix and match ideas from several parallel designs, every one of which suffers from hard code generation problems. I suspect that the only sample applications that hit 32 ops / cycle are media apps (or dataflow problems as they used to be called), which normal architectures run at high speed anyway.

      Interesting research, as it's always good to see people explore different designs, but it sounds overhyped and I believe that it has zero commercial appeal. Finally, as a side note, you are right about cache latencies being a memory defect rather than a processor one, but there are ways around it. If you are willing to limit yourself to a certain class of applications (roughly the same one that executes well on most parallel architectures such as this, or GPUs) then you can completely avoid the latency. This provides a much bigger performance hike than any other technique, as memory latency is a dominating factor in most runtimes. The only snag is that it is very hard to do, requires different fabrication technology (largely solved now), and lots of compiler advances... If you're interested then google for intelligent RAM. It's about a decade of research now...

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
  2. Re:Hm... by volsung · · Score: 5, Informative

    The vector processors never went away. They just became your graphics card: 128 floating point units at your command.

    BTW, here is a real article on TRIPS.

  3. Gets rid of the register-file by DrDitto · · Score: 5, Insightful

    The EDGE architecture gets rid of relying on a single register file to communicate results between instructions. Instead, a producer-consumer ISA directly sends results to one of 128 instructions in a superblock (sort of like a basic block, but larger). In this way, hopefully more instruction-level parallelism can be extracted because superscalars can't really go beyond 4-wide (8-wide is a stretch...DEC was attempting this before Alpha was killed). Nice concept, but it doesn't solve many pressing problems in computer architecture, namely the memory wall and parallel programmability.

  4. Re:I want one by ackthpt · · Score: 5, Funny

    But when are they likely to be ready?



    • You know they'll be ready when Intel places large orders for aluminium for heatsinks.
    • You know they'll be ready when there's a sudden drop in prices of the current Hot CPUs, which are all proven but suddenly look like last month's pizza from under the couch.
    • You know they'll be ready when AMD hasn't said anything and they are suddenly shipping them, while Intel tells you in 9 mos. then suddenly says 3 mos. (and you can hear the whips cracking through the walls.)
    • You know they'll be ready when Microsoft doesn't have an operating system ready, but there are a dozen Linux distros good to go.
    --

    A feeling of having made the same mistake before: Deja Foobar
  5. Don't dismiss it by er824 · · Score: 5, Informative

    I apologize if I butcher some of the details, but I highly recommend that anyone interested peruse the TRIPS website.

    http://www.cs.utexas.edu/~trips/

    They have several papers available that lay out the rationale for the architecture.

    The designers of this architecture believed that conventional architectures were going to run into some physical limitations that would prevent them from scaling further. One of the issues they foresaw was that as feature size continued to shrink and die size continued to increase, chips would become susceptible to, and ultimately constrained by, wire delay, meaning the amount of time it takes to send a signal from one part of a chip to another would constrain the ultimate performance. To some extent the shift in focus to multi-core CPUs validates some of their beliefs.

    To address the wire delay problem the architecture attempts to limit the length of signal paths through the CPU by having instructions send their results directly to their dependent instructions instead of using intermediate architectural registers. TRIPS is similar to VLIW in that many small instructions are grouped into larger instructions (Blocks) by the compiler. However it differs in how the operations within the block are scheduled.

    TRIPS does not depend on the compiler to schedule the operations making up a block like a VLIW architecture does. Instead the TRIPS compiler maps the individual operations making up a large TRIPS instruction block to a grid of execution units. Each execution unit in the grid has several reservation stations, effectively forming a three-dimensional execution substrate.

    By having the compiler assign data dependent instructions to execution units that are physically close to one another, the communication overhead on the chip can be reduced. The individual operations wait for their operands to arrive at their assigned execution unit. Once all of an operation's dependencies are available, the operation fires and its result is forwarded to any waiting instruction. In this way the operations making up the TRIPS block are dynamically scheduled according to the data flow of the block, and the amount of communication that has to occur across large distances is limited. Once an entire block is executed it can be retired and its results can be written to a register or memory.
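The firing rule described above can be sketched as a toy simulation. This is not the TRIPS ISA or its scheduler: the `Instr` layout, the `run_block` helper and the operand "slots" are all made up for illustration, it fires one instruction per step, and it ignores grid placement and wire latency entirely. Each instruction names its consumers directly instead of writing to a register.

```python
import operator
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Instr:
    op: Callable                 # function applied once all operands have arrived
    arity: int                   # number of operands the instruction waits for
    targets: list                # (consumer index, operand slot) pairs
    operands: dict = field(default_factory=dict)


def run_block(instrs, inputs):
    """Execute one block in dataflow order.

    `inputs` is a list of (instruction index, slot, value) triples injected at
    block entry, standing in for register reads.
    """
    ready = deque()
    results, fired = {}, []

    def deliver(idx, slot, value):
        ins = instrs[idx]
        ins.operands[slot] = value
        if len(ins.operands) == ins.arity:   # all operands present: ready to fire
            ready.append(idx)

    for idx, slot, value in inputs:
        deliver(idx, slot, value)

    while ready:
        idx = ready.popleft()
        ins = instrs[idx]
        value = ins.op(*(ins.operands[s] for s in range(ins.arity)))
        results[idx] = value
        fired.append(idx)
        for t_idx, t_slot in ins.targets:    # forward straight to the consumers
            deliver(t_idx, t_slot, value)
    return results, fired


# (a + b) * (a - b) with a = 7, b = 3
block = [
    Instr(operator.add, 2, [(2, 0)]),
    Instr(operator.sub, 2, [(2, 1)]),
    Instr(operator.mul, 2, []),
]
a, b = 7, 3
results, fired = run_block(block, [(0, 0, a), (0, 1, b), (1, 0, a), (1, 1, b)])
print(results[2], fired)
```

The multiply never appears in a schedule written by the compiler; it simply fires when both of its operands have shown up, which is the property the comment is describing.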

    At the block level a TRIPS processor can still function much like a conventional processor. Blocks can be executed out of order, speculatively, or in parallel. They have also defined TRIPS as a polymorphous architecture, meaning the configuration and execution dynamics can be changed to best leverage the available parallelism. If code is highly parallelizable, it might make sense to map bigger blocks. However, by performing these types of operations at the level of a block instead of for each individual instruction, the overhead is theoretically drastically reduced.

    There is some flexibility in how the hardware can be utilized. For some types of software with a high degree of parallelism you may want very large blocks; when there is less data level parallelism available, it may be better to schedule multiple blocks onto the substrate simultaneously. I'm not sure how the prototype is implemented, but the designers have several papers available where they discuss how a TRIPS style architecture can be adapted to perform well on a wide gamut of software.

  6. Re:Hm... by frank_adrian314159 · · Score: 5, Informative

    No, here are the real articles on TRIPS. These and many others can be found here.

    --
    That is all.