Next-Gen Processor Unveiled
A bunch of readers sent us word on the prototype for a new general-purpose processor with the potential of reaching trillions of calculations per second. TRIPS (obligatory back-formation given in the article) was designed and built by a team at the University of Texas at Austin. The TRIPS chip is a demonstration of a new class of processing architectures called Explicit Data Graph Execution. Each TRIPS contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. The article claims that current high-performance processors typically are designed to sustain a maximum execution rate of four operations per cycle.
But when are they likely to be ready?
The article contains little more information than the blurb.
But it seems to me that we called this great new invention "vector processors" 15 years ago, and there is a reason they aren't around anymore.
"Many instructions in flight"=="huge pipeline flushes on context switches"+"huge branching penalties" anybody?
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
1. Copy some university press release to your blog
2. Make sure Google ads show up at the top of the page
3. Submit blog to Slashdot
4. Profit
Vista?
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
What truth?
There is no dupe
Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. Current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.
Is it just me, or are they euphemistically paraphrasing out-of-order execution?
Key Innovations: Explicit Data Graph Execution (EDGE) instruction set architecture
Scalable and distributed processor core composed of replicated heterogeneous tiles
Non-uniform cache architecture and implementation
On-chip networks for operands and data traffic
Configurable on-chip memory system with capability to shift storage between cache and physical memory
Composable processors constructed by aggregating homogeneous processor tiles
Compiler algorithms and an implementation that create atomically executable blocks of code
Spatial instruction scheduling algorithms and implementation
TRIPS Hardware and Software
Imagine a beowulf cluster of these!
Here, from the horse's mouth. Plus we don't have to keep that damn digg thing. Come on, guys. A little less fluff please.
What?
TRIPpy, dude....
So I assume it's software-compatible with 90% of the code that the 'general public' uses?
It did say 'general purpose', and if you try to create something better but different, you get slapped down eventually (like PowerPC Apples).
How about the TRIP UP processor, that needs a new set of feet. Or, how about the DRIPS processor that needs a paper towel to blot it dry. Or, how about, the NIPS processor, that needs nip/tuck. Or, how about the SHITS processor, that needs toilet paper. Little on the details, much on the promises is usually a bad sign.
The EDGE architecture gets rid of relying on a single register file to communicate results between instructions. Instead, a producer-consumer ISA directly sends results to one of 128 instructions in a superblock (sort of like a basic block, but larger). In this way, hopefully more instruction-level parallelism can be extracted because superscalars can't really go beyond 4-wide (8-wide is a stretch...DEC was attempting this before Alpha was killed). Nice concept, but it doesn't solve many pressing problems in computer architecture, namely the memory wall and parallel programmability.
The link has NO information.
The PDF here: has more information about EDGE.
The basic idea is that CISC/RISC architectures rely on storing intermediate data in registers (or in main memory on old skool CISC). EDGE bypasses registers: the output of one instruction is fed directly to the input of the next. No need to do register allocation while compiling. I'm still reading the PDF, but this sounds like a really neat idea.
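To make the "output fed directly to the input of the next" idea concrete, here is a rough Python sketch of rewriting register-style code into a form where each instruction names its consumers instead of a destination register. The instruction tuples, the function name to_target_form and the (consumer, operand slot) encoding are all made up for illustration; this is not the actual TRIPS encoding.

```python
# Hypothetical three-address code: (dest_register, op, source_registers).
register_code = [
    ("r1", "load", []),
    ("r2", "load", []),
    ("r3", "add", ["r1", "r2"]),
    ("r4", "mul", ["r3", "r1"]),
]

def to_target_form(code):
    """Replace register names with explicit (consumer index, operand slot) targets."""
    last_writer = {}                      # register -> index of its latest producer
    targets = [[] for _ in code]
    for i, (dest, op, srcs) in enumerate(code):
        for slot, reg in enumerate(srcs):
            producer = last_writer[reg]   # who produced the value we read?
            targets[producer].append((i, slot))
        last_writer[dest] = i
    # Registers are gone: each instruction is just (op, list of targets).
    return [(op, targets[i]) for i, (_, op, _) in enumerate(code)]

target_form = to_target_form(register_code)
for op, tgts in target_form:
    print(op, "->", tgts)
```

In this toy form the first load sends its value straight to slot 0 of the add and slot 1 of the mul, and no register name survives inside the block.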
The only question is, will this be so much better than existing ISAs to eventually replace them? -- even if only for specific applications like high-performance computing.
It seems like for every "realist" claiming that Moore's law will soon hit a ceiling, I see another ZOMG Breakthrough! Lately, the question I've been asking myself is, "Will we ever surpass it?"
Here is the web page for TRIPS, straight from UT Austin:
http://www.cs.utexas.edu/~trips/
Enjoy.
The motivations for this technology provided in the article ignore some rather basic facts.
They point out that current multi-core architectures put a huge burden on the software developer. This is true, but their claim that this technology will relieve that burden is dubious. They mention, for example, that current processing cores can typically only perform 4 simultaneous operations per core, and imply that this is some kind of weakness. They completely fail to mention that the vast majority of applications running on those processors don't even use the 4 available scheduling resources in each core. In other words, the number of applications that would benefit from being able to execute more than 4 simultaneous instructions in the same core is vanishingly small. This is why most current processors have stopped at 3 or 4. Not because they haven't thought of pushing it beyond that, but because it is expensive, and because it yields very little return on the investment. Very few real-world users would see any performance benefit if the current cores on the market were any wider than 3 or 4. Most of those users aren't even using the 4 that are currently available.
Certainly the ability to keep 1,024 instructions in flight in a single core is impressive. But it is not an ability that magically solves any of the current bottlenecks in multi-threaded software design. Most software application developers have difficulty figuring out what to do with multiple cores. Those same developers would have just as much (if not more) difficulty figuring out what to do with the extra resources in a core that can keep 1,024 instructions in flight.
With individual instructions no longer writing their results out to registers and, to quote you, "compiler algorithms and an implementation that create atomically executable blocks of code", does this not mean they can finally hide keys from us in the die of a general-purpose processor?
Right, let me begin by saying that after reading ftp://ftp.cs.utexas.edu/pub/dburger/papers/IEEECOMPUTER04_trips.pdf it actually became a bit more clear about what they were talking about.
It might sound very novel if you are only accustomed to normal processors. Look at MOVE http://www.everything2.com/index.pl?node_id=1032288&lastnode_id=0 to see what transport-triggered architectures are about. They are more power efficient, etc etc.
Secondly, they talk about how execution graphs are mapped onto their processing grid. I don't think any scheduler has a problem with scheduling an execution graph (or whatever name you give it) to an architecture. Generally, it can be scheduled in-time (there is a critical path somewhere) or it is scheduled with a certain degree (generally >.9 efficient) of optimality. I don't see the gain there in efficiency.
Now here comes the shameless self-plug. If you want to gain efficiency in scheduling a node of an execution graph you have to know which node is more critical than the other. The critical nodes (the ones on the critical path) need to be scheduled to the fast/optimized processing units and the others can be scheduled to slow/efficient processing units (and they can get some communication delays without penalty). Look http://ce.et.tudelft.nl/publicationfiles/786_11_dhofstee_v1.0_18july2003_eindverslag.pdf here for my thesis.
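For anyone who wants to see what "know which node is more critical" means in practice, here is a minimal Python sketch that finds the zero-slack nodes of an execution graph. It is a textbook critical-path calculation, not the method from the thesis; the function name critical_nodes, the latencies and the example graph are assumptions made up for illustration.

```python
def critical_nodes(latency, edges):
    """Return (set of zero-slack nodes, critical path length) for a DAG."""
    nodes = list(latency)
    succs = {n: [] for n in nodes}
    preds = {n: [] for n in nodes}
    for a, b in edges:
        succs[a].append(b)
        preds[b].append(a)

    # Topological order (Kahn's algorithm); the list grows while we walk it.
    indeg = {n: len(preds[n]) for n in nodes}
    order = [n for n in nodes if indeg[n] == 0]
    for n in order:
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                order.append(s)

    # Earliest start: every predecessor must have finished.
    earliest = {}
    for n in order:
        earliest[n] = max((earliest[p] + latency[p] for p in preds[n]), default=0)
    total = max(earliest[n] + latency[n] for n in nodes)

    # Latest start that still does not delay the end of the graph.
    latest = {}
    for n in reversed(order):
        latest[n] = min((latest[s] for s in succs[n]), default=total) - latency[n]

    return {n for n in nodes if latest[n] == earliest[n]}, total

# Hypothetical graph: a feeds b and c, both feed d; b is the slow one.
latency = {"a": 1, "b": 3, "c": 1, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
critical, total = critical_nodes(latency, edges)
print(sorted(critical), total)
```

Here a, b and d have no slack and would go to the fast units, while c can start up to two cycles late, so it could sit on a slow unit or tolerate some communication delay without lengthening the schedule.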
Can it run Linux?
A lot of this is due to the fact that most popular languages right now do not support concurrency very well. Most common languages are stateful, and state and concurrency are rather antithetical to one another. The solution is to gradually evolve toward languages that solve this either by forsaking state (Haskell, Erlang) or by using something like transactional memory for encapsulating state in a way that is easy to deal with (Haskell's STM, Fortress (I think), maybe some others).
Concurrency is not that hard to do well in the right setting.
And before anyone claims that Haskell and Erlang are impractical, there are many examples of "real world" programs written in them.
A few nice, and very useful ones are Yaws (for Erlang) and Darcs (for Haskell). There are many others (even quake clones), which I won't bother listing, but people can find them easily if they look.
Regarding concurrency, and its ease of use in these languages, I'm taking a machine learning class at the moment where most of the problems are computationally intensive, and could stand for improvement by making use of multiple cores. I do all of my assignments in Haskell, and not only are my solutions often shorter than those of my classmates (and they often work fine the first time they compile), but it's usually trivial to allow my application to scale nicely to as many cores as I can throw at it. It's worth mentioning, by the way, that most algorithms given in these classes are given under the assumption that people are using imperative languages, and even then, it's still easy. It takes a while to learn how to approach problems differently without mutable state, yes, but it's not as hard as some people make it out to be. I think it has more to do with the fact that people just don't like to learn anything new unless they absolutely are forced to do so, which is a pity.
By the way, there is a nice presentation from Tim Sweeney on what he would like future programming languages to look like, and there's a lot in there about functional programming, concurrency, and expressive (read: dependent) types.
data graph sounds suspiciously like some kind of branching transactional execution system.
...and the first round of lab testing for EPIC. If they keep this up, eventually they can independently invent Itanium1 (yuk).
I love how they skipped EPIC in their comparison section in the pdf.
Is it fast enough to run Vista?
I've worked in detail with a VLIW (Very Long Instruction Word) architecture, the TI 'C6x DSP. It has eight execution units (not all of which can perform the same operations, though there is a little overlap) which can all be active in a single cycle. However, the key is keeping all of the units busy.
While the C compiler for this architecture is incredibly good, there are situations where using raw assembly (quite hard because of pipelining issues) or "compiled assembly" (easier, since you write in the order you wish operations to occur, and the compiler schedules the pipeline for you) gives better performance.
In short, no matter how much hardware folks can throw at a computing problem, the issue is adapting lots of different kinds of software to the architecture. Sounds like the compiler is going to have to be very good, or else there will have to be some runtime mojo to keep all of the chip doing something useful.
Does it run Linux?
That is food for thought.... thanks! ;-)
You can read more about it here...
Actually, from what I can tell it's more like a VLIW with its program chopped up into horizontal and vertical microcode "chunks" for more efficient register forwarding, than a vector processor...
I figure that it chops up the code into 128-instruction chunks (or smaller if there are branch dependencies that can't be done with predicates) and schedules it horizontally (the classic wide VLIW microcode which feeds independent instruction pipelines), and vertically (the sequence that can distribute over time and use register forwarding paths). The pipelines seem to be loosely coupled through reservation stations and the forwarding done with low bandwidth wormhole routes so it isn't as rigid as a classic VLIW machine.
I doubt it does that much better with normal scalar code (which has lots of branches), but it probably is much better than a vector processor would be with irregular code.
will the new name be captain trips?
...nobody is going to use it if it's not x86 compatible.
Here is the slashdot article from 2003 about this processor: link
The specs have been updated to 1024 from 512, but that's about it.
Another 3-5 years out?
The big thing that all the commenters I've read so far have missed is that OOO execution is difficult not because it's hard to put many ALUs on a chip (vector design, anyone?) but because in a general-purpose processor the register file and routing complexity grows as N^2 in the number of units. That's bad. Every unit has to communicate with every other unit (via the register file or, more commonly, via bypasses to an OOO buffer for every stage prior to writeback). The issue being addressed here is wiring complexity which, as modern designers would tell you, is a much harder problem than designing fast logic. Routing is hard. Plunking down more ALUs is easy. If you eliminate the register file, and design your processor and ISA to feed instructions in a data-flow manner to thousands of ALUs, then you may be able to vastly simplify routing requirements, thereby decreasing the length of your critical-path electrical circuits, thereby allowing the processor to clock faster. (Data-flow execution is executing instructions when their data inputs are ready, rather than following the compiler-optimized order, which does not have the run-time information that the hardware has.) If you are clever about your compiler, and make your hardware wide enough, you can for example speculatively execute both sides of a branch until it is resolved, thus eliminating a certain percentage of pipeline stalls for branch mispredicts. Similarly, with data-prediction you can speculate during cache misses. The list goes on. This is a very new and different paradigm (ugly word) for CPUs which may lead to higher IPC. This isn't the single golden goose, but it's a very different way of looking at the problem of pushing more instructions through a processor at higher speeds.
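To put rough numbers on the N^2 point, here is a back-of-the-envelope Python sketch. The counting model is my own simplification: one forwarding path per producer/consumer pair for the all-to-all case, versus nearest-neighbour links in a square grid of units. The grid is an assumption for comparison, not a description of the actual TRIPS layout, and both function names are made up.

```python
def full_bypass_paths(n_units):
    """Forwarding paths if every unit can send its result to every unit."""
    return n_units * n_units

def mesh_links(rows, cols):
    """Nearest-neighbour links in a rows x cols grid of units."""
    return rows * (cols - 1) + cols * (rows - 1)

for side in (2, 4, 8):
    n = side * side
    print(f"{n:3d} units: all-to-all {full_bypass_paths(n):5d}, grid {mesh_links(side, side):4d}")
```

Going from 16 to 64 units takes the all-to-all count from 256 to 4096, while the grid goes from 24 to 112 links. The price is that a result may need several hops to reach a distant unit, which is why it matters where the compiler places dependent instructions.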
I apologize if I butcher some of the details, but I highly recommend that anyone interested peruse the TRIPS website.
http://www.cs.utexas.edu/~trips/
They have several papers available that lay out the rationale for the architecture.
The designers of this architecture believed that conventional architectures were going to run into some physical limitations that were going to prevent them from scaling further. One of the issues they foresaw was that as feature size continued to shrink and die size continued to increase, chips would become susceptible to, and ultimately constrained by, wire delay, meaning the amount of time it takes to send a signal from one part of a chip to another would constrain the ultimate performance. To some extent the shift in focus to multi-core CPUs validates some of their beliefs.
To address the wire delay problem, the architecture attempts to limit the length of signal paths through the CPU by having instructions send their results directly to their dependent instructions instead of using intermediate architectural registers. TRIPS is similar to VLIW in that many small instructions are grouped into larger instructions (blocks) by the compiler. However, it differs in how the operations within the block are scheduled.
TRIPS does not depend on the compiler to schedule the operations making up a block like a VLIW architecture does. Instead the TRIPS compiler maps the individual operations making up a large TRIPS instruction block to a grid of execution units. Each execution unit in the grid has several reservation stations, effectively forming a three-dimensional execution substrate.
By having the compiler assign data-dependent instructions to execution units that are physically close to one another, the communication overhead on the chip can be reduced. The individual operations wait for their operands to arrive at their assigned execution unit; once all of an operation's dependencies are available, the operation fires and its result is forwarded to any waiting instructions. In this way the operations making up the TRIPS block are dynamically scheduled according to the data flow of the block, and the amount of communication that has to occur across large distances is limited. Once an entire block has executed, it can be retired and its results can be written to a register or memory.
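The firing rule described above can be sketched in a few lines of Python. This is a toy model under my own assumptions: the block format (name -> function, operand count, list of targets), the function run_block and the example block are hypothetical, and placement on a grid and timing are ignored entirely.

```python
import operator

def run_block(instrs, inputs):
    """Fire each instruction once all of its operands have arrived.

    instrs: name -> (function, operand_count, [(target_name, slot), ...])
    inputs: (target_name, slot, value) triples delivered at block entry
    """
    operands = {name: {} for name in instrs}   # operands received so far
    results, fired_order = {}, []
    pending = list(inputs)                     # operand messages still in transit
    while pending:
        target, slot, value = pending.pop(0)
        operands[target][slot] = value
        fn, count, targets = instrs[target]
        if len(operands[target]) == count:     # all dependencies available
            out = fn(*(operands[target][i] for i in range(count)))
            results[target] = out
            fired_order.append(target)
            for t, s in targets:               # forward result to waiting consumers
                pending.append((t, s, out))
    return results, fired_order

# Hypothetical block computing (a + b) * (a - b) with a=5, b=3.
block = {
    "add": (operator.add, 2, [("mul", 0)]),
    "sub": (operator.sub, 2, [("mul", 1)]),
    "mul": (operator.mul, 2, []),
}
entry = [("add", 0, 5), ("sub", 0, 5), ("add", 1, 3), ("sub", 1, 3)]
results, fired_order = run_block(block, entry)
print(results["mul"], fired_order)
```

Note that nothing in the block says "run add before mul"; the order falls out of when operands arrive, which is the difference from a statically scheduled VLIW bundle.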
At the block level a TRIPS processor can still function much like a conventional processor. Blocks can be executed out of order, speculatively, or in parallel. However, by performing these types of operations at the level of a block instead of for each individual instruction, the overhead is theoretically drastically reduced. They have also defined TRIPS as a polymorphous architecture, meaning the configuration and execution dynamics can be changed to best leverage the available parallelism. If code is highly parallelizable it might make sense to allow bigger blocks to be mapped.
There is some flexibility in how the hardware can be utilized. For some types of software with a high degree of parallelism you may want very large blocks; when there is less data-level parallelism available it may be better to schedule multiple blocks onto the substrate simultaneously. I'm not sure how the prototype is implemented, but the designers have several papers available where they discuss how a TRIPS-style architecture can be adapted to perform well on a wide gamut of software.
whatever happened to the laser cpu developed in iran years ago?
just move just move along now it's just more nothing to see here, along now nothing to see here,parallel processing it's just move along now it's just nothing to see here, just more parallel processing just move more parallel along now nothing to see here, more parallel processing it's just processing
Absolute statements are never true
Skynet
Come on guys. TRIPS has been around for something like 4 years.
but will it play Doom?
Cool... I have lots of explicit data!
...I guess this is gonna be trippy, all right.
wake me up when they have trillions of jumps and memory reads/writes...
And it's true. You build a sea of ALUs and you sic some folks on hand coding all sorts of things to the machine, and you end up with some spectacular results.
The problem is that we still can't get a compiler to do a good job at it, for the most part. We thought we could, and we threw every bell and whistle into IA64
for a compiler-controlled architecture, and you've seen what we've ended up with. Many years later, the situation is still pretty much the same: the compiler
can't do all that great of a job with these sorts of machines.
Don't get me wrong, there are lots of good ideas in TRIPS or any of the various other academic projects like it, but I'm yet to be convinced that it's useful in
any kind of real codebase that's not coded by hand by an army of graduate students. For some tasks, that's an acceptable model -- It's been the model in the world of
signal processing for quite a while (though becoming less so daily) -- but for most mainstream applications it just won't fly.
That, and it's hard for compilers to have knowledge about history. It's terribly important for optimization, and it's just hard to get into the compiler (though relatively
easy to get into a branch predictor).
-- Erich
Slashdot reader since 1997
Depends on the compiler. This sounds like MLRISC should have no problem targeting it.
X86 does not rule the world, you know!
- Xbox 360 contains a triple-core 3.2GHz PPC processor with two hardware threads per core (i.e. 6 threads).
- PS3 contains a Cell processor (one general purpose core and 7 SPUs that are basically DSPs).
- Wii runs on the same kind of chip as the GameCube (can't remember, but I think it's a PPC?)
By the way, IBM makes the processors for all three of the next-gen game consoles I just listed. All three of them use graphics processors made by either NVidia or ATI (I can't remember which ones use which).
- Many embedded devices (routers, cell phones, etc) use MIPS, ARM or PPC processors.
- Big iron servers often use non-x86 processors (POWER and PPC for example).
Yes, X86-style processors continue to dominate in the desktop computer market, but there are a LOT of other processors out there that use other designs, because they are cheaper or lower power, and compatibility with existing legacy Wintel software is not needed so much in those markets.
There are only a few things that need a 10-500x increase in speed.
Video transcoding.
Rendering farms - need a $500 solution that can outdo a $35,000 solution, i.e. 10 x $35 chips on a $20 card + profit margin and yearly software licence.
Folding type apps.
Nuclear/Sci sims.
Is this anything like the last GPP prototype we heard about? The one that had a brain the size of a planet, but anything you connected to it tried to commit suicide?
http://www.youtube.com/watch?v=JLQzfdCs_HU
... A Beowulf cluster of these!
(sorry ... it had to be said!)
N.
Programmers are having enough problems writing code for the PS3 at the moment. Sony's support doesn't help; game companies are finding they have to ask IBM for help, since Sony knows only slightly more than they do about it. Just recently one of the lead PS3 launch managers set off to found an offshoot company to try and do some cool stuff with the Cell chip. It's almost as if they are saying, "Well, if we don't do it, who will?" TRIPS is still in the research stage; there will be very little in terms of libraries to take advantage of it, and even fewer APIs for anyone like EA to get their teeth into. Wake me up when this becomes news, please.
When you can show me a distributed memory parallel weather forecasting or climate prediction code written in Haskell, i.e. something that runs and scales well on large Linux clusters, has high interprocessor communication needs (both in terms of latency and bandwidth), and does a metric assload of floating point computations, I'll start to get interested. If you don't want to go so far as to include all the physics that go into weather and/or climate, just show me a Navier-Stokes simulator that has all those properties.
Yes...I am a rocket scientist.
After trips are quips, clearly: quadrillions+ instructions per second.
AI humor is unavoidable at this point.
There's your answer. All it has to do is run the existing code reasonably fast (as in, not too much slower than x86), and people will buy them, especially when they see hot new stuff coming out for it.
I don't think these are the real deal, but if they were -- if we did suddenly have a CPU that was ludicrously faster than our best x86 -- probably the first thing that would happen is, someone would port Linux+Qemu to it, and benchmark Windows in that vs Windows on real x86.