Next Generation Chip Research
Nyxs writes to tell us Sci-Tech-Today is reporting that researchers at the University of Texas are taking a new approach to designing microprocessor architecture. Doug Burger, a computer science professor at the University of Texas, and his colleagues hope to solve many of the pressing problems facing chip designers today with the new "microprocessor and instruction set architecture called Trips, or the Teraop Reliable Intelligently Adaptive Processing System."
Branches can be predicted with fairly high accuracy. And most new architectures have some form of speculation in the core. And they actually execute 16 instructions at once. Only their word is 128 instructions long.
There are 11 types of people. Those who understand binary, those who don't and those who are sick of this lame joke.
http://www.cs.utexas.edu/users/cart/trips/
The thing that's new (if you read the article) is that the instructions AREN'T executed in a specifically parallel fashion.
They are executed in a JIT (just in time) fashion.
Currently, with deep pipelines, results can sit in registers for a few cycles. This design aims to execute each instruction as soon as it can, so it needs a lot fewer registers to store results.
It also means instructions are executed out of order AND in parallel, in an effort to both increase speed and decrease chip complexity.
If you don't have to use a transistor for storage / control, you can use it for the good bits: generating your answer.
Only if the subsequent loops are dependent on data from the current loop.
Something like
for (int i = n-1; i > 0; i--) { n = n * i; }
Obviously the new value of n depends on the value of n calculated by the previous iteration, so that might not be a good candidate to try to parallelize. (Actually, factorial is something that can be written to take advantage of instruction-level parallelism (ILP); I chose not to, simply for the sake of the example.)
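To make that parenthetical concrete, here is a rough sketch in C of one way factorial could be rewritten so the hardware has independent work to overlap. The function names are made up for illustration, and this is only one possible split (even and odd factors into separate accumulators), not anything taken from the TRIPS material.

```c
#include <stdint.h>

/* Serial version: every multiply has to wait for the one before it. */
uint64_t factorial_serial(unsigned n)
{
    uint64_t result = 1;
    for (unsigned i = 2; i <= n; i++)
        result *= i;
    return result;
}

/* Split version: two accumulators that never read each other inside the
 * loop, so the two multiplies in one iteration are independent and a
 * superscalar or dataflow core is free to overlap them. */
uint64_t factorial_split(unsigned n)
{
    uint64_t even = 1, odd = 1;
    unsigned i = 2;
    for (; i + 1 <= n; i += 2) {
        even *= i;        /* factors 2, 4, 6, ... */
        odd  *= i + 1;    /* factors 3, 5, 7, ... */
    }
    if (i <= n)           /* one leftover factor when n is even */
        even *= i;
    return even * odd;    /* the only point where the two chains meet */
}
```

Both functions return the same value for any n whose factorial fits in 64 bits (n up to 20); the split version just shortens the longest dependency chain to roughly half.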
However, if you're doing something that is not dependent on previous iterations, various forms of loop unrolling can exploit ILP.
Take, for example, blending two images:
for (int x = 0; x < rows; x++)
    for (int y = 0; y < cols; y++)
        imageTarget[x][y] = 0.5 * imageSrc1[x][y] + 0.5 * imageSrc2[x][y];
One pixel does not depend on the result of the previous one, so there's no reason you can't do 2, 4, 8, 16, etc. pixels inside each iteration.
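As a sketch of what "doing 4 pixels inside each loop" looks like when written out by hand, here is the blend in C over flat pixel arrays. The function names and the flat float layout are assumptions made for the example; a compiler or the hardware would normally do this transformation for you.

```c
#include <stddef.h>

/* One pixel per iteration. */
void blend_simple(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = 0.5f * a[i] + 0.5f * b[i];
}

/* Unrolled by four: the four statements in the loop body touch different
 * pixels and share no results, so they can all be in flight at once. */
void blend_unrolled4(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = 0.5f * a[i]     + 0.5f * b[i];
        dst[i + 1] = 0.5f * a[i + 1] + 0.5f * b[i + 1];
        dst[i + 2] = 0.5f * a[i + 2] + 0.5f * b[i + 2];
        dst[i + 3] = 0.5f * a[i + 3] + 0.5f * b[i + 3];
    }
    for (; i < n; i++)    /* leftover pixels when n is not a multiple of 4 */
        dst[i] = 0.5f * a[i] + 0.5f * b[i];
}
```

The cleanup loop at the end is the usual price of unrolling: it handles the one to three pixels left over when the count doesn't divide evenly.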
Some compilers can already take advantage of this, unrolling loops in order to use MMX or SSE (or similar SIMD instruction set) instructions. It seems like Trips is an instruction set designed to aid the processor in finding and exploiting such ILP.
The usefulness of such massively parallel designs in general purpose computing is debatable, I would say. On the whole there tend to be a lot more instructions with dependencies than without. (Obviously everything has some dependencies; I mean dependencies of the kind that prevent ILP / loop unrolling.)
Hardware has been moving towards more parallelism with super-scalar and multi-chip processing and more functional SIMD instruction sets, but software has gone only kicking and screaming into a more parallel world.
Athlon, Pentium 3, and Pentium M can look at up to something like 14 x86 instructions and decode up to 3 of them per clock cycle. More often than not they can't find 3 suitable instructions to decode. I have a hard time believing anything is going to find 32 (16 per core, 2 cores on the prototype) in general purpose software.
What this is *not*, in any form, is a general purpose CPU. It won't boot Linux, plain and simple. This is for stream data processing such as compression or HPC simulations. I seem to remember their presentation showing a prototype doing software radio at a data rate usable for 802.11.
Don't look down on the Texans. UT has one of the highest-ranked computer engineering programs in the country. I'd heard of Doug Burger before; we have read his research papers and use his simulators (written by him and Todd Austin of Wisconsin) in our graduate classes at CMU (I'm BS&MS ECE, CS '01).
Austin also has a high number of tech companies around: AMD, IBM, Intel, Freescale, just to name a few. It's nicknamed Silicon Hills. UT may not have the legacy of MIT, CMU, Berkeley, or Stanford, but they've got a heck of a program going on there and they are catching up. Hook'em Horns!
The original announcement came in 2003:
http://www.utexas.edu/opa/news/03newsreleases/nr_
for(int i = n-1; i>0; i--){ n = n * i; }
is probably internally transformed into the following grid in a 10-instruction TRIPS processor:
read n (transmitted as a & b) => decr a (transmitted as a & d) => comp a,0 => mul a,b (result transmitted as c)
=> decr d (transmitted as d & f) => comp d,0 => mul c,d (result transmitted as e)
=> decr f => comp f,0 => mul e,f
where a,b,c,d,e & f are buses wiring the instruction-grid cells to each other. Each instruction-grid cell can be viewed as a little processor without registers that performs the instruction it has been programmed for as soon as data is present on its inputs.
You can see in the previous example that there is a fair amount of concurrency even in such a simple loop. The "new" thing is that the loop unrolling is done by the hardware, not the compiler.
Kirinyaga
It really is different. It's not simply a superscalar. It's a data-flow machine. What this means is that instructions are arranged in a graph based on their dependencies and execute as soon as all of their inputs are ready.
I work in a lab at the University of Washington where we are working on _implementing_ a different data-flow machine that shares some fundamentals with the UT machine.
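For anyone who wants to see the "execute as soon as all inputs are ready" rule in action, here is a toy model in C. It is only a sketch of the general data-flow firing rule, not the TRIPS design or the UW machine; the struct layout, the names, and the two-opcode instruction set are all invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

enum opcode { OP_ADD, OP_MUL };

struct node {
    enum opcode op;
    int  in[2];       /* operand values */
    bool ready[2];    /* has each operand arrived yet? */
    bool fired;       /* has this node already executed? */
    bool selected;    /* scratch flag: picked to fire in the current wave */
    int  dest;        /* index of the consuming node, or -1 for none */
    int  dest_port;   /* which operand slot of the consumer to fill */
    int  out;         /* result, valid once fired */
};

/* Deliver a value to one operand slot of a node. */
void send(struct node *g, int node, int port, int value)
{
    g[node].in[port] = value;
    g[node].ready[port] = true;
}

/* Fire nodes until nothing else can fire. Returns the number of waves;
 * all nodes that fire in the same wave are independent of each other,
 * so real hardware could run them at the same time. */
int run(struct node *g, size_t n)
{
    int waves = 0;
    for (;;) {
        size_t count = 0;
        /* Phase 1: select every unfired node whose operands have arrived. */
        for (size_t i = 0; i < n; i++) {
            g[i].selected = !g[i].fired && g[i].ready[0] && g[i].ready[1];
            if (g[i].selected)
                count++;
        }
        if (count == 0)
            return waves;
        /* Phase 2: execute the selected nodes and forward their results. */
        for (size_t i = 0; i < n; i++) {
            if (!g[i].selected)
                continue;
            g[i].out = (g[i].op == OP_ADD) ? g[i].in[0] + g[i].in[1]
                                           : g[i].in[0] * g[i].in[1];
            g[i].fired = true;
            if (g[i].dest >= 0)
                send(g, g[i].dest, g[i].dest_port, g[i].out);
        }
        waves++;
    }
}
```

To compute (1 + 2) * (3 + 4), you would set up two OP_ADD nodes feeding ports 0 and 1 of an OP_MUL node, send() the four constants, and call run(). Both adds fire in the first wave and the multiply in the second, without any program counter deciding the order.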
Sure, I can try. All of this stuff about branch prediction is basically a consequence of something called 'pipelining.' The rationale for pipelining goes something like this: an instruction on a modern computer chip is executed in several stages (fetch, decode, execute, and writeback, in the classic sense). For any particular instruction, you can't begin one stage before you've completed the previous one. Different stages require different hardware on the chip, so in a non-pipelined CPU some parts of the chip just sit idle much of the time, which is bad.
The reigning solution is pipelining. Each of the stages listed above is separated, and as one instruction exits a stage, the next instruction enters it. This is all well and good, except: what happens if the instruction being decoded depends on the result of the instruction being executed? The result is unknown, so do you sit and wait? You can get around this problem somewhat by complicating the chip a little to feed the results of in-progress computations back to later instructions in the same pipeline that need them.
But now suppose you've got a branch, and you can't even tell which instruction to load next until you know what the branch condition will evaluate to. The best a chip can do in this case is guess (branch prediction), but if it guesses wrong it has to throw out all the speculative computation it did. Modern processors rely heavily on pipelining, so an incorrect guess can set them back significantly, especially if they make a habit of it.
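If it helps, the cost argument above can be put into a back-of-the-envelope model in C. This is a deliberately idealized sketch: it assumes one cycle per stage, no stalls other than mispredicted branches, and a flat penalty per misprediction. The function names and any numbers you plug in are illustrative, not measurements of a real chip.

```c
/* Cycles for n instructions when each one must pass through every stage
 * before the next one is allowed to start. */
unsigned long cycles_unpipelined(unsigned long n, unsigned stages)
{
    return n * stages;
}

/* Ideal pipeline: the first instruction needs `stages` cycles to get
 * through, after which one instruction completes every cycle. */
unsigned long cycles_pipelined(unsigned long n, unsigned stages)
{
    return n == 0 ? 0 : stages + (n - 1);
}

/* Pipeline with branch mispredictions. Assumption: each wrong guess
 * throws away a fixed `penalty` cycles of speculative work. */
unsigned long cycles_with_mispredicts(unsigned long n, unsigned stages,
                                      unsigned long mispredicts,
                                      unsigned penalty)
{
    return cycles_pipelined(n, stages) + mispredicts * penalty;
}
```

With 100 instructions and 4 stages, the unpipelined count is 400 cycles and the ideal pipelined count is 103. Add 10 mispredictions at 3 cycles each and you are at 133, and the penalty term grows with pipeline depth, which is why deep pipelines care so much about guessing right.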
Here is the homepage for the TRIPS project: http://www.cs.utexas.edu/users/cart/trips/ I'm linking it because the article doesn't do a good job of explaining the idea, which I think is very interesting. It's not mere branch prediction these people are talking about, and it's more than dumb parallel processing. They are basically fragmenting programs into small dataflow networks.
assignment != equality != identity
Parallelism-on-a-chip doesn't let us do anything we couldn't already do, and most applications that benefit from it are outside the domain of the general consumer. Faster graphics and sound, sure, these things might benefit from on-chip parallelism, but consider how many PCs the average consumer has. Now consider the number of embedded processors they have, whether they know it or not: their vehicle, the HVAC system in their home, cell phones, radios, televisions and so on. Clearly, embedded processors vastly outnumber PC processors and, as I said, essentially none of these benefit from parallel computing.
Now let's consider the benefits of hardware advances in embedded systems/realtime technologies. The smaller and faster a DSP chip can be, the smaller your cell phone can be and the more information can be packed into a limited signal bandwidth, just as one example. Sounds good to me.
Now let's consider the benefits of hardware advances in parallelism-on-a-chip. Very few, because corporations can string together many cheap PCs, while outside of video games consumers don't benefit much from parallelism.
Considering the availability of a cheap and effective substitute for parallelism on a chip, the relative prevalence of embedded systems, and the difference in potential gains from advances in each field, yes, I would say that any list that entirely discounts embedded/realtime systems is 'cherry-picked.'
The TRIPS homepage has nine published papers on how this design will work and a schematic diagram of what they're expecting the design to end up looking like. They are also promising simulators and compilers later this year.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)