Next Generation Chip Research
Nyxs writes to tell us Sci-Tech-Today is reporting that researchers at the University of Texas are taking a new approach to designing microprocessor architecture. Doug Berger, a computer science professor at the University of Texas, and his colleagues hope to solve many of the pressing problems facing chip designers today with the new "microprocessor and instruction set architecture called Trips, or the Teraop Reliable Intelligently Adaptive Processing System."
apprently, one of the pressing problems that chip designers are facing is coming up with stupid, meaningless acronyms.
It doesn't actually look any different. 128 instruction per "block" executed in parallel, just like a superscalar processor. This has been around since the time of the Pentiums (The pentiums weren't VLIW, though). What exactly is new?
There are 11 types of people. Those who understand binary, those who don't and those who are sick of this lame joke.
The article states that this works by sending blocks of up to 128 instructions at a time to the processor, where "The processor "sees" and executes a block all at once, as if it were a single instruction..." Makes you wonder if they'd ever get close to that target, as IIRC, one instruction in seven on average is a conditional branch.
Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
Or perhaps it was the designers tripping..
There are 11 types of people. Those who understand binary, those who don't and those who are sick of this lame joke.
We can understand easily how a loop could be calculated as a function, if the contents of the loop block is composed solely of calculations. When this occurs, the output of the loop is simply a function of its input (f(x), if you will). However, computer scientists who think that programs can always be reduced to a simple function with given inputs have their heads too far in their books to see how the real world forces programs to be far removed from that ivory tower gobbledygook.
In the real world, you aren't typically performing calculations in loops. Rather, you are usually reading and writing to memory, which may or may not be cached. So it isn't just a matter of saying f(x), it is much more complicated and possibly dependent on memory which you have no way to determine until the loop iteration reaches that point. And then you'll still get the bottlenecks which plague us today. Memory isn't fast enough, devices aren't fast enough, too much time is spent waiting for I/O to complete.
Pushing as much brute-force computation off onto compilers is fine. Let them unroll loops and optimize functions. But what are the limits to this? Can we really optimize our way to 1-step loops? I don't think so, but the DOD seems to think it is possible.
Jesus saved me from my past. He can save you as well.
Overal they might make some things marginally more efficient, but they aren't solving any fundamental problems. They're simply moving some around slightly.
I seem to remember that Intel designed Merced (now the Itanium, known colloquially as the Itanic to reflect how well it's gone in the marketplace) to shift the burden of branch prediction and parallelism to the compiler. Or, in other words, the compiler was expected to mark instructions that were capable of running in parallel, and also to state which branches were likely to be taken.
All a great idea in theory; after all, the compiler should be able to figure out a fair amount of this information just by looking at the flow of data through the instructions (although it may not be so good at branch prediction; I'm not sufficiently strong on compiler theory and branch prediction to talk about that.) However, as can be seen by Itanium's (lack of) market success, the compiler technology just isn't there (or maybe we're using the wrong languages; there are, after all, languages that are designed to be inherently parallel.)
If this team can get it working the way they want to, maybe -- just maybe -- Itanium will find its niche after all. But let's not kid ourselves; this is a hard problem, and it's more likely that they'll make incremental improvements to the knowledge that's out there, rather than a major breakthrough.
This sounds really cool.
Bugs on the chip can lead to bad Trips
So Long and Thanks for All the Fish!
Their NEXT next generations chips will be powered entirely by buzzwords and acronyms.
I alluded to this in my earlier post. Some mathematical operations are simply loops over a seed input. A summation is one example. You can reduce the calculation of a summation from a long series (infinite, perhaps) of functions executed in a loop to a single function which is valid for all inputs (voila, Calculus).
So they say they can take loops in 128 blocks at a time and calculate the result in less than 128 loop steps. They are requiring the compiler to come up with a valid function for those 128 steps that will work for any initial parameters. If it works, it means that you are no longer executing 128 time, but only once. That is a speed-up of just over 2 orders of magnitude. Really, really amazing.
But does it work? Can they really ask the compiler to do that much work? Is the compiler capable of being that smart? The main thing I wonder is how well this works, and how optimized it can get when the main purpose of looping is not to calculate functions but to access memory which is itself not fast.
Jesus saved me from my past. He can save you as well.
> their code for parallel processing, and that's difficult or impossible for some applications.
>
> "The industry is running into a programmability wall, passing the buck to software and hoping the programmer
> will be able to write codes for their systems," he says.
So you want the programmer to be unaware of the parallel processing. Then the article goes off and says something stupid IMHO.
> a huge amount of control logic, control transistors that don't do any work -- they just consume power. Trips is trying to push some of that complexity back up into the compilerI thought the point of TRIPS was to make the chip do all the scheduling (ie the Data Flow architecture) rather than depend on the compiler generated sequence of instructions. As a hobbyist compiler dev, I'd like to note that the data flow architecture is the basis of all compiler optimizers (DAG), though the typical compiler dev is likely to use this input to allocate registers to minimize pipeline stalls. I admit that it can be done at the CPU level to some extent - then this is even stranger.
> Trips compiler sends executable code to the hardware in blocks of up to 128 instructions. The processor "sees" and executes a block all at once, as if it were a single instruction, greatly decreasing the overhead associatedSomehow this just shifts the hard work of peephole optimisation to the CPU to be done at real time. It would have been far better to do it in the compiler properly - something which needs extra memory and lots more processing than the code that is being executed.
All in all, I don't see this thing revolutionizing General purpose programming systems. Though what I call special purpose programming might be the way the future of programming might go - I'm no Gordon Moore.Quidquid latine dictum sit, altum videtur
only if the subsequent loops are dependant on data from the current loop.
something like
for(int i = n-1; i>0; i--){ n = n * i }
obviously the new value of n depends on the value for n calculated by the last loop so that might not be a good candidate to try and parallelize. (actually factorial is something that can be written to take advantage of instruction level parallelism (ILP), I choose not too simply for the example).
however, if you're doing something that is not dependant on previous loops, various forms of loop unrolling can exploit ILP.
take for example blending two images
for each row x and each column y, x++, y++
imageTarget[x][y] = 1/2 * imageSrc1[x][y] + 1/2 * imageSrc2[x][y]
one pixel does not depend on the result of the previous, there's no reason you can't do 2, 4, 8, 16 ect pixels inside each loop.
Some compilers can take advantage of this already in doing loop unrolling to utilize MMX or SSE (or similar SIMD instruction sets) instructions. It seems like Trips is an instruction set designed to aid the processor in finding and exploiting such ILP.
The usefullness of such massively parallel designs in general purpose computing is debatable I would say. On the whole there tend to be a lot more instructions with dependancies than those without. (obviously everything has some dependancies, I mean in such a manner that prevents ILP / loop unrolling).
Hardware has been moving towards more parallelism with super-scalar and multi-chip processing and more functional SIMD instruction sets, but software has gone only kicking and screaming into a more parallel world.
Athlon and Pentium 3, Pentium M can look at up to something like 14 x86 instructions and decode up to 3 of them per clock cycle. More often than not they can't find 3 suitable instructions to decode. I have a hard time believeing something is going to find 32 (16 per core, 2 cores on the prototype) for general purpose software.
This looks to me to be a combination of old and not so good idéas.
I have read about out of order execution and using data when ready at least 5 years ago in Hennesy and Pattersons book "Computer Architecture A Quantitative Approach". To me it sounds like a typical scoreboarding architecture.
And how he can claim that this will lead to less control logic someone else might be able to explain to me.
As for executing two instruction at once since their destination and value are the same sounds like a operation that will lead to more control logic. Besides doesnt most compilers optimize away these kinds of cases?
"This message was brought to you by Sarcasm and Troll Feeders United (or STFU, for you un-hip people)."
IS it just me, or does this approach sound very similar to VLIW (http://en.wikipedia.org/wiki/VLIW) architecture. The problem is that the branch prediction needs to be very accurate, for any kind of performance boost.
Which is why these types of architecture lend very well to sequences of operations that are very similar (video processing, etc.).
Will this work just as well in the general-computing sphere? No idea.
What this is *not* in any form is a general purpose CPU. It won't boot linux, plain and simple. This is for doing stream data processing such as compression or HPC simulations. I seem to remember in their presentation showing a prototype doing software-radio at a data rate usable for 802.11.
Don't look down on the Texans. It has one of the highest ranked computer engineer programs in the country. I've heard of Doug Berger before and we have read his research papers and use his simulators (made between him and Todd Austin of Wisconsin) in our graduate classes at CMU (I'm BS&MS ECE, CS '01).
Austin also has a high number of tech companies around - heck, AMD, IBM, Intel, Freescale, just to name a few. It's nicknamed Silicon Hills. UT may not have the legacies like that of MIT, CMU, Berkeley, Stanford, but they got a heck of a program going on there and they are catching up. Hook'em Horns!
Why wouldn't they have CS programs in Texas?
What, you think all they teach at Texas univiersities is agriculture and oil-related subjects?
Don't judge Texas until you've spent some time there. I hate the place, but I'm from Oklahoma where hating Texas is a requirement of citizenship.
Those who can't do, teach. Those who can't teach either, do tech support.
I had an interesting discussion with a chip designer the other day. We were talking about parallel processing, and I spouted the usual perceived wisdom "But isn't the problem with parallel processing that many problems are very difficult or impossible to do in parallel? And isn't programming in parallel really difficult?"
I found his answer very interesting, something like "that line of thinking comes from when computers weren't fast enough to do the basic things we wanted to do with them to do then. It's true, an application like a word processor is not a good problem to tackle with parallel processing - but we don't need to these days. Nearly all the stuff we want to do today - faster graphics, 3D video image and sound processing, processing massive amounts of data on the web, all the processing that goes into keeping the internet and telephone networks going - all of these problems are idea for parallel processing. What Google does - that's essentially parallel processing, isn't it?"
That kind of changed my perception of things and made me realise my mindset was way out of date.
The original announcement came in 2003:
http://www.utexas.edu/opa/news/03newsreleases/nr_It seems to me any serious research into microprocessors will be hampered by the fact that it will be completely inapplicable unless it dumbs itself down to ape the x86 instruction set. All current and future processor design advances will be defined as better and faster ways of making modern silicon pretend it's a member of a chip family that was obsolete when the first President Bush was in office. That's not progress. That's just kind of sad.
Heaven help any researcher if implementing their new chip design requires a new software paradigm that doesn't fit neatly into the OS/Application model, too. We're living in the perpetual now of 2000, and it's some boring shit. I want my future back.
Bah.
SoupIsGood Food
This is so true. We have designs broken on paper that works perfectly fine in silicon. But of course, on paper we assume the worst case of most things and is probably overly pessimistic.
What ends up happening is that parts are cherry picked before they're sold (with the costs passed down to the customers) or that the parts are binned and sold at different levels such as the case for Intel chips.
Increasingly methods to improve yield rates drive some of the design decisions, sometimes even at the architectural level, especially as the processes continue to shrink.
for(int i = n-1; i>0; i--){ n = n * i }
is probably internally transformed into the following grid in a 10-instructions TRIPS processor :read n(transmitted as a & b) => decr a (transmitted as a & d) => comp a,0 => mul a,b (result transmitted as c)
where a,b,c,d,e & f are buses wiring the instructions-grid cells to each other. Each instructions-grid cell can be viewed as a little processor without register that performs the instruction it has been programmed for as soon data is present on its inputs.=> decr d (transmitted as d & f) => comp d,0 => mul c,d (result transmitted as e)
=> decr f => comp f,0 => mul e,f
You can see in the previous example there is a fair amount of concurrence even with such a simple loop. The "new" thing is the loop unrolling is done by the hardware, not the compiler.
Kirinyaga
This is not some boring super scaler! Nor is it some vector processor!
in fact this is a complete departure from a von Neuman architecture. The architecture is called a Dataflow architecture. In one sentence a dataflow architecture is one where instruction execution is based on the availability of the instructions inputs not a program counter.
The article does a very bad job at conveying the fact that this is a relatively new idea. Like most reporting they report something thats been in research for some time as a huge breakthrough without describing it at all. Instead its really just an incremental step in dataflow computing research.
I work in a lab at the University of Washington on another dataflow architecture. Its a really interesting idea but it will take some time to develop and you're not going to get one on your desk for some years to come.
Is the guy who runs this machine named Captain Trips?
I recommend you read this paper. It gives a great overall picture of what TRIPS is all about and is actually really cool. (I read it about a year ago).
I am an ECE grad student at UT Austin so I know quite well of TRIPS. In fact I often speak with Doug Burger himself because he's the faculty advisor for the UT Marathon team, of which I am a member. (By the way, his name is "Burger" not "Berger"). I think TRIPS is an awesome concept and its exactly the kind of project that I wanted to be a part of when I became a grad student at UT. I also know Steve Keckler because I'm taking his advanced computer architecture course this semester, and we're actually spending a good chunk of time talking about TRIPS (course schedule).
Hero of Allacrost, a FOSS RPG for *NIX/*BSD/OS X/Win
Sure, I can try. All of this stuff about branch prediction is basically the result of something called 'pipelining.' The rational for pipelining goes something like this: an instruction on a modern computer chip is executed in several stages (fetch, decode, execute, and writeback, in an iconic sense) For any particular instruction you can't begin one stage before you've completed the previous stage. Different stages require different hardware on the chip, so in a non-pipelined CPU some parts of the chip are just sitting there much of the time, that is bad. The reigning solution to resolve this is pipelining. Each of the stages I listed above is segregated, and as an instruction exits one stage, another instruction begins that stage. This is all well and good except, what happens if the instruction being decoded depends on the results of the instruction being executed? The results are unknown, so do you sit and wait? You can get around this problem somewhat by complicating the chip a little to feed the results of in-process computations back to later instructions in the same pipeline that require them. But now you've got a branch, and you can't even tell what instruction to load next until you know what the condition on that branch is going to evaluate to, the best a chip can do in this case is guess (branch prediction) but if you're wrong you have to throw out all the speculative computations you did. Modern processors rely heavily on pipelining so an incorrect guess can set them back significantly, especially if they make a habbit of it.
Correct prediction keeps your instruction pipeline full. This is particularly important for code with long pipelines.
Incorrect prediction results in having to back out CPU state from the speculative execution that has already taken place (this is called "squashing" the mispredicted instructions), and effectively this loses the pipeline slots that were used to perform the mispredicted execution. From an outside perspective, these lost slots look like a pipeline latency.
(insert rude comment about GCC #pragma branch hinting and [lack of] basic block reordering to avoid cache busting on PPC here)
-- Terry
Actually, it sounds more like an FPGA. And, since VHDL is turing-equivalent, it would actually be possible to compile C code (such as the Linux kernel) into a gate array and run it on such a chip.
the homepage for the TRIPS project: http://www.cs.utexas.edu/users/cart/trips/ because the article doesn't do a good job at explaining the idea, which I think is very interesting. It's not mere branch prediction these people are talking about, and it's more than dumb parallel processing. They are basically fragmenting programs into small dataflow networks.
assignment != equality != identity
Pure functional programming languages will see a tremendous boost from architectures like Trips. In functional programming languages, variables are never assigned, thus making it possible for all parts of an expression to be executed simultaneously. With 128 instructions, it is possible that lots of algorithms that take lots of time when executed sequentially, will take constant time with this new architecture: matrix operations, quicksort, etc.
The article doesn't seem to agree:
So, it looks like they're trying to get Intel or AMD interested in producing a heterogeneous multi-core unit that includes their trippy core, in the hopes of keeping the number of cores (and their communications overhead) down to a minimum. Intel already has a form of (so-called) instruction-level parallelism with the Itanic, and it didn't work out too well (except maybe for crypto-heavy workloads). It's possible AMD will be mulling it over. One of the things they will have to worry about is whether a compiler can actually be written to use it, FTA:
With 128 instructions to schedule at once, that might provide a chance to actually keep all of the processing units on the chip busy. With the Itanic, it was really a challenge to do that, since you had to pull two floating point instructions out from somewhere in every clock cycle, something that not all workloads could accomplish, and I can see the compiler writers going crazy trying to produce some sorts of ultimately self-defeating hacks trying get that accomplished :)
The TRIPS homepage has nine published papers on how this design will work and a schematic diagram of what they're expecting the design to end up looking like. They are also promising simulators and compilers later this year.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)