Next Generation Chip Research

pressing problems by ed__ · 2005-10-03 18:03 · Score: 5, Funny

apprently, one of the pressing problems that chip designers are facing is coming up with stupid, meaningless acronyms.

Re:pressing problems by shanen · 2005-10-03 18:14 · Score: 3, Interesting

Small world, eh? A comment about the acronym, and my first reaction to the article was to remember TRAC, the Texas Reconfigurable Array Computer, which was something they were working on at the same school many years ago. Well, at least they didn't need "Texas" for the acronym this time, but I doubt anyone else remembers TRAC now.
Disclaimer: In spite of having a degree from the school, I have a very low opinion of it. Yeah, it's large enough physically, and they had some oil money, but IMO they optimized towards narrow-minded mind-narrowing efficiency rather than breadth. Real education is about the breadth. Unfortunately, these days I feel as though my real alma mater seems to be following a similar path to mediocrity.

--
Freedom = (Meaningful - Coerced) Choice != (Speech | Beer^2), and sad sock puppets' bad mods avail them naught.

Is this simply a VLIW architecture? by Anakron · 2005-10-03 18:03 · Score: 4, Insightful

It doesn't actually look any different. 128 instruction per "block" executed in parallel, just like a superscalar processor. This has been around since the time of the Pentiums (The pentiums weren't VLIW, though). What exactly is new?

--
There are 11 types of people. Those who understand binary, those who don't and those who are sick of this lame joke.

Re:Is this simply a VLIW architecture? by nzkbuk · 2005-10-03 18:41 · Score: 3, Informative

The thing that's new (if you read the article) is the instructions AREN'T executed specifically in a parallel fashon.
They are executed in a JIT (just in time) fashon.
currently with deep pipelines results can get stored in registers for a few cycles. this aims to execute instructions as soon as it can. That way it's needing alot less registers to store results.

It's also meaning instructions are executed out of order AND in parallel in an effort to both increase speed and decrease chip complexity.
If you don't have to use a transistor for storage / control, you can use it for the good bits, generating your answer.
Re:Is this simply a VLIW architecture? by Takahashi · 2005-10-03 20:41 · Score: 4, Informative

It really is different. Its not simply a super scaler. It's a data-flow machine. What this means is that instructions are arranged in a graph based on dependency and execute as soon as all inputs are ready.

I work in a lab at the University of Washington where we are working on _implementing_ a different data flow machine that shares some of fundamentals with the UT machine.

Branching by shmlco · 2005-10-03 18:05 · Score: 2, Interesting

The article states that this works by sending blocks of up to 128 instructions at a time to the processor, where "The processor "sees" and executes a block all at once, as if it were a single instruction..." Makes you wonder if they'd ever get close to that target, as IIRC, one instruction in seven on average is a conditional branch.

--
Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.

Re:Branching by Anakron · 2005-10-03 18:11 · Score: 4, Informative

Branches can be predicted with fairly high accuracy. And most new architectures have some form of speculation in the core. And they actually execute 16 instructions at once. Only their word is 128 instructions long.

--
There are 11 types of people. Those who understand binary, those who don't and those who are sick of this lame joke.

Re:No by Anakron · 2005-10-03 18:05 · Score: 2, Funny

Or perhaps it was the designers tripping..

--
There are 11 types of people. Those who understand binary, those who don't and those who are sick of this lame joke.

Loops as functions? by ReformedExCon · 2005-10-03 18:12 · Score: 4, Interesting

We can understand easily how a loop could be calculated as a function, if the contents of the loop block is composed solely of calculations. When this occurs, the output of the loop is simply a function of its input (f(x), if you will). However, computer scientists who think that programs can always be reduced to a simple function with given inputs have their heads too far in their books to see how the real world forces programs to be far removed from that ivory tower gobbledygook.

In the real world, you aren't typically performing calculations in loops. Rather, you are usually reading and writing to memory, which may or may not be cached. So it isn't just a matter of saying f(x), it is much more complicated and possibly dependent on memory which you have no way to determine until the loop iteration reaches that point. And then you'll still get the bottlenecks which plague us today. Memory isn't fast enough, devices aren't fast enough, too much time is spent waiting for I/O to complete.

Pushing as much brute-force computation off onto compilers is fine. Let them unroll loops and optimize functions. But what are the limits to this? Can we really optimize our way to 1-step loops? I don't think so, but the DOD seems to think it is possible.

--
Jesus saved me from my past. He can save you as well.

Boring by Rufus211 · 2005-10-03 18:18 · Score: 2, Interesting

So glancing over the article it doesn't look like they're actually doing anything "new." Basically expanding on register renaming, speculitive execution, and the likes which making the cpu's job slighty easier to do it. Also their bit about data flow and "direct target encoding" sounds oddly like this patent by Cray from 1976 (!).

Overal they might make some things marginally more efficient, but they aren't solving any fundamental problems. They're simply moving some around slightly.

Isn't this what Intel tried to do with Merced? by Anonymous Coward · 2005-10-03 18:19 · Score: 3, Insightful

I seem to remember that Intel designed Merced (now the Itanium, known colloquially as the Itanic to reflect how well it's gone in the marketplace) to shift the burden of branch prediction and parallelism to the compiler. Or, in other words, the compiler was expected to mark instructions that were capable of running in parallel, and also to state which branches were likely to be taken.

All a great idea in theory; after all, the compiler should be able to figure out a fair amount of this information just by looking at the flow of data through the instructions (although it may not be so good at branch prediction; I'm not sufficiently strong on compiler theory and branch prediction to talk about that.) However, as can be seen by Itanium's (lack of) market success, the compiler technology just isn't there (or maybe we're using the wrong languages; there are, after all, languages that are designed to be inherently parallel.)

If this team can get it working the way they want to, maybe -- just maybe -- Itanium will find its niche after all. But let's not kid ourselves; this is a hard problem, and it's more likely that they'll make incremental improvements to the knowledge that's out there, rather than a major breakthrough.

Reduction in register use by Cave_Monster · 2005-10-03 18:20 · Score: 2, Interesting

FTA ... Finally, data flow execution is enabled by "direct target encoding," by which the results from one instruction go directly to the next consuming instruction without being temporarily stored in a centralized register file.

This sounds really cool.

It uses the LSD technology ... by Articuno · 2005-10-03 18:26 · Score: 5, Funny

Bugs on the chip can lead to bad Trips

--
So Long and Thanks for All the Fish!

In other news. by TCaM · 2005-10-03 18:26 · Score: 3, Funny

Their NEXT next generations chips will be powered entirely by buzzwords and acronyms.

Re:I don't get it... by ReformedExCon · 2005-10-03 18:32 · Score: 4, Interesting

I alluded to this in my earlier post. Some mathematical operations are simply loops over a seed input. A summation is one example. You can reduce the calculation of a summation from a long series (infinite, perhaps) of functions executed in a loop to a single function which is valid for all inputs (voila, Calculus).

So they say they can take loops in 128 blocks at a time and calculate the result in less than 128 loop steps. They are requiring the compiler to come up with a valid function for those 128 steps that will work for any initial parameters. If it works, it means that you are no longer executing 128 time, but only once. That is a speed-up of just over 2 orders of magnitude. Really, really amazing.

But does it work? Can they really ask the compiler to do that much work? Is the compiler capable of being that smart? The main thing I wonder is how well this works, and how optimized it can get when the main purpose of looping is not to calculate functions but to access memory which is itself not fast.

--
Jesus saved me from my past. He can save you as well.

Some contradictions in TFA by Gopal.V · 2005-10-03 18:39 · Score: 4, Insightful

> is that for application software to take advantage of those multiple cores, programmers must structure
> their code for parallel processing, and that's difficult or impossible for some applications.
>
> "The industry is running into a programmability wall, passing the buck to software and hoping the programmer
> will be able to write codes for their systems," he says.

So you want the programmer to be unaware of the parallel processing. Then the article goes off and says something stupid IMHO.

> a huge amount of control logic, control transistors that don't do any work -- they just consume power. Trips is trying to push some of that complexity back up into the compiler

I thought the point of TRIPS was to make the chip do all the scheduling (ie the Data Flow architecture) rather than depend on the compiler generated sequence of instructions. As a hobbyist compiler dev, I'd like to note that the data flow architecture is the basis of all compiler optimizers (DAG), though the typical compiler dev is likely to use this input to allocate registers to minimize pipeline stalls. I admit that it can be done at the CPU level to some extent - then this is even stranger.

> Trips compiler sends executable code to the hardware in blocks of up to 128 instructions. The processor "sees" and executes a block all at once, as if it were a single instruction, greatly decreasing the overhead associated

Somehow this just shifts the hard work of peephole optimisation to the CPU to be done at real time. It would have been far better to do it in the compiler properly - something which needs extra memory and lots more processing than the code that is being executed.

All in all, I don't see this thing revolutionizing General purpose programming systems. Though what I call special purpose programming might be the way the future of programming might go - I'm no Gordon Moore.

--
Quidquid latine dictum sit, altum videtur

Re:I don't get it... by freidog · 2005-10-03 18:48 · Score: 2, Informative

only if the subsequent loops are dependant on data from the current loop.

something like
for(int i = n-1; i>0; i--){ n = n * i }

obviously the new value of n depends on the value for n calculated by the last loop so that might not be a good candidate to try and parallelize. (actually factorial is something that can be written to take advantage of instruction level parallelism (ILP), I choose not too simply for the example).

however, if you're doing something that is not dependant on previous loops, various forms of loop unrolling can exploit ILP.
take for example blending two images
for each row x and each column y, x++, y++
imageTarget[x][y] = 1/2 * imageSrc1[x][y] + 1/2 * imageSrc2[x][y]

one pixel does not depend on the result of the previous, there's no reason you can't do 2, 4, 8, 16 ect pixels inside each loop.
Some compilers can take advantage of this already in doing loop unrolling to utilize MMX or SSE (or similar SIMD instruction sets) instructions. It seems like Trips is an instruction set designed to aid the processor in finding and exploiting such ILP.
The usefullness of such massively parallel designs in general purpose computing is debatable I would say. On the whole there tend to be a lot more instructions with dependancies than those without. (obviously everything has some dependancies, I mean in such a manner that prevents ILP / loop unrolling).
Hardware has been moving towards more parallelism with super-scalar and multi-chip processing and more functional SIMD instruction sets, but software has gone only kicking and screaming into a more parallel world.
Athlon and Pentium 3, Pentium M can look at up to something like 14 x86 instructions and decode up to 3 of them per clock cycle. More often than not they can't find 3 suitable instructions to decode. I have a hard time believeing something is going to find 32 (16 per core, 2 cores on the prototype) for general purpose software.

Old idéas? by Anzya · 2005-10-03 18:55 · Score: 2, Insightful

This looks to me to be a combination of old and not so good idéas.
I have read about out of order execution and using data when ready at least 5 years ago in Hennesy and Pattersons book "Computer Architecture A Quantitative Approach". To me it sounds like a typical scoreboarding architecture.
And how he can claim that this will lead to less control logic someone else might be able to explain to me.
As for executing two instruction at once since their destination and value are the same sounds like a operation that will lead to more control logic. Besides doesnt most compilers optimize away these kinds of cases?

--
"This message was brought to you by Sarcasm and Troll Feeders United (or STFU, for you un-hip people)."

VLIW (superscalar) ? by silverbyte · 2005-10-03 18:56 · Score: 3, Interesting

IS it just me, or does this approach sound very similar to VLIW (http://en.wikipedia.org/wiki/VLIW) architecture. The problem is that the branch prediction needs to be very accurate, for any kind of performance boost.
Which is why these types of architecture lend very well to sequences of operations that are very similar (video processing, etc.).
Will this work just as well in the general-computing sphere? No idea.

Re:Boring (article, not project) by Rufus211 · 2005-10-03 19:10 · Score: 4, Informative

So after looking into their project page I realized I actually saw a presentation given by these people last year. The article makes this sound like something it completely is not. Basically it's a grid of functional units that can connect to their neighbors. You "program" the chip by telling node 1 and 2 to take inputs and invert them, then feed the output to node 3, which then multiplies the two inputs. Really it's a glorified DSP that has some interesting programmability. Their code analyzation to generate the DSP code and then schedule it across a 3d matrix (2d function array x time) will certainly be interesting.

What this is *not* in any form is a general purpose CPU. It won't boot linux, plain and simple. This is for doing stream data processing such as compression or HPC simulations. I seem to remember in their presentation showing a prototype doing software-radio at a data rate usable for 802.11.

Re:They have CS Programs in Texas? by Deflatamouse! · 2005-10-03 19:12 · Score: 2, Informative

Don't look down on the Texans. It has one of the highest ranked computer engineer programs in the country. I've heard of Doug Berger before and we have read his research papers and use his simulators (made between him and Todd Austin of Wisconsin) in our graduate classes at CMU (I'm BS&MS ECE, CS '01).

Austin also has a high number of tech companies around - heck, AMD, IBM, Intel, Freescale, just to name a few. It's nicknamed Silicon Hills. UT may not have the legacies like that of MIT, CMU, Berkeley, Stanford, but they got a heck of a program going on there and they are catching up. Hook'em Horns!

Re:They have CS Programs in Texas? by spauldo · 2005-10-03 19:52 · Score: 2, Insightful

Why wouldn't they have CS programs in Texas?

What, you think all they teach at Texas univiersities is agriculture and oil-related subjects?

Don't judge Texas until you've spent some time there. I hate the place, but I'm from Oklahoma where hating Texas is a requirement of citizenship.

--
Those who can't do, teach. Those who can't teach either, do tech support.

Parallel processing by pubjames · 2005-10-03 19:55 · Score: 4, Insightful

I had an interesting discussion with a chip designer the other day. We were talking about parallel processing, and I spouted the usual perceived wisdom "But isn't the problem with parallel processing that many problems are very difficult or impossible to do in parallel? And isn't programming in parallel really difficult?"

I found his answer very interesting, something like "that line of thinking comes from when computers weren't fast enough to do the basic things we wanted to do with them to do then. It's true, an application like a word processor is not a good problem to tackle with parallel processing - but we don't need to these days. Nearly all the stuff we want to do today - faster graphics, 3D video image and sound processing, processing massive amounts of data on the web, all the processing that goes into keeping the internet and telephone networks going - all of these problems are idea for parallel processing. What Google does - that's essentially parallel processing, isn't it?"

That kind of changed my perception of things and made me realise my mindset was way out of date.

Re:Parallel processing by TecKnow · 2005-10-03 22:26 · Score: 2, Informative

Keep in mind that Google doesn't rely on the newest, fastest chips, they instead rely on numerous inexpensive paralell systems. Read their tech page if you don't believe me. http://www.google.com/intl/en/corporate/tech.html so "processing massive amounts of data on the web" doesn't really apply. In fact there are no or nearly no problems in computer science that benifit from using multiple cores or CPU's in a way that can't be replicated using a cluster or grid approach. While advances are always welcome, this area doesn't exactly cry out for hardware innovation.
Paralellism-on-a-chip doesn't let us do anything we couldn't already do, and most applications that benifit from it are outside the domain of the general consumer. Faster graphics, sound, sure these things might benifit from on-a-chip paralellism, but consider how many PCs the average consumer has. Now consider the number of embedded processors they have weather they know it or not. Their vehicle, the HVAC system in their home, cell phones, radios, televisions and so on. Clearly, embedded processors vastly outnumber PC processors and, as I said, essentially none of these benifit from paralell computing.
Now lets consider the benfits of hardware advances in embedded systems/realtime technologies. The smaller and faster a DSP chip can be the smaller your cell phone can be, and the more information can be packed into a limited singal bandwidth, just as one example, Sounds good to me.
Now lets consider the benfiits of hardware advances in paralellism-on-a-chip. very few because corporations can string together many cheap PCs while outside of video games consumers don't benifit much from paralellism.
Considering the availablity of a cheap and effective substitute to paralellism on a chip, the relative prvalance of embedded systems, and the difference in potential gains from advances in each field, yes, I would say that any list that entirely discounts embedded/realtime systems is 'cherry-picked.'

TRIPS Homepage and original announcement by citanon · 2005-10-03 20:03 · Score: 3, Informative

http://www.cs.utexas.edu/users/cart/trips/index.ht ml

The original announcement came in 2003:

http://www.utexas.edu/opa/news/03newsreleases/nr_2 00308/nr_supercomputer030827.html

Call me bitter, but... by SoupIsGood+Food · 2005-10-03 20:16 · Score: 4, Interesting

It seems to me any serious research into microprocessors will be hampered by the fact that it will be completely inapplicable unless it dumbs itself down to ape the x86 instruction set. All current and future processor design advances will be defined as better and faster ways of making modern silicon pretend it's a member of a chip family that was obsolete when the first President Bush was in office. That's not progress. That's just kind of sad.

Heaven help any researcher if implementing their new chip design requires a new software paradigm that doesn't fit neatly into the OS/Application model, too. We're living in the perpetual now of 2000, and it's some boring shit. I want my future back.

Bah.

SoupIsGood Food

Re:The article is too high level by Deflatamouse! · 2005-10-03 20:20 · Score: 2, Interesting

This is so true. We have designs broken on paper that works perfectly fine in silicon. But of course, on paper we assume the worst case of most things and is probably overly pessimistic.

What ends up happening is that parts are cherry picked before they're sold (with the costs passed down to the customers) or that the parts are binned and sold at different levels such as the case for Intel chips.

Increasingly methods to improve yield rates drive some of the design decisions, sometimes even at the architectural level, especially as the processes continue to shrink.

Re:I don't get it... by kirinyaga · 2005-10-03 20:33 · Score: 2, Informative

actually, as I understand it, the following loop :

for(int i = n-1; i>0; i--){ n = n * i }

is probably internally transformed into the following grid in a 10-instructions TRIPS processor :

read n(transmitted as a & b) => decr a (transmitted as a & d) => comp a,0 => mul a,b (result transmitted as c)
=> decr d (transmitted as d & f) => comp d,0 => mul c,d (result transmitted as e)
=> decr f => comp f,0 => mul e,f

where a,b,c,d,e & f are buses wiring the instructions-grid cells to each other. Each instructions-grid cell can be viewed as a little processor without register that performs the instruction it has been programmed for as soon data is present on its inputs.

You can see in the previous example there is a fair amount of concurrence even with such a simple loop. The "new" thing is the loop unrolling is done by the hardware, not the compiler.

--
Kirinyaga

You've all got the wrong idea by Takahashi · 2005-10-03 20:54 · Score: 5, Interesting

This is not some boring super scaler! Nor is it some vector processor!

in fact this is a complete departure from a von Neuman architecture. The architecture is called a Dataflow architecture. In one sentence a dataflow architecture is one where instruction execution is based on the availability of the instructions inputs not a program counter.

The article does a very bad job at conveying the fact that this is a relatively new idea. Like most reporting they report something thats been in research for some time as a huge breakthrough without describing it at all. Instead its really just an incremental step in dataflow computing research.

I work in a lab at the University of Washington on another dataflow architecture. Its a really interesting idea but it will take some time to develop and you're not going to get one on your desk for some years to come.

I'd really like to know... by vought · 2005-10-03 21:01 · Score: 3, Funny

Is the guy who runs this machine named Captain Trips?

Re:I don't get it... by RootsLINUX · 2005-10-03 21:04 · Score: 4, Interesting

I recommend you read this paper. It gives a great overall picture of what TRIPS is all about and is actually really cool. (I read it about a year ago).

I am an ECE grad student at UT Austin so I know quite well of TRIPS. In fact I often speak with Doug Burger himself because he's the faculty advisor for the UT Marathon team, of which I am a member. (By the way, his name is "Burger" not "Berger"). I think TRIPS is an awesome concept and its exactly the kind of project that I wanted to be a part of when I became a grad student at UT. I also know Steve Keckler because I'm taking his advanced computer architecture course this semester, and we're actually spending a good chunk of time talking about TRIPS (course schedule).

--
Hero of Allacrost, a FOSS RPG for *NIX/*BSD/OS X/Win

Re:dumb question re: branch prediction by TecKnow · 2005-10-03 21:29 · Score: 3, Informative

Sure, I can try. All of this stuff about branch prediction is basically the result of something called 'pipelining.' The rational for pipelining goes something like this: an instruction on a modern computer chip is executed in several stages (fetch, decode, execute, and writeback, in an iconic sense) For any particular instruction you can't begin one stage before you've completed the previous stage. Different stages require different hardware on the chip, so in a non-pipelined CPU some parts of the chip are just sitting there much of the time, that is bad. The reigning solution to resolve this is pipelining. Each of the stages I listed above is segregated, and as an instruction exits one stage, another instruction begins that stage. This is all well and good except, what happens if the instruction being decoded depends on the results of the instruction being executed? The results are unknown, so do you sit and wait? You can get around this problem somewhat by complicating the chip a little to feed the results of in-process computations back to later instructions in the same pipeline that require them. But now you've got a branch, and you can't even tell what instruction to load next until you know what the condition on that branch is going to evaluate to, the best a chip can do in this case is guess (branch prediction) but if you're wrong you have to throw out all the speculative computations you did. Modern processors rely heavily on pipelining so an incorrect guess can set them back significantly, especially if they make a habbit of it.

Re:dumb question re: branch prediction by tlambert · 2005-10-03 21:36 · Score: 3, Interesting

Correct prediction keeps your instruction pipeline full. This is particularly important for code with long pipelines.

Incorrect prediction results in having to back out CPU state from the speculative execution that has already taken place (this is called "squashing" the mispredicted instructions), and effectively this loses the pipeline slots that were used to perform the mispredicted execution. From an outside perspective, these lost slots look like a pipeline latency.

(insert rude comment about GCC #pragma branch hinting and [lack of] basic block reordering to avoid cache busting on PPC here)

-- Terry

Re:Boring (article, not project) by Criffer · 2005-10-03 21:51 · Score: 3, Insightful

Really it's a glorified DSP that has some interesting programmability

Actually, it sounds more like an FPGA. And, since VHDL is turing-equivalent, it would actually be possible to compile C code (such as the Linux kernel) into a gate array and run it on such a chip.

better to take a look at.... by kwikrick · 2005-10-03 22:07 · Score: 3, Informative

the homepage for the TRIPS project: http://www.cs.utexas.edu/users/cart/trips/ because the article doesn't do a good job at explaining the idea, which I think is very interesting. It's not mere branch prediction these people are talking about, and it's more than dumb parallel processing. They are basically fragmenting programs into small dataflow networks.

--
assignment != equality != identity

A chance for pure functional languages to shine. by master_p · 2005-10-03 22:09 · Score: 2, Interesting

Pure functional programming languages will see a tremendous boost from architectures like Trips. In functional programming languages, variables are never assigned, thus making it possible for all parts of an expression to be executed simultaneously. With 128 instructions, it is possible that lots of algorithms that take lots of time when executed sequentially, will take constant time with this new architecture: matrix operations, quicksort, etc.

An easier to program Itanic-workalike? by UnapprovedThought · 2005-10-04 01:10 · Score: 2, Interesting

What this is *not* in any form is a general purpose CPU.

The article doesn't seem to agree:

One of the big challenges to becoming a mainstream commercial processor is compatibility with existing software and systems, especially x86 compatibility, Moore says. But one way to maintain compatibility would be to use Trips as a co-processor, he says. "The general-purpose [x86] processor could offload heavy tasks onto the co-processor while still handling legacy compatibility on its own."

So, it looks like they're trying to get Intel or AMD interested in producing a heterogeneous multi-core unit that includes their trippy core, in the hopes of keeping the number of cores (and their communications overhead) down to a minimum. Intel already has a form of (so-called) instruction-level parallelism with the Itanic, and it didn't work out too well (except maybe for crypto-heavy workloads). It's possible AMD will be mulling it over. One of the things they will have to worry about is whether a compiler can actually be written to use it, FTA:

... the Trips compiler sends executable code to the hardware in blocks of up to 128 instructions.

With 128 instructions to schedule at once, that might provide a chance to actually keep all of the processing units on the chip busy. With the Itanic, it was really a challenge to do that, since you had to pull two floating point instructions out from somewhere in every clock cycle, something that not all workloads could accomplish, and I can see the compiler writers going crazy trying to produce some sorts of ultimately self-defeating hacks trying get that accomplished :)

A vastly better site for information by jd · 2005-10-04 05:54 · Score: 2, Informative

The TRIPS homepage has nine published papers on how this design will work and a schematic diagram of what they're expecting the design to end up looking like. They are also promising simulators and compilers later this year.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Slashdot Mirror

Next Generation Chip Research

38 of 174 comments (clear)