Intel Talks 1000-Core Processors

Jeez... by Joe+Snipe · 2010-11-21 18:54 · Score: 5, Funny

I hope he never works for Gillette.

--
Sometimes, life itself is sarcasm...

Re:Jeez... by Monkey-Man2000 · 2010-11-21 19:10 · Score: 3, Funny

I hope he never works for Gillette.
Obligatory Onion

--
This post was generated by a Cadre of Uber Monkeys for Monkey-Man2000 (603495).

Re:Jeez... by monkeySauce · 2010-11-21 19:15 · Score: 3, Funny

Other way around; he used to work for Gillette. He left after they cancelled his 1000-blade razor project.

Re:Jeez... by Slashcrunch · 2010-11-21 19:43 · Score: 4, Funny

Other way around; he used to work for Gillette. He left after they cancelled his 1000-blade razor project.
Yes, I also heard about the 1000-blade project getting cut...

Re:Jeez... by Bill+Dog · 2010-11-21 21:15 · Score: 4, Funny

...in the nick of time.

--
Attention zealots and haters: 00100 00100

Message passing between cores? Hmm... by PaulBu · 2010-11-21 19:00 · Score: 3, Interesting

Are they trying to reinvent Transputer? :)

But yes, I am happy to see Intel pushing it forward!

Paul B.

Re:Message passing between cores? Hmm... by Anne+Thwacks · 2010-11-21 23:35 · Score: 3, Informative

They were, apart from the comms protocol, which was a pile of poo.
IF YOU GOT A COMMS ERROR, THE ONLY RECOVERY MECHANISM WAS A TOTAL SYSTEM REBOOT.
That is as crap as you can get! TCP/IP might be an improvement, but HDLC would have cracked the Transputer's problems, and it was already over 15 years old when the Transputer was invented.
Yes I did build a Transputer based system, and yes it did work. (but...)

--
Sent from my ASR33 using ASCII

Could be good for games using raytracing by mentil · 2010-11-21 19:01 · Score: 4, Insightful

This is for server/enterprise usage, not consumer usage. That said, it could scale to the number of cores necessary to make realtime raytracing work at 60fps for computer games. Raytracing could be the killer app for cloud gaming services like OnLive, where the power to do it is unavailable for consumer computers, or prohibitively expensive. The only way Microsoft etc. would be able to have comparable graphics in a console in the next few years is if it were rental-only like the Neo-Geo originally was.

--
Corruption is convincing someone that the selfless ideal is the same as their selfish ideal.

Bring out your Memes! by SixDimensionalArray · 2010-11-21 19:07 · Score: 4, Funny

Imagine a Beowulf cluster of th^H^H^H

Ah, forget it, the darn thing practically is one already! :/

"Imagine exactly ONE of those" just doesn't sound the same.

Re:Bring out your Memes! by rrohbeck · 2010-11-21 20:55 · Score: 3, Funny

I've said it for years: 640K cores ought to be enough for anybody.

--
thegodmovie.com - watch it

accurate representation by pyronordicman · 2010-11-21 19:12 · Score: 5, Interesting

Having attended this presentation at Supercomputing 2010, for once I can say without a doubt that the article captured the essence of reality. The only part it left out is that the interconnect between all the processing elements uses significantly less energy than that of the previous 80-core chip; I think the figure was around 10% of chip power for the 48-core, and 30% for the 80-core. Oh, and MPI over TCP/IP was faster than the native message passing scheme for large messages.

Future of Programming by igreaterthanu · 2010-11-21 19:31 · Score: 4, Interesting

This just goes to show that if you care about having a future career (or even just continuing with your existing one) in programming, Learn a functional language NOW!

--
I dream of a nation where a man is not judged by his skin color but by a number assigned by a credit rating agency.

Re:Future of Programming by Anonymous Coward · 2010-11-21 20:43 · Score: 5, Insightful

Learn a functional language. Learn it not for some practical reason. Learn it because having another view will give you interesting choices even when writing in imperative languages. Every serious programmer should try to look at the important paradigms so that he can freely choose to use them where appropriate.

Re:Biggest Hurdle Not Cores by Anonymous Coward · 2010-11-21 19:38 · Score: 3, Insightful

Basically, we are going to need compilers that automatically take advantage of all that parallelism without making you think about it too much, and programming languages that are designed to make your programs parallel-friendly. Even Microsoft is finally starting to edge in this direction with F# and some new features of .NET 4.0. Look at Haskell and Erlang for examples of languages that take such things more seriously, even if the world takes them less seriously.

I don't know about AI, but almost certainly we will end up with both compilers and virtual machines that are aware of parallelism and try to take advantage of it whenever possible.

But still, certain algorithms just aren't very friendly to parallelism no matter what technology you apply to them.

Re:does it run Linux - yea but it is "boring" by RAMMS+EIN · 2010-11-21 19:45 · Score: 4, Interesting

Running Linux on a 48-core system is boring, because it has already been run on a 64-core system in 2007 (at the time, Tilera said they would be up to 1000 cores in 2014; they're up to 100 cores per CPU now).

As far as I know, Linux currently supports up to 256 CPUs. I assume that means logical CPUs, so that, for example, this would support one CPU with 256 cores, or one CPU with 128 cores with two CPU threads per core, etc.

--
Please correct me if I got my facts wrong.

1000 cores is easy! by Jason+Kimball · 2010-11-21 19:46 · Score: 5, Funny

1000 cores on a chip isn't too bad. I already have one with 110 cores.

That's only 10 more cores!

Re:1000 cores is easy! by roman_mir · 2010-11-22 00:26 · Score: 3, Funny

That has got to be the funniest thing I've read here in a month.
- jesus, that must have been one sad month.

--
You can't handle the truth.

Instruction set... by KonoWatakushi · 2010-11-21 19:47 · Score: 3, Insightful

"Performance on this chip is not interesting," Mattson said. It uses a standard x86 instruction set.

How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel.

There is no point in tying these massively parallel architectures to some ancient ISA.

Re:Instruction set... by kohaku · 2010-11-21 21:47 · Score: 4, Insightful

There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.
Uh, scalable? Not really... The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.

The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.
Whoa nelly. Compact? I'm not sure where you got that idea, but it's called CISC and not RISC for a reason! If you think x86 is compact, you might be interested to find out that you can have a fifteen byte instruction. In fact, on the i7 line, the instructions are so complex it's not even worth writing a "real" decoder- they're translated in real-time into a RISC instruction set! If Intel would just abandon x86, they could reduce their cores by something like 50%!
The low number of registers _IS_ a problem. The only reason there are only four is because of backwards compatibility. It definitely is a problem for scalability: one cannot simply rely on a shared memory architecture to scale vertically indefinitely. You just use too much power as die size increases, and memory just doesn't scale up as fast as the number of transistors on a CPU.
A far better approach is to have a decent model of parallelism (CSP, Pi-calculus, Ambient calculus) underlying the architecture and to provide a simple architecture with primitives supporting features of these calculi, such as channel communication. There are plenty of startups doing things like this, not just Intel, and they already have products on the market, though not desktop processors. Picochip and Icera, to name just a couple, not to mention things like GPGPU (Fermi, etc.).
Really, the way to go is small, simple, low power cores with on-chip networks which can scale up MUCH better than just the old Intel method of "More transistors, increase clock speed, bigger cache".

Re:Instruction set... by Arlet · 2010-11-21 22:35 · Score: 3, Insightful

The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.
That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.

you might be interested to find out that you can have a fifteen byte instruction
So? It's not the maximum instruction length that counts, but the average. In typical programs that's closer to three bytes. Frequently used opcodes like push/pop only take a single byte. Compare that to the DEC Alpha architecture, where nearly every single instruction uses 15 bits just to tell which registers are used, no matter whether a function needs that many registers.

If Intel would just abandon x86, they could reduce their cores by something like 50%!
Even if that's true (I doubt it), who cares? The problem is not that Intel has too many transistors for a given area. The problem is just the opposite. They have the capability to put more transistors in a core than they know what to do with. Also, typically half the chip is for the cache memories, and the compact instruction set helps to use that cache memory more effectively.

one cannot simply rely on a shared memory architecture to scale vertically indefinitely
Sure you can. Shared memory architectures can do everything explicit channel communication architectures can do, plus you have the benefit that the communication details are hidden from the programmer, allowing improvements to the implementation without having to rewrite your software. Sure, the hardware is more complex, but transistors are dirt cheap, so I'd rather put the complexity in the hardware.

Re:Instruction set... by kohaku · 2010-11-21 23:24 · Score: 3, Interesting

That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.
It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed" which isn't even a viable approach anymore.
When you said "Compact" I assumed you meant the instruction set itself was compact rather than the average length- I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially considered against things like ARM's Thumb-2.
If you take a look at Nehalem's pipeline, there's a significant chunk of it simply dedicated to translating x86 instructions into RISC uops, which is only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
Sure you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of on-chip networks: they provide NUMA on Nehalem and whatever AMD have these days. Indeed, in the article, it says:

Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, and Clearspeed's 2-year-old CSX700 had 192.

Re:Instruction set... by Arlet · 2010-11-21 23:49 · Score: 4, Interesting

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores
Nobody wants to put more cores on a die, but they're forced to do so because they reach the limits of a single core. I'd rather have as few cores as possible, but have each one be really powerful. Once multiple cores are required, I'd want them to stretch the coherent shared memory concept as far as it will go. When that concept doesn't scale anymore, use something like NUMA.
Small, message passing cores have been tried multiple times, and they've always failed. The problem is that the requirement of distributed state coherency doesn't go away. The burden only gets shifted from the hardware to the software, where it is just as hard to accomplish, but much slower. In addition, if you try to tackle the coherency problem in software, you don't get to benefit from hardware improvements.

Re:Instruction set... by kohaku · 2010-11-22 00:33 · Score: 3, Interesting

they're forced to do so because they reach the limits of a single core
Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger, you have to go parallel. Even the Intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
Essentially there are two approaches to the problem of performance now. Both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few, high performance cores that won't scale horizontally (memory bottleneck).
The second is to have explicit hardware-supported parallelism with many many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message passing cores have failed. I've already given examples of ones currently on the market (Clearspeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable, in fact it was done with the Transputer in the late 80s/early 90s. The only reason it's taking off now is because it's the only way forward as we hit the power wall, and shared memory/superscalar can't scale as fast to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is because they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatibility. In fact they did rather well in i/o control for things such as robotics, and XMOS continues to do well in that space.
The "coherency problem" isn't even part of a message passing architecture because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture in the same way as a shared memory one.

1000 cores is nothing by Anonymous Coward · 2010-11-21 19:55 · Score: 5, Interesting

In the future, 1 million cores will probably be the minimum requirement for applications. We will then laugh at these stupid comments...

Image and audio recognition, true artificial intelligence, handling data from huge numbers of different kinds of sensors, movement of motors (robots), data connections to everything around the computer, virtual worlds with thousands of AI characters with true 3D presentation... etc...etc... will consume all processing power available.

1000 cores is nothing... We need much more.

Re:Workaround, yeah by wierd_w · 2010-11-21 20:19 · Score: 5, Informative

You've obviously never worked in Aerospace.

I can bring a quad core Xeon system to its knees running Catia. (I mean, 100% saturation, all 4 cores, with IO contention.) I do it fairly regularly too.

Might have something to do with the NP-Hard problem of resolving tangencies on extremely complex NURBS surfaces (aircraft skins).

Granted, that is not a "normal" workstation; but I would be VERY happy indeed to have a 1000 core workstation at my disposal. Maybe then I could actually work with Gulfstream's horrible part models where they include literally the whole god-damn aircraft's surface geometry in the digital part model for a fucking bolt. (Guess what happens when you load several such models, and digitally assemble them. I have seen a 64-bit workstation allocate over 8 GB of swap because of them and their dumbassery.)

Now, if I could get one with over 1TB of RAM installed too, then I'd be in business.

"Build it and they will come" - NOT by Animats · 2010-11-21 20:34 · Score: 4, Informative

It's an interesting machine. It's a shared-memory multiprocessor without cache coherency. So one way to use it is to allocate disjoint memory to each CPU and run it as a cluster. As the article points out, that is "uninteresting", but at least it's something that's known to work.

Doing something fancier requires a new OS, one that manages clusters, not individual machines. One of the major hypervisors, like Xen, might be a good base for that. Xen already knows how to manage a large number of virtual machines. Managing a large number of real machines with semi-shared memory isn't that big a leap. But that just manages the thing as a cluster. It doesn't exploit the intercommunication.

Intel calls this "A Platform for Software Innovation". What that means is "we have no clue how to program this thing effectively. Maybe academia can figure it out". The last time they tried that, the result was the Itanium.

Historically, there have been far too many supercomputer architectures roughly like this, and they've all been duds. The NCube Hypercube, the Transputer, and the BBN Butterfly come to mind. The Cell machines almost fall into this category. There's no problem building the hardware. It's just not very useful, really tough to program, and the software is too closely tied to a very specific hardware architecture.

Shared-memory multiprocessors with cache coherency have already reached 256 CPUs. You can even run Windows Server or Linux on them. The headaches of dealing with non-cache-coherent memory may not be worth it.

Re:Imagine by seifried · 2010-11-21 20:37 · Score: 5, Informative

Linux can only go to 256 cores.

Uhmm no.

./arch/ia64/Kconfig: int "Maximum number of CPUs (2-4096)"
./arch/powerpc/platforms/Kconfig.cputype: int "Maximum number of CPUs (2-8192)"

In x86 we have:

config MAXSMP
	bool "Enable Maximum number of SMP Processors and NUMA Nodes"
	depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL

And I believe you can crank that dial all the way up

Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).
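As a rough check on the 7-14 year figure, here is a minimal sketch of the doubling arithmetic. It assumes a starting point of 8 cores and one doubling every one to two years, as stated above; the function name is made up for illustration.

```python
import math


def years_to_reach(target_cores, current_cores, years_per_doubling):
    """Years until the core count reaches the target, assuming steady doubling."""
    doublings = math.log2(target_cores / current_cores)
    return doublings * years_per_doubling


# 8 -> 1024 cores is 7 doublings.
fast = years_to_reach(1024, 8, years_per_doubling=1)
slow = years_to_reach(1024, 8, years_per_doubling=2)
print(f"between {fast:.0f} and {slow:.0f} years")
```

Seven doublings at one to two years each gives the 7-14 year range quoted above.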

Re:Temperature? by c0lo · 2010-11-21 20:40 · Score: 4, Interesting

Dude, what the fuck, that's only 48 cores. How does that get you anywhere close to 1000?

Well, Watson, that's elementary...

The correct question should have been: "How many watts does one need to dissipate?"... because the temperature is set by "how hot can the transistors get and still work".
With regard to power dissipation: the architecture would have a common component (event passing, RAM fetches, etc.) and N cores. Assuming each core needs to dissipate the same power (say, at peak utilization) and assuming the 25-125 Watts is the range defined by "1 core used" to "all 48 cores used", some simple linear algebra gives: power dissipated/core approx 2 watts (a bit more actually) with the "common component" eating approx 23 Watts.
Therefore, on top of the computation benefits derived from fully utilizing 1000 cores, one would have a pretty good heat source: 2150 Watts or so. One's choice what to do with it, but it's far too high for a domestic-sized slow cooker (the dishes would come out with a weird burned taste).
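The "simple linear algebra" can be spelled out as a two-point linear fit. This is only a sketch under the same assumptions as the estimate above (25 W with one core busy, 125 W with all 48 busy, identical per-core power); the function name is hypothetical.

```python
def fit_power_model(p_one_core, p_all_cores, n_cores):
    """Solve p = common + k * per_core from the k=1 and k=n_cores data points."""
    per_core = (p_all_cores - p_one_core) / (n_cores - 1)
    common = p_one_core - per_core
    return common, per_core


common, per_core = fit_power_model(25.0, 125.0, 48)
total_1000 = common + 1000 * per_core  # extrapolate linearly to 1000 cores

print(f"per core: {per_core:.2f} W, common: {common:.2f} W, "
      f"1000 cores: {total_1000:.0f} W")
```

The extrapolation treats the common component as fixed, which a real 1000-core interconnect almost certainly would not be.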

Satisfied, now?

If not, to put things in perspective, assuming our ancestors (who could use only horses as a source of power) had wanted to use this computer, they'd need approx. 2.68 horses... but hey, wow... what a delight to play the MMORPG so smooth... especially in "farming/grinding" phases.

PS. the above computations are meant to be funny and/or an exercise of approximating based on insufficient data and/or vent some frustration caused by "all work and no play", definitely a wasted time... Ah, yes, some karma would be nice, but not mandatory.

--
Questions raise, answers kill. Raise questions to stay alive.

I/O and memory bandwidth by francium+de+neobie · 2010-11-21 20:44 · Score: 3, Insightful

Ok, you can cram 1000 cores into one CPU chip - but feeding all 1000 CPU cores with enough data for them to process and transferring all the data they spit out is gonna be a big problem. Things like OpenCL work now because the high end GPUs these days have 100GB/s+ bandwidth to the local video memory chips, and you're only pulling out the result back into system memory after the GPU did all the hard work. But doing the same thing on a system level - you're gonna have problems with your usual DDR3 modules, your SSD hard disk (even PCI-E based) and your 10GE network interface.

Remember the last couple of times this happened? by Required+Snark · 2010-11-21 21:02 · Score: 5, Informative

This is at least the third time that Intel has said that it is going to change the way computing is done.

The first time was the i432 http://en.wikipedia.org/wiki/Intel_iAPX_432 Anyone remember that hype? Got to love the first line of the Wikipedia article "The Intel iAPX 432 was a commercially unsuccessful 32-bit microprocessor architecture, introduced in 1981."

The second time was the Itanium (aka Itanic) that was going to bring VLIW to the masses. Check out some of the juicy parts of the timeline also over on Wikipedia http://en.wikipedia.org/wiki/Itanium#Timeline

1997 June: IDC predicts IA-64 systems sales will reach $38bn/yr by 2001

1998 June: IDC predicts IA-64 systems sales will reach $30bn/yr by 2001

1999 October: the term Itanic is first used in The Register

2000 June: IDC predicts Itanium systems sales will reach $25bn/yr by 2003

2001 June: IDC predicts Itanium systems sales will reach $15bn/yr by 2004

2001 October: IDC predicts Itanium systems sales will reach $12bn/yr by the end of 2004

2002 IDC predicts Itanium systems sales will reach $5bn/yr by end 2004

2003 IDC predicts Itanium systems sales will reach $9bn/yr by end 2007

2003 April: AMD releases Opteron, the first processor with x86-64 extensions

2004 June: Intel releases its first processor with x86-64 extensions, a Xeon processor codenamed "Nocona"

2004 December: Itanium system sales for 2004 reach $1.4bn

2005 February: IBM server design drops Itanium support

2005 September: Dell exits the Itanium business

2005 October: Itanium server sales reach $619M/quarter in the third quarter.

2006 February: IDC predicts Itanium systems sales will reach $6.6bn/yr by 2009

2007 November: Intel renames the family from Itanium 2 back to Itanium.

2009 December: Red Hat announces that it is dropping support for Itanium in the next release of its enterprise OS

2010 April: Microsoft announces phase-out of support for Itanium.

So how do you think it will go this time?

--
Why is Snark Required?

Re:Imagine by pyalot · 2010-11-21 21:33 · Score: 4, Informative

You have a supercomputer on your desk right now. It's called a "GPU", and most likely, it sports many hundred cores. Oh, and the killer app you mean, that's whatever latest DX11/OpenGL 4 game you prefer.

--
Experiments and other stuff

Re:does it run Linux - yea but it is "boring" by vojtech · 2010-11-21 21:34 · Score: 4, Informative

The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86. And there are even a number of non-x86 machines today that reach these sizes in a cache-coherent (ccNUMA) manner that Linux works well on. You still have to be careful with application design, though, because it's fairly easy to hit bottlenecks either in the application or in the kernel that will limit scalability. Most common workloads are already seeing

Re:Workaround, yeah by MichaelSmith · 2010-11-21 22:34 · Score: 4, Interesting

In my field it would be real time conflict detection between aircraft. The better your conflict detection, the more aircraft you can pack in to small volumes of space. There is a lot of money in that.

--
http://michaelsmith.id.au

Re:Imagine by bertok · 2010-11-21 23:10 · Score: 4, Interesting

depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL

And I believe you can crank that dial all the way up

Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).

Don't you worry, the GHz war is not done!

There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.

Re:Imagine by TheRaven64 · 2010-11-21 23:55 · Score: 4, Interesting

Pretty much anything that I've written in Erlang uses (at least) a few thousand concurrent processes. I've never tried running it on more than a 64-core machine, but when I moved stuff from my single-core laptop to a 64-core SGI machine the load was pretty evenly distributed.

It's pretty easy to write concurrent code that scales as long as you respect one rule: No data may be both mutable and aliased. You can do this in object-oriented languages with the actor model, but languages like Erlang enforce it for you (at the cost of a few redundant copies).
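A minimal sketch of that rule using the actor model in Python (standard `threading` and `queue` only; the class and function names are made up). The mutable counter is owned by exactly one thread, and everything that crosses a thread boundary is an immutable tuple. Python threads won't give real parallelism here, so this only illustrates the structure, not the scaling.

```python
import queue
import threading


class CounterActor(threading.Thread):
    """Owns its state; other threads can only send it messages."""

    def __init__(self):
        super().__init__()
        self.mailbox = queue.Queue()
        self._count = 0  # mutable, but never aliased outside this thread

    def run(self):
        while True:
            msg, reply_to = self.mailbox.get()
            if msg == "stop":
                break
            if msg == "incr":
                self._count += 1
            elif msg == "get":
                reply_to.put(self._count)  # send a copy of the value


def demo(n_senders=4, n_msgs=1000):
    actor = CounterActor()
    actor.start()

    def sender():
        for _ in range(n_msgs):
            actor.mailbox.put(("incr", None))

    senders = [threading.Thread(target=sender) for _ in range(n_senders)]
    for s in senders:
        s.start()
    for s in senders:
        s.join()

    reply = queue.Queue()
    actor.mailbox.put(("get", reply))  # queued after every "incr"
    result = reply.get()
    actor.mailbox.put(("stop", None))
    actor.join()
    return result


print(demo())
```

No lock appears anywhere in the user code: the mailbox serializes access, which is roughly what Erlang gives you by construction.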

--
I am TheRaven on Soylent News

In the near future... by rebelwarlock · 2010-11-21 23:59 · Score: 4, Funny

I will need to buy a pair of sunglasses, and crush them when I find that the new Intel processor has over 9000 cores.

Re:Why is 8192 a hard limit? by TheRaven64 · 2010-11-22 00:36 · Score: 3, Informative

The kernel needs some data structures per processor. 8192 means it needs a 13-bit index for them. I'm not certain about the Linux kernel, but in other kernels it's quite common for this to be squeezed in to other values for various reasons, so adding more processors requires you to increase the size of other data structures (often ones designed to be exactly one word long). Not impossible, but more effort than just changing a constant.

The reason for the limit in the Windows NT kernel is that various things use bit masks with processor IDs as the indexes. For example, when defining a processor affinity set you have an n-bit bitfield (one bit per supported processor), with the bit set if the thread is allowed to run on that processor. At 256 bits (the current limit for Windows), these are already pretty large to scan (especially since the kernel isn't allowed to use SSE instructions, meaning that it's potentially got to be four 64-bit lsb-tests to find the next core to use).
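To make the bitmask scan concrete, here is a small sketch in Python. It is an illustration of the idea only, not how any real kernel stores affinity; the helper names and the 256-CPU / 64-bit-word split are assumptions taken from the numbers above.

```python
WORD_BITS = 64  # assumed machine word size


def make_affinity(allowed_cpus, n_cpus=256):
    """Build an n_cpus-bit mask with one bit set per allowed processor."""
    mask = 0
    for cpu in allowed_cpus:
        if not 0 <= cpu < n_cpus:
            raise ValueError(f"cpu {cpu} out of range")
        mask |= 1 << cpu
    return mask


def next_allowed_cpu(mask, n_cpus=256):
    """Scan the mask one 64-bit word at a time; return the lowest allowed CPU."""
    word_mask = (1 << WORD_BITS) - 1
    for word_index in range(n_cpus // WORD_BITS):  # up to 4 tests for 256 CPUs
        word = (mask >> (word_index * WORD_BITS)) & word_mask
        if word:
            lowest_bit = (word & -word).bit_length() - 1  # index of lowest set bit
            return word_index * WORD_BITS + lowest_bit
    return None


mask = make_affinity([70, 200])
print(next_allowed_cpu(mask))
```

Doubling the supported CPU count doubles the number of words to test in the worst case, which is the scanning cost described above.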

--
I am TheRaven on Soylent News

For us not at SC10 by Eladith · 2010-11-22 01:10 · Score: 3, Informative

The paper referenced in the article can be found here.

Fascinating that MPI works that well unmodified.

Re:Imagine by chrysrobyn · 2010-11-22 01:37 · Score: 5, Interesting

Don't you worry, the GHz war is not done! There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.

The GHz war is over. The speed of light won. A long time ago, it stopped being "all about the transistor" and started being "all about the wires". IBM won the race to copper in 180nm (back when it was 0.18um), and that helped make those technologies even better, but about the time we hit 90nm, semiconductors were "fast enough", or even by some measurements stopped being able to speed up. Since then, almost all speed increases have been largely (but not exclusively) due to the transistors getting smaller, reducing the distance wires need to go.

The RC delay of wires is the major problem. R isn't going to be getting much better than copper. Silver has a lower resistance by a little bit, but it's too reactive to be used anywhere real. In these geometries, any alloy would be insufficiently mixable to be reliable, to say nothing about more exotic materials (like ceramics). There's some room for improvement in the dielectric (the "C"), but by the time you make a box with corners covering water permeability, thermal coefficient of expansion close to the wires, mechanical properties friendly to sub micron manufacturing, you have to concede you're not going to be able to get more than 20% faster there (and that we could dispute separately).
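A back-of-the-envelope sketch of why the wire dominates, using the common first-order approximation for a distributed RC line (delay ≈ 0.38·r·c·L²). The resistivities are the usual room-temperature bulk values; the wire cross-section and capacitance per metre are made-up illustrative numbers, and bulk resistivity understates what narrow on-chip wires actually see.

```python
RHO_COPPER = 1.68e-8  # ohm*m, bulk value at about 20 C
RHO_SILVER = 1.59e-8  # ohm*m, bulk value at about 20 C


def wire_delay(resistivity, length, cross_section, cap_per_metre):
    """First-order delay of a distributed RC wire, in seconds."""
    r_per_metre = resistivity / cross_section
    return 0.38 * r_per_metre * cap_per_metre * length ** 2


# Hypothetical geometry: 100 nm x 100 nm wire, 0.2 fF per micron.
AREA = 100e-9 * 100e-9
CAP = 2e-10

cu_1mm = wire_delay(RHO_COPPER, 1e-3, AREA, CAP)
cu_2mm = wire_delay(RHO_COPPER, 2e-3, AREA, CAP)
ag_1mm = wire_delay(RHO_SILVER, 1e-3, AREA, CAP)

print(f"copper, 1 mm: {cu_1mm * 1e12:.0f} ps")
print(f"doubling the length multiplies delay by {cu_2mm / cu_1mm:.1f}")
print(f"silver instead of copper: {ag_1mm / cu_1mm:.3f}x the delay")
```

Under these assumptions, shortening the wire helps quadratically while swapping the metal buys only a few percent, which matches the point that shrinking distances is where the speed came from.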

Take a cache. The slowest path is having a memory cell read. That tiny little device needs to have a measurable change in voltage on the bitlines, and be sensed by a sensing structure. That sensing structure has nothing to do with storage, so it's pure overhead and thus you want as few of them as possible. Can you have it 16 bits away? 32? The days are gone that it was 64 bits away for any meaningful performance. There's nothing you can do to the characteristics of that little device (which needs to be minimum feature size to maximize the density of the cache) to dominate over the characteristics of the bitline it's trying to affect.

Take a data path. Even if 95% of your data is highly predictable, easily pipelined stuff with local signals, your critical path is going to involve signals from other areas of the chip, and they're going to have to be rebuffered and trucked from hundreds of microns away. No giant buffer in the history of man can dominate over a long distance wire. The signal will show up "eventually".

3GHz is a good place to stop. We make it to 4GHz with compromises in power, but beyond that you're dedicating so much of your chip to rebuffering that you're blowing a lot of power on it. At that point, your pipeline is so many stages that branch mispredicts are very painful. You're devoting so much of your cycle time to setup and holds for your latches that you're going to be embarrassed at how little work you can do in each cycle.

1 THz clock speeds are on their way, and maybe even higher. But they're not useful to CPUs or GPUs. They're useful for more exotic applications, primarily technology demonstrations.
