Intel Talks 1000-Core Processors

Message passing between cores? Hmm... by PaulBu · 2010-11-21 19:00 · Score: 3, Interesting

Are they trying to reinvent the Transputer? :)

But yes, I am happy to see Intel pushing it forward!

Paul B.

Re:Message passing between cores? Hmm... by TinkersDamn · 2010-11-21 20:37 · Score: 2, Interesting

Yes, I've been wondering the same thing. Transputers contained key ideas that seem to be coming around again...
But a more crucial question might be how much heat you can handle on one chip. These guys are already at 25-125 watts, likely depending on how many cores are actually turned on. After all, they're playing pretty hefty heat-management tricks on current i7s and Phenoms.
http://techreport.com/articles.x/15818/2
What use are 48 cores, let alone 1000, if they're all being slowed down to 50% or whatever by heat and power juggling?

accurate representation by pyronordicman · 2010-11-21 19:12 · Score: 5, Interesting

Having attended this presentation at Supercomputing 2010, I can say for once, without a doubt, that the article captured the essence of reality. The only part it left out is that the interconnect between the processing elements uses significantly less energy than that of the previous 80-core chip; I think the figure was around 10% of chip power for the 48-core chip and 30% for the 80-core one. Oh, and MPI over TCP/IP was faster than the native message-passing scheme for large messages.

gpu's have been doing this for years... by Anonymous Coward · 2010-11-21 19:23 · Score: 1, Interesting

Given that for years GPUs have had hundreds of processors (the power of CUDA is awesome!), this is long overdue from lazy CPU designers like Intel....

Future of Programming by igreaterthanu · 2010-11-21 19:31 · Score: 4, Interesting

This just goes to show that if you care about having a future career (or even just continuing with your existing one) in programming, learn a functional language NOW!

--
I dream of a nation where a man is not judged by his skin color but by a number assigned by a credit rating agency.

Re:Future of Programming by jamesswift · 2010-11-21 20:37 · Score: 2, Interesting

It's quite something, isn't it, how so few people even on Slashdot seem to get this. Old habits die hard, I guess.
Years ago a clever friend of mine clued me into how functional was going to be important.
He was so right and the real solutions to concurrency (note, not parallelism which is easy enough in imperative) are in the world of FP or at least mostly FP.
My personal favourite so far is Clojure which has the most comprehensive and realistic approach to concurrency I've seen yet in a language ready for real world work.
The key thing to learn from it is how differently you need to approach your problem to take advantage of a multi-core world.
Clojure itself may never become a top-5 language, but the way it approaches the problem will surely be seen in other future FP langs.

--
i wish i could stop

Re:Imagine by JWSmythe · 2010-11-21 19:33 · Score: 2, Interesting

Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.

I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available. Linux can only go to 256 cores. Windows 2008 tops out at 64. But hey, if they did come to market, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).

--
Serious? Seriousness is well above my pay grade.

Re:does it run Linux - yea but it is "boring" by RAMMS+EIN · 2010-11-21 19:45 · Score: 4, Interesting

Running Linux on a 48-core system is boring, because it has already been run on a 64-core system in 2007 (at the time, Tilera said they would be up to 1000 cores in 2014; they're up to 100 cores per CPU now).

As far as I know, Linux currently supports up to 256 CPUs. I assume that means logical CPUs, so that, for example, this would support one CPU with 256 cores, or one CPU with 128 cores with two CPU threads per core, etc.

--
Please correct me if I got my facts wrong.

1000 cores is nothing by Anonymous Coward · 2010-11-21 19:55 · Score: 5, Interesting

In the future, 1 million cores will probably be the minimum requirement for applications. We will then laugh at these stupid comments...

Image and audio recognition, true artificial intelligence, handling data from a huge number of different kinds of sensors, movement of motors (robots), data connections to everything around the computer, virtual worlds with thousands of AI characters with true 3D presentation... etc... etc... will consume all the processing power available.

1000 cores is nothing... We need much more.

Re:Temperature? by c0lo · 2010-11-21 20:40 · Score: 4, Interesting

Dude, what the fuck, that's only 48 cores. How does that get you anywhere close to 1000?

Well, Watson, that's elementary...

The correct question should have been "How many watts does one need to dissipate?", because the temperature is set by "how hot can it get and still have the transistors working".
As for power dissipation: the architecture would have a common component (event passing, RAM fetches, etc.) and N cores. Assuming each core needs to dissipate the same power (say, at peak utilization), and assuming the 25-125 watt range runs from "1 core used" to "all 48 cores used", some simple linear algebra gives roughly 2 watts dissipated per core (a bit more, actually), with the "common component" eating approx. 23 watts.
Therefore, on top of the computational benefits derived from fully utilizing 1000 cores, one would have a pretty good heat source: 2150 watts or so. It's one's own choice what to do with it, but it's far too high for a domestic-sized slow cooker (the dishes would come out with a weird burned taste).
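
The back-of-envelope fit above can be reproduced in a few lines. This is only a sketch under the same assumptions as the post: power is linear in the number of active cores, 25 W corresponds to 1 active core, and 125 W to all 48. The function names are made up for illustration.

```python
def fit_linear_power(n_low, p_low, n_high, p_high):
    """Solve p = common + n * per_core from two (cores, watts) points."""
    per_core = (p_high - p_low) / (n_high - n_low)
    common = p_low - n_low * per_core
    return common, per_core


def chip_power(n_cores, common, per_core):
    """Extrapolate total power for n_cores under the linear model."""
    return common + n_cores * per_core


# Assumption from the post: 25 W with 1 core active, 125 W with 48.
common, per_core = fit_linear_power(1, 25.0, 48, 125.0)
print(f"per core: {per_core:.2f} W, common: {common:.2f} W")
print(f"1000 cores: {chip_power(1000, common, per_core):.0f} W")
```

The two data points give 100/47 W per core, so the extrapolation is only as good as the guess that the 25 W figure really means a single active core.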

Satisfied, now?

If not, to put things in perspective: assuming our ancestors (who could use only horses as a source of power) had wanted to use this computer, they'd have needed approx. 2.88 horses... but hey, wow... what a delight to play the MMORPG so smoothly... especially in the "farming/grinding" phases.

PS. The above computations are meant to be funny, and/or an exercise in approximating from insufficient data, and/or a way to vent some frustration caused by "all work and no play"; definitely wasted time... Ah, yes, some karma would be nice, but it's not mandatory.

--
Questions raise, answers kill. Raise questions to stay alive.

Re:Imagine by visualight · 2010-11-21 20:58 · Score: 2, Interesting

http://www.sgi.com/products/servers/altix/uv/

2,048 cores (256 sockets) and 16TB of memory, one OS image.

--
Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.

cue kilocore debates by bingoUV · 2010-11-21 21:31 · Score: 2, Interesting

Do 1024 cores constitute a kilocore? Or 1000? I'd love to see that debate move from hard disks to processors.

--
Bingo Dictionary - Pragmatist, n. A myopic idealist.

Re:Imagine by Siffy · 2010-11-21 21:38 · Score: 2, Interesting

Why not? He/she built a cluster for no use at all other than learning and fun. I can easily see the "use" for 1k cores given Intel's apparent interest in getting into the 3D market, or at least destroying Nvidia and ATI (something AMD has already done in name, but that's beside the point). For clusters it's a no-brainer to keep adding cores if you can increase the performance-per-watt ratio with each additional core. For desktops there will likely be a point where enough is enough, but I disagree that we've passed it. Software designers are still keeping up quite quickly with any headroom new hardware creates.

Re:Workaround, yeah by MichaelSmith · 2010-11-21 22:34 · Score: 4, Interesting

In my field it would be real-time conflict detection between aircraft. The better your conflict detection, the more aircraft you can pack into small volumes of space. There is a lot of money in that.

--
http://michaelsmith.id.au

Re:Imagine by bertok · 2010-11-21 23:10 · Score: 4, Interesting

depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL

And I believe you can crank that dial all the way up

Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).
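
The 7-14 year estimate is just a count of doublings. A minimal sketch, assuming a starting point of 8 cores and a fixed doubling period of one or two years (the function name is hypothetical):

```python
import math


def years_to_reach(target_cores, current_cores, years_per_doubling):
    """Years until the core count reaches target, given a fixed doubling period."""
    doublings = math.log2(target_cores / current_cores)
    return doublings * years_per_doubling


# Assumption: start from an 8-core desktop part, as mentioned in the post.
for period in (1, 2):
    years = years_to_reach(1024, 8, period)
    print(f"doubling every {period} yr: {years:.0f} years to 1024 cores")
```

Going from 8 to 1024 cores takes 7 doublings, which is where the 7-14 year range comes from; starting from 6 cores adds a bit under half a doubling.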

Don't you worry, the GHz war is not done!

There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.

Re:Instruction set... by kohaku · 2010-11-21 23:24 · Score: 3, Interesting

That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.

It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed" which isn't even a viable approach anymore.
When you said "compact" I assumed you meant the instruction set itself was compact rather than the average instruction length; I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially compared with things like ARM's Thumb-2.
If you take a look at Nehalem's pipeline, there's a significant chunk of it dedicated simply to translating x86 instructions into RISC uops, which is only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
Sure, you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of on-chip networks: they provide NUMA on Nehalem and whatever AMD have these days. Indeed, the article says:

Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, and Clearspeed's 2-year-old CSX700 had 192.

Re:Instruction set... by Arlet · 2010-11-21 23:49 · Score: 4, Interesting

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores

Nobody wants to put more cores on a die, but they're forced to do so because they reach the limits of a single core. I'd rather have as few cores as possible, but have each one be really powerful. Once multiple cores are required, I'd want them to stretch the coherent shared memory concept as far as it will go. When that concept doesn't scale anymore, use something like NUMA.

Small, message passing cores have been tried multiple times, and they've always failed. The problem is that the requirement of distributed state coherency doesn't go away. The burden only gets shifted from the hardware to the software, where it is just as hard to accomplish, but much slower. In addition, if you try to tackle the coherency problem in software, you don't get to benefit from hardware improvements.

Re:Imagine by TheRaven64 · 2010-11-21 23:55 · Score: 4, Interesting

Pretty much anything that I've written in Erlang uses (at least) a few thousand concurrent processes. I've never tried running it on more than a 64-core machine, but when I moved stuff from my single-core laptop to a 64-core SGI machine the load was pretty evenly distributed.

It's pretty easy to write concurrent code that scales as long as you respect one rule: No data may be both mutable and aliased. You can do this in object-oriented languages with the actor model, but languages like Erlang enforce it for you (at the cost of a few redundant copies).
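
As a rough illustration of the "no data both mutable and aliased" rule with the actor model, here is a minimal sketch using only the standard library. The `CounterActor` class and `count_words` helper are hypothetical names, and Python threads stand in for Erlang processes; this is not how Erlang itself implements anything.

```python
import queue
import threading


class CounterActor:
    """Owns its mutable state; the outside world only sends it messages."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._counts = {}  # mutable, but only the actor thread ever touches it
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        # Messages are tuples, so senders cannot change them after sending.
        self._inbox.put(message)

    def _run(self):
        while True:
            kind, payload = self._inbox.get()
            if kind == "add":
                self._counts[payload] = self._counts.get(payload, 0) + 1
            elif kind == "snapshot":
                # Reply with an immutable copy, never the dict itself.
                payload.put(tuple(sorted(self._counts.items())))
            elif kind == "stop":
                return

    def stop(self):
        self.send(("stop", None))
        self._thread.join()


def count_words(words, n_senders=4):
    actor = CounterActor()

    def send_all(chunk):
        for word in chunk:
            actor.send(("add", word))

    senders = [threading.Thread(target=send_all, args=(words[i::n_senders],))
               for i in range(n_senders)]
    for t in senders:
        t.start()
    for t in senders:
        t.join()
    reply = queue.Queue()  # explicitly synchronised communication object
    actor.send(("snapshot", reply))
    result = reply.get()
    actor.stop()
    return result


print(count_words(["a", "b", "a", "c", "a", "b"]))
```

No lock appears anywhere in the user code: the only shared objects are the queues, and the only mutable dictionary has a single owner.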

--
I am TheRaven on Soylent News

Re:Instruction set... by kohaku · 2010-11-22 00:33 · Score: 3, Interesting

they're forced to do so because they reach the limits of a single core

Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger; you have to go parallel. Even the Intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
Essentially there are two approaches to the problem of performance now. Both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few high-performance cores that won't scale horizontally (memory bottleneck).
The second is to have explicit hardware-supported parallelism with many, many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message-passing cores have failed. I've already given examples of ones currently on the market (Clearspeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable; in fact, it was done with the Transputer in the late 80s/early 90s. The only reason it's taking off now is that it's the only way forward as we hit the power wall, and shared memory/superscalar can't scale fast enough to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is that they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatibility. In fact, they did rather well in I/O control for things such as robotics, and XMOS continues to do well in that space.
The "coherency problem" isn't even part of a message passing architecture because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture in the same way as a shared memory one.

Re:Instruction set... by kohaku · 2010-11-22 01:05 · Score: 2, Interesting

There's a third option: combine the best of both worlds. Use powerful, superscalar cores with shared memory, as powerful as you can reasonably make them, and then run clusters of those in parallel.

Which is of course what is already being done, but whether that's the best approach remains to be seen. Communication is always the bottleneck in HPC systems, and many processors on chip with a fast interconnect seems to do very well, at least for Picochip (though it is a DSP chip, I think it's a valid comparison).

Well, there's your problem. Many real world applications can only be programmed that way.

Examples? It's just a different model; it doesn't prevent you from solving any problem.

The ClearSpeed 192-core CSX700 is on the market, but nobody is buying it

Yeah, that was a shame. The trouble is that HPC-specific chips are just going to get steamrolled on the price point by commodity (x86) hardware. But what about the other three that are selling like hotcakes?

Re:does it run Linux - yea but it is "boring" by vojtech · 2010-11-22 01:06 · Score: 2, Interesting

Well, well, I hit the Submit button too soon. Anyway, most common workloads are already seeing decreasing benefits around 32 parallel threads.
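
One common way to reason about this kind of diminishing return is Amdahl's law. The post does not name it, so treat this as an illustrative model rather than the poster's argument, and the 95% parallel fraction as a made-up figure:

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


# Assumption: 95% of the work parallelises; the real figure is workload-specific.
for n in (8, 32, 128, 1000):
    print(f"{n:5d} threads: {amdahl_speedup(0.95, n):5.1f}x")
```

With a 5% serial portion the speedup can never exceed 20x, so most of the gain is already used up by a few dozen threads.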

Re:Instruction set... by Anonymous Coward · 2010-11-22 01:32 · Score: 1, Interesting

Thanks for the links! I'll add GreenArrays and XMOS, although the GA interconnect seems overly primitive. Tilera tends to be mentioned a lot.

Re:Imagine by chrysrobyn · 2010-11-22 01:37 · Score: 5, Interesting

Don't you worry, the GHz war is not done! There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum [wikipedia.org] digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.

The GHz war is over. The speed of light won. A long time ago, it stopped being "all about the transistor" and started being "all about the wires". IBM won the race to copper in 180nm (back when it was 0.18um), and that helped make those technologies even better, but about the time we hit 90nm, semiconductors were "fast enough", or even by some measurements stopped being able to speed up. Since then, almost all speed increases have been largely (but not exclusively) due to the transistors getting smaller, reducing the distance wires need to go.

The RC delay of wires is the major problem. R isn't going to be getting much better than copper. Silver has a lower resistance by a little bit, but it's too reactive to be used anywhere real. In these geometries, any alloy would be insufficiently mixable to be reliable, to say nothing about more exotic materials (like ceramics). There's some room for improvement in the dielectric (the "C"), but by the time you make a box with corners covering water permeability, thermal coefficient of expansion close to the wires, mechanical properties friendly to sub micron manufacturing, you have to concede you're not going to be able to get more than 20% faster there (and that we could dispute separately).
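
To see why wire length matters so much, a first-order sketch using the Elmore approximation for a distributed RC wire (delay roughly r·c·L²/2) may help. The per-micron resistance and capacitance below are placeholder values chosen for round numbers, not data from any real process, and the function names are hypothetical.

```python
def wire_delay(length_um, r_per_um, c_per_um):
    """Elmore delay (seconds) of an unbuffered distributed RC wire."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2


def buffered_wire_delay(length_um, r_per_um, c_per_um, n_segments, buffer_delay):
    """Same wire split into equal segments, each driven by a repeater."""
    segment = wire_delay(length_um / n_segments, r_per_um, c_per_um)
    return n_segments * (segment + buffer_delay)


# Placeholder values, not taken from any real process:
R_PER_UM = 1.0       # ohms per micron
C_PER_UM = 0.2e-15   # farads per micron
for length in (500, 1000, 2000):
    d = wire_delay(length, R_PER_UM, C_PER_UM)
    print(f"{length:5d} um: {d * 1e12:6.1f} ps unbuffered")

# Assumed 10 ps per repeater, 4 segments over 2000 um.
b = buffered_wire_delay(2000, R_PER_UM, C_PER_UM, 4, 10e-12)
print(f" 2000 um: {b * 1e12:6.1f} ps with 4 repeaters")
```

Doubling the length quadruples the unbuffered delay. Repeaters bring the growth closer to linear, but each one costs area, power, and its own delay.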

Take a cache. The slowest path is having a memory cell read. That tiny little device needs to have a measurable change in voltage on the bitlines, and be sensed by a sensing structure. That sensing structure has nothing to do with storage, so it's pure overhead and thusly you want as few of them as possible. Can you have it 16 bits away? 32? The days are gone that it was 64 bits away for any meaningful performance. There's nothing you can do to the characteristics of that little device (which needs to be minimum feature size to maximize the density of the cache) to dominate over the characteristics of the bitline he's trying to affect.

Take a data path. Even if 95% of your data is highly predictable, easily pipelined stuff with local signals, your critical path is going to involve signals from other areas of the chip, and they're going to have to be rebuffered and trucked from hundreds of microns away. No giant buffer in the history of man can dominate over a long distance wire. The signal will show up "eventually".

3GHz is a good place to stop. We make it to 4GHz with compromises in power, but beyond that you're dedicating so much of your chip to rebuffering that you're blowing a lot of power on it. At that point, your pipeline has so many stages that branch mispredicts are very painful. You're devoting so much of your cycle time to setup and hold for your latches that you're going to be embarrassed at how little work you can do in each cycle.
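
The point about setup and hold eating the cycle can be put into rough numbers. A minimal sketch, assuming a fixed 40 ps of latch and clocking overhead per pipeline stage (an illustrative figure, not a measured one):

```python
def cycle_time_ps(clock_ghz):
    """Clock period in picoseconds."""
    return 1000.0 / clock_ghz


def useful_fraction(clock_ghz, latch_overhead_ps):
    """Share of each cycle left for logic after fixed latch overhead."""
    return max(0.0, 1.0 - latch_overhead_ps / cycle_time_ps(clock_ghz))


# Assumption: 40 ps of setup/hold/skew overhead per stage (illustrative only).
for ghz in (3, 4, 10, 25):
    print(f"{ghz:3d} GHz: {cycle_time_ps(ghz):6.1f} ps cycle, "
          f"{useful_fraction(ghz, 40.0):4.0%} useful")
```

Because the overhead stays roughly constant while the cycle shrinks, the share of each cycle available for real work falls as the clock rises.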

1 THz clock speeds are on their way, and maybe even higher. But they're not useful to CPUs or GPUs. They're useful for more exotic applications, primarily technology demonstrations.

Re:Instruction set... by Arlet · 2010-11-22 03:32 · Score: 2, Interesting

The way we do it now is a single filesystem layer which is, at all times, in a single coherent state. With today's shared memory systems, and cache coherency guaranteed by the hardware, that's reasonably easy to accomplish.

The current filesystem concept just doesn't map onto 1000 non-coherent cores.

Re:Nobody wants to put more cores on a chip? by Arlet · 2010-11-22 03:43 · Score: 2, Interesting

The '93 era Pentium they're talking about only has 3 million transistors, and only a fraction are needed to handle the x86 instruction set. Current transistor count goes into the billions, so as far as real estate goes, you can put 1000 Pentium class cores on a single die, despite the x86 translations.

Of course, the whole concept of 1000 cores running on a single die is only going to serve a small niche of applications.

Biggest problem and a fix... by Panaflex · 2010-11-22 03:44 · Score: 2, Interesting

IMHO the biggest problem with these multi-core chips is lock latency. Locking in the heap all works great, but a shared hardware register of locks would save a lot of cache-coherency traffic and MMU copies.

A 1024 slot register with instruction support for mutex and read-write locks would be fantastic.
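
The proposal is for hardware, but the programming model of a fixed, numbered pool of locks can be sketched in software. The `LockTable` class below is a hypothetical stand-in built on ordinary `threading.Lock` objects; it only covers plain mutexes (not read-write locks) and says nothing about the latency a real lock register would have.

```python
import threading


class LockTable:
    """Fixed pool of locks addressed by slot number (software stand-in)."""

    def __init__(self, n_slots=1024):
        self._locks = [threading.Lock() for _ in range(n_slots)]

    def slot_for(self, key):
        # Map any hashable key onto one of the fixed slots.
        return hash(key) % len(self._locks)

    def lock_for(self, key):
        # Return the slot's lock so callers can use it in a `with` block.
        return self._locks[self.slot_for(key)]


table = LockTable()
balances = {"alice": 0, "bob": 0}


def deposit(account, times):
    for _ in range(times):
        with table.lock_for(account):
            balances[account] += 1


threads = [threading.Thread(target=deposit, args=(name, 10_000))
           for name in ("alice", "bob", "alice", "bob")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balances)
```

Two keys may land in the same slot, which costs some contention but not correctness, much as a fixed-size hardware register would have to be shared between more than 1024 logical locks.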

I'm developing 20+Gbps applications - we need fast locks and low latency. Snap snap!!!

--
I said no... but I missed and it came out yes.

Re:Imagine by TheRaven64 · 2010-11-22 04:22 · Score: 2, Interesting

I don't know what extra detail you need - the rule should be pretty self-explanatory. If something is shared between two or more threads, it should be immutable. If something is mutable, only one thread / process should hold references to it.

The only exception to this rule is explicitly synchronised communication objects (message queues, process handles, and suchlike). If you follow this rule, then the only concurrency problems that you will have are caused by high-level design problems, rather than by low-level implementation problems.

Erlang enforces this by only having one mutable object: the process dictionary, which is only accessible by the process that owns it. Everything else is immutable.
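
Outside Erlang the rule has to be followed by convention. A small sketch of what that can look like in Python, assuming a hypothetical `Settings` object shared by several worker threads: the shared object is frozen, and each worker's mutable accumulator stays local to it.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Settings:
    """Shared between threads, so it is immutable."""
    scale: int
    offset: int


def worker(settings, values):
    total = 0  # mutable, but local to this call: no other thread sees it
    for v in values:
        total += v * settings.scale + settings.offset
    return total  # handed back as a plain immutable int


settings = Settings(scale=2, offset=1)
chunks = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
with ThreadPoolExecutor(max_workers=3) as pool:
    totals = list(pool.map(lambda chunk: worker(settings, chunk), chunks))
print(totals)

try:
    settings.scale = 3  # attempting to mutate shared data
except FrozenInstanceError:
    print("shared settings cannot be mutated")
```

Unlike Erlang, nothing stops a Python programmer from sharing a mutable list instead; the frozen dataclass only makes the intended discipline explicit.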

--
I am TheRaven on Soylent News
