Intel Talks 1000-Core Processors
angry tapir writes "An experimental Intel chip shows the feasibility of building processors with 1,000 cores, an Intel researcher has asserted. The architecture for the Intel 48-core Single Chip Cloud Computer processor is 'arbitrarily scalable,' according to Timothy Mattson. 'This is an architecture that could, in principle, scale to 1,000 cores,' he said. 'I can just keep adding, adding, adding cores.'"
Are they trying to reinvent Transputer? :)
But yes, I am happy to see Intel pushing it forward!
Paul B.
Having attended this presentation at Supercomputing 2010, I can for once say without a doubt that the article captured the essence of reality. The only part it left out is that the interconnect between all the processing elements uses significantly less energy than that of the previous 80-core chip; I think the figure was around 10% of chip power for the 48-core, and 30% for the 80-core. Oh, and MPI over TCP/IP was faster than the native message passing scheme for large messages.
Given that GPUs have had hundreds of processors for years (the power of CUDA is awesome!), this is long overdue from lazy CPU designers like Intel....
This just goes to show that if you care about having a future career (or even just continuing with your existing one) in programming, Learn a functional language NOW!
I dream of a nation where a man is not judged by his skin color but by a number assigned by a credit rating agency.
Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.
I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available. Linux can only go to 256 cores. Windows 2008 tops out at 64. But hey, if they did come to market, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).
Serious? Seriousness is well above my pay grade.
Running Linux on a 48-core system is boring, because it has already been run on a 64-core system in 2007 (at the time, Tilera said they would be up to 1000 cores in 2014; they're up to 100 cores per CPU now).
As far as I know, Linux currently supports up to 256 CPUs. I assume that means logical CPUs, so that, for example, this would support one CPU with 256 cores, or one CPU with 128 cores with two CPU threads per core, etc.
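To make the "logical CPUs" reading concrete, here is a minimal sketch of the arithmetic. The function name and the 256 limit passed in are just the figures assumed in the comment above, not anything taken from the kernel source.

```python
def logical_cpus(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Logical CPUs the OS would see: one per hardware thread."""
    return sockets * cores_per_socket * threads_per_core


def fits(limit: int, sockets: int, cores_per_socket: int, threads_per_core: int) -> bool:
    """True if the configuration stays within an assumed logical-CPU limit."""
    return logical_cpus(sockets, cores_per_socket, threads_per_core) <= limit


ASSUMED_LIMIT = 256  # the figure quoted in the comment, not verified here

print(logical_cpus(1, 256, 1))         # 256: one CPU with 256 single-threaded cores
print(logical_cpus(1, 128, 2))         # 256: one CPU, 128 cores, two threads per core
print(fits(ASSUMED_LIMIT, 1, 256, 2))  # False: 512 logical CPUs
```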
Please correct me if I got my facts wrong.
In the future, 1 million cores will probably be the minimum requirement for applications. We will then laugh at these stupid comments...
Image and audio recognition, true artificial intelligence, handling data from a huge number of different kinds of sensors, movement of motors (robots), data connections to everything around the computer, virtual worlds with thousands of AI characters with true 3D presentation... etc...etc... will consume all processing power available.
1000 cores is nothing... We need much more.
Dude, what the fuck, that's only 48 cores. How does that get you anywhere close to 1000?
Well, Watson, that's elementary...
Therefore, on top of the computation benefits derived from fully utilizing 1000 cores, one would have a pretty good heat source: 2150 Watts or so. One's choice what to do with it, but it's far too high for a domestic-sized slow cooker (the dishes would come with a weird burned taste).
Satisfied, now?
If not, to put things in perspective, assuming our ancestors (who could use only horses as a source of power) had wanted to use this computer, they'd need approx. 2.68 horses... but hey, wow... what a delight to play the MMORPG so smoothly... especially in "farming/grinding" phases.
PS. the above computations are meant to be funny and/or an exercise of approximating based on insufficient data and/or vent some frustration caused by "all work and no play", definitely a wasted time... Ah, yes, some karma would be nice, but not mandatory.
Questions raise, answers kill. Raise questions to stay alive.
http://www.sgi.com/products/servers/altix/uv/
2,048 cores (256 sockets) and 16TB of memory, one OS image.
Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.
Do 1024 cores constitute a kilocore? Or 1000? I'd love to see that debate move from hard disks to processors.
Bingo Dictionary - Pragmatist, n. A myopic idealist.
Why not? He/she built a cluster for no use at all other than learning and fun. I can easily see the "use" for 1k cores with Intel's apparent interest to get into the 3d market or at least destroy Nvidia and ATI (something AMD has already done in name but that's beside the point). For clusters it's a no-brainer to keep adding cores if you can increase performance per watt ratio with each additional core. For desktops there likely will be a point where enough is enough, but I disagree that we've passed it. Software designers are still keeping up quite quickly with any headroom new hardware creates.
In my field it would be real-time conflict detection between aircraft. The better your conflict detection, the more aircraft you can pack into small volumes of space. There is a lot of money in that.
http://michaelsmith.id.au
depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL
And I believe you can crank that dial all the way up
Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).
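The 7-14 year figure is just a count of doublings. A quick sketch of that arithmetic follows; the function names are made up for illustration, and the one-to-two-year doubling period is the assumption stated in the comment, not a measured trend.

```python
def doublings_needed(start_cores: int, target_cores: int) -> int:
    """How many times start_cores must double to reach at least target_cores."""
    n = 0
    cores = start_cores
    while cores < target_cores:
        cores *= 2
        n += 1
    return n


def years_range(start_cores, target_cores, fast_period=1.0, slow_period=2.0):
    """(optimistic, pessimistic) years, given a doubling period in years."""
    n = doublings_needed(start_cores, target_cores)
    return n * fast_period, n * slow_period


print(doublings_needed(8, 1024))  # 7
print(years_range(8, 1024))       # (7.0, 14.0)
```

Starting from 6 cores instead of 8 needs one more doubling, so the estimate is not very sensitive to the starting point.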
Don't you worry, the GHz war is not done!
There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.
It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed" which isn't even a viable approach anymore.
When you said "Compact" I assumed you meant the instruction set itself was compact rather than the average length. I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially considered against things like ARM's Thumb-2.
If you take a look at Nehalem's pipeline, there's a significant chunk of it simply dedicated to translating x86 instructions into RISC uops, which is only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
Sure, you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of on-chip networks: they provide NUMA on Nehalem and whatever AMD have these days. Indeed, in the article, it says
The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, and Clearspeed's 2-year-old CSX700 had 192.
Nobody wants to put more cores on a die, but they're forced to do so because they reach the limits of a single core. I'd rather have as few cores as possible, but have each one be really powerful. Once multiple cores are required, I'd want them to stretch the coherent shared memory concept as far as it will go. When that concept doesn't scale anymore, use something like NUMA.
Small, message passing cores have been tried multiple times, and they've always failed. The problem is that the requirement of distributed state coherency doesn't go away. The burden only gets shifted from the hardware to the software, where it is just as hard to accomplish, but much slower. In addition, if you try to tackle the coherency problem in software, you don't get to benefit from hardware improvements.
Pretty much anything that I've written in Erlang uses (at least) a few thousand concurrent processes. I've never tried running it on more than a 64-core machine, but when I moved stuff from my single-core laptop to a 64-core SGI machine the load was pretty evenly distributed.
It's pretty easy to write concurrent code that scales as long as you respect one rule: No data may be both mutable and aliased. You can do this in object-oriented languages with the actor model, but languages like Erlang enforce it for you (at the cost of a few redundant copies).
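For anyone who wants to see the rule in a language that does not enforce it, here is a rough actor-style sketch in Python. Everything in it (the CounterActor class, the message tuples, run_demo) is invented for illustration. The mutable dict lives inside one thread and is never handed out; everything that crosses a thread boundary is an immutable tuple, and the only shared objects are the queues, which are synchronised by the standard library. Python threads will not actually run this in parallel because of the interpreter lock, so treat it as a picture of the structure, not a benchmark.

```python
import queue
import threading


class CounterActor:
    """Owns a mutable dict; no other thread ever holds a reference to it."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        counts = {}  # mutable, but local to this thread, so never aliased
        while True:
            msg = self._inbox.get()
            if msg[0] == "add":
                _, key, amount = msg
                counts[key] = counts.get(key, 0) + amount
            elif msg[0] == "snapshot":
                _, reply = msg
                # Hand out an immutable copy, never the dict itself.
                reply.put(tuple(sorted(counts.items())))
            elif msg[0] == "stop":
                return

    def send(self, msg):
        self._inbox.put(msg)

    def snapshot(self):
        reply = queue.Queue(maxsize=1)
        self._inbox.put(("snapshot", reply))
        return reply.get()

    def stop(self):
        self._inbox.put(("stop",))
        self._thread.join()


def run_demo(n_senders=4, per_sender=1000):
    actor = CounterActor()

    def sender(name):
        for _ in range(per_sender):
            actor.send(("add", name, 1))  # immutable message

    threads = [threading.Thread(target=sender, args=(f"s{i}",))
               for i in range(n_senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    result = actor.snapshot()
    actor.stop()
    return result


print(run_demo())
```

No locks appear anywhere in the user code, and the counts still come out exact, because only one thread ever touches the dict.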
I am TheRaven on Soylent News
Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger, you have to go parallel. Even the intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
Essentially there are two approaches to the problem of performance now. Both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few, high performance cores that won't scale horizontally (memory bottleneck)
The second is to have explicit hardware-supported parallelism with many many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message passing cores have failed. I've already given examples of ones currently on the market (Clearspeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable, in fact it was done with the Transputer in the late 80s/early 90s. The only reason it's taking off now is because it's the only way forward as we hit the power wall, and shared memory/superscalar can't scale as fast to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is because they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatibility. In fact they did rather well in i/o control for things such as robotics, and XMOS continues to do well in that space.
The "coherency problem" isn't even part of a message passing architecture because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture in the same way as a shared memory one.
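As a rough illustration of what "the state is distributed amongst the parallel processes" looks like in practice, here is a small Python sketch. The worker and distributed_sum names are made up, and threads plus queues stand in for cores plus an on-chip network. Each worker keeps its own private running total, values reach it only as messages, and the coordinator combines the partial results at the end. Nothing is shared, so there is nothing to keep coherent.

```python
import queue
import threading


def worker(inbox, outbox):
    total = 0  # this worker's private share of the overall state
    while True:
        item = inbox.get()
        if item is None:       # sentinel: no more work is coming
            outbox.put(total)  # report the local result as a message
            return
        total += item


def distributed_sum(values, n_workers=4):
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, results))
               for inbox in inboxes]
    for t in threads:
        t.start()
    # Route each value to exactly one worker (round-robin).
    for i, v in enumerate(values):
        inboxes[i % n_workers].put(v)
    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    # Combine the partial sums; one message per worker.
    return sum(results.get() for _ in range(n_workers))


print(distributed_sum(range(1, 101)))  # 5050
```

The catch raised elsewhere in the thread still applies: this works because a sum splits cleanly into independent pieces, and workloads that need a globally consistent view have to pay for it in messages.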
Which is of course what is already being done, but whether that's the best approach remains to be seen. Communication is always the bottleneck in HPC systems, and many processors on chip with a fast interconnect seems to do very well, at least for Picochip (though it is a DSP chip, I think it's a valid comparison).
Examples? It's just a different model; it doesn't prevent you from solving any problem.
Yeah, that was a shame. The trouble is that HPC-specific chips are just going to get steamrolled on the price point by commodity (x86) hardware. But what about the other three that are selling like hotcakes?
Well, well, I hit the Submit button too soon. Anyway, most common workloads are already seeing decreasing benefits around 32 parallel threads.
Thanks for the links! I'll add GreenArrays and XMOS, although the GA interconnect seems overly primitive. Tilera tends to be mentioned a lot.
The GHz war is over. The speed of light won. A long time ago, it stopped being "all about the transistor" and started being "all about the wires". IBM won the race to copper in 180nm (back when it was 0.18um), and that helped make those technologies even better, but about the time we hit 90nm, semiconductors were "fast enough", or even by some measurements stopped being able to speed up. Since then, almost all speed increases have been largely (but not exclusively) due to the transistors getting smaller, reducing the distance wires need to go.
The RC delay of wires is the major problem. R isn't going to be getting much better than copper. Silver has a lower resistance by a little bit, but it's too reactive to be used anywhere real. In these geometries, any alloy would be insufficiently mixable to be reliable, to say nothing about more exotic materials (like ceramics). There's some room for improvement in the dielectric (the "C"), but by the time you make a box with corners covering water permeability, thermal coefficient of expansion close to the wires, mechanical properties friendly to sub micron manufacturing, you have to concede you're not going to be able to get more than 20% faster there (and that we could dispute separately).
Take a cache. The slowest path is having a memory cell read. That tiny little device needs to have a measurable change in voltage on the bitlines, and be sensed by a sensing structure. That sensing structure has nothing to do with storage, so it's pure overhead and thus you want as few of them as possible. Can you have it 16 bits away? 32? The days are gone that it was 64 bits away for any meaningful performance. There's nothing you can do to the characteristics of that little device (which needs to be minimum feature size to maximize the density of the cache) to dominate over the characteristics of the bitline it's trying to affect.
Take a data path. Even if 95% of your data is highly predictable, easily pipelined stuff with local signals, your critical path is going to involve signals from other areas of the chip, and they're going to have to be rebuffered and trucked from hundreds of microns away. No giant buffer in the history of man can dominate over a long distance wire. The signal will show up "eventually".
3GHz is a good place to stop. We make it to 4GHz with compromises in power, but beyond that you're dedicating so much of your chip to rebuffering that you're blowing a lot of power on it. At that point, your pipeline has so many stages that branch mispredicts are very painful. You're devoting so much of your cycle time to setup and holds for your latches that you're going to be embarrassed at how little work you can do in each cycle.
1 THz clock speeds are on their way, and maybe even higher. But they're not useful to CPUs or GPUs. They're useful for more exotic applications, primarily technology demonstrations.
The way we do it now is a single filesystem layer which is, at all times, in a single coherent state. With today's shared memory systems, and cache coherency guaranteed by the hardware, that's reasonably easy to accomplish.
The current filesystem concept just doesn't map onto 1000 non-coherent cores.
The '93 era Pentium they're talking about only has 3 million transistors, and only a fraction are needed to handle the x86 instruction set. Current transistor count goes into the billions, so as far as real estate goes, you can put 1000 Pentium class cores on a single die, despite the x86 translations.
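The back-of-the-envelope version of that claim, as a sketch. The 3 million figure is the one quoted above; the 3 billion die budget is only an assumed round number standing in for "into the billions", and the calculation deliberately ignores caches, interconnect and I/O, which would eat a large share of a real die.

```python
PENTIUM_TRANSISTORS = 3_000_000     # per-core figure quoted in the comment
ASSUMED_DIE_BUDGET = 3_000_000_000  # hypothetical "billions" budget, not a real part


def transistors_for(cores: int, per_core: int = PENTIUM_TRANSISTORS) -> int:
    """Transistors spent on cores alone, with nothing for cache or interconnect."""
    return cores * per_core


def cores_that_fit(budget: int, per_core: int = PENTIUM_TRANSISTORS) -> int:
    """Upper bound on the core count if the whole budget went to cores."""
    return budget // per_core


print(transistors_for(1000))               # 3000000000
print(cores_that_fit(ASSUMED_DIE_BUDGET))  # 1000
```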
Of course, the whole concept of a 1000 cores running on a single die is only going to serve a small niche of applications.
IMHO the biggest problem with these multi-core chips is the lock latency. Locking in heap all works great, but a shared hw register of locks would save a lot of cache coherency and MMU copies.
A 1024 slot register with instruction support for mutex and read-write locks would be fantastic.
I'm developing 20+Gbps applications - we need fast locks and low latency. Snap snap!!!
I said no... but I missed and it came out yes.
I don't know what extra detail you need - the rule should be pretty self explanatory. If something is shared between two or more threads, it should be immutable. If something is mutable, only one thread / process should hold references to it.
The only exception to this rule is explicitly synchronised communication objects (message queues, process handles, and suchlike). If you follow this rule, then the only concurrency problems that you will have are caused by high-level design problems, rather than by low-level implementation problems.
Erlang enforces this by only having one mutable object: the process dictionary, which is only accessible by the process that owns it. Everything else is immutable.
I am TheRaven on Soylent News