Intel Talks 1000-Core Processors
angry tapir writes "An experimental Intel chip shows the feasibility of building processors with 1,000 cores, an Intel researcher has asserted. The architecture for the Intel 48-core Single Chip Cloud Computer processor is 'arbitrarily scalable,' according to Timothy Mattson. 'This is an architecture that could, in principle, scale to 1,000 cores,' he said. 'I can just keep adding, adding, adding cores.'"
You've obviously never worked in Aerospace.
I can bring a quad core Xeon system to its knees running Catia. (I mean, 100% saturation, all 4 cores, with IO contention.) I do it fairly regularly too.
Might have something to do with the NP-Hard problem of resolving tangencies on extremely complex nurbs surfaces. (aircraft skins).
Granted, that is not a "normal" workstation; But I would be VERY happy indeed to have a 1000 core workstation at my disposal. Maybe then I could actually work with Gulfstream's horrible part models where they include literally the whole god-damn aircraft's surface geometry in the digital part model for a fucking bolt. (Guess what happens when you load several such models, and digitally assemble them. I have seen a 64 bit workstation allocate over 8gb of swap because of them and their dumbassery.)
Now, if I could get one with over 1TB of RAM installed too, then I'd be in business.
It's an interesting machine. It's a shared-memory multiprocessor without cache coherency. So one way to use it is to allocate disjoint memory to each CPU and run it as a cluster. As the article points out, that is "uninteresting", but at least it's something that's known to work.
Doing something fancier requires a new OS, one that manages clusters, not individual machines. One of the major hypervisors, like Xen, might be a good base for that. Xen already knows how to manage a large number of virtual machines. Managing a large number of real machines with semi-shared memory isn't that big a leap. But that just manages the thing as a cluster. It doesn't exploit the intercommunication.
Intel calls this "A Platform for Software Innovation". What that means is "we have no clue how to program this thing effectively. Maybe academia can figure it out". The last time they tried that, the result was the Itanium.
Historically, there have been far too many supercomputer architectures roughly like this, and they've all been duds. The NCube Hypercube, the Transputer, and the BBN Butterfly come to mind. The Cell machines almost fall into this category. There's no problem building the hardware. It's just not very useful, really tough to program, and the software is too closely tied to a very specific hardware architecture.
Shared-memory multiprocessors with with cache coherency have already reached 256 CPUs. You can even run Windows Server or Linux on them. The headaches of dealing with non-cache-coherent memory may not be worth it.
Linux can only go to 256 cores.
Uhmm no.
./arch/ia64/Kconfig: int "Maximum number of CPUs (2-4096)"
/arch/powerpc/platforms/Kconfig.cputype: int "Maximum number of CPUs (2-8192)"
In x86 we have:
config MAXSMP
bool "Enable Maximum number of SMP Processors and NUMA Nodes"
depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL
And I believe you can crank that dial all the way up
Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).
It seem like I've been here before.
Seastead this.
The first time was the i432 http://en.wikipedia.org/wiki/Intel_iAPX_432 Anyone remember that hype? Got to love the first line of the Wikipedia article "The Intel iAPX 432 was a commercially unsuccessful 32-bit microprocessor architecture, introduced in 1981."
The second time was the Itanium (aka Itanic) that was going to bring VLIW to the masses. Check out some of the juicy parts of the timeline also over on Wikipedia http://en.wikipedia.org/wiki/Itanium#Timeline
1997 June: IDC predicts IA-64 systems sales will reach $38bn/yr by 2001
1998 June: IDC predicts IA-64 systems sales will reach $30bn/yr by 2001
1999 October: the term Itanic is first used in The Register
2000 June: IDC predicts Itanium systems sales will reach $25bn/yr by 2003
2001 June: IDC predicts Itanium systems sales will reach $15bn/yr by 2004
2001 October: IDC predicts Itanium systems sales will reach $12bn/yr by the end of 2004
2002 IDC predicts Itanium systems sales will reach $5bn/yr by end 2004
2003 IDC predicts Itanium systems sales will reach $9bn/yr by end 2007
2003 April: AMD releases Opteron, the first processor with x86-64 extensions
2004 June: Intel releases its first processor with x86-64 extensions, a Xeon processor codenamed "Nocona"
2004 December: Itanium system sales for 2004 reach $1.4bn
2005 February: IBM server design drops Itanium support
2005 September: Dell exits the Itanium business
2005 October: Itanium server sales reach $619M/quarter in the third quarter.
2006 February: IDC predicts Itanium systems sales will reach $6.6bn/yr by 2009
2007 November: Intel renames the family from Itanium 2 back to Itanium.
2009 December: Red Hat announces that it is dropping support for Itanium in the next release of its enterprise OS
2010 April: Microsoft announces phase-out of support for Itanium.
So how do you think it will go this time?
Why is Snark Required?
You're having a supercomputer on your desk right now. It's called a "GPU", and most likely, it sports many hundred cores. Oh, and the killer app you mean, that's whatever latest DX11/Opengl4 game you prefer.
Experiments and other stuff
The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86. And there are even a number of non-x86 machines today that reach these sizes in a cache-coherent (ccNUMA) manner that Linux works well on. You still have to be careful with application design, though, because it's fairly easy to hit bottlenecks either in the application or in the kernel that will limit scalability. Most common workloads are already seeing
Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.
I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available. Linux can only go to 256 cores. Windows 2008 tops out at 64. But hey, if they did come to market, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).
ummm no. Windows 2008 can handle 64 SOCKETS, it currently scales to 256 cores
Isn't that Intel's pet project for the last decade?
No sig today...
Linux can only go to 256 cores. Windows 2008 tops out at 64.
Linux supports more than 256 cores.
MAINLINE:
Maximum number of CPUs / CONFIG_NR_CPUS:
This allows you to specify the maximum number of CPUs which this kernel will support. The maximum supported value is 512 and the minimum value which makes sense is 2. This is purely to save memory - each supported CPU adds approximately eight kilobytes to the kernel image.
I know SGI has systems running 4096 CPUs with SUSE Linux.
IF YOU GOT A COMMS ERROR, THE ONLY RECOVERY MECHANISM WAS A TOTAL SYSTEM REBOOT.
That is as crap as you can get! TPP/IP might be an improvement, but HDLC would have cracked the Transputer's problems, and it was already over 15 years old when the transputer was invented.
Yes I did build a Transputer based system, and yes it did work. (but...)
Sent from my ASR33 using ASCII
The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86
That's kind-of true, but quite misleading. 8192 is the hard limit, but scheduler and related overhead means that the performance gets pretty poor long before then. Please don't cite the big SGI and IBM machines as counter examples. The SGI machines effectively run a cluster OS, but with hardware distributed shared memory. They are 'single system image' in that they appear to be one OS to the user, but each board has its own kernel, I/O peripherals and memory and works largely independently except when accessing data from a remote node (handled by the hardware) or migrating processes to another node (kernel initiates this when it's too heavily loaded on a single node).
The big IBM machines have a similar design, although their big supercomputers don't actually run Linux at all in a meaningful sense. They run no OS on the processors that do the real work - the big compute jobs run without anything interrupting them or competing for CPU time - and run Linux on the coprocessors that handle I/O.
I am TheRaven on Soylent News
I had a board with 4 T800s in my 286 PC, I wrote a raytracer for it.
The chips were OK but the compilers and development kit were terrible.
No sig today...
The kernel needs some data structures per processor. 8192 means it needs a 15-bit index for them. I'm not certain about the Linux kernel, but in other kernels it's quite common for this to be squeezed in to other values for various reasons, so adding more processors requires you to either increase the size of other data structures (often ones designed to be exactly one word long). Not impossible, but more effort than just changing a constant.
The reason for the limit in the Windows NT kernel is that various things use bit masks with processor IDs as the indexes. For example, when defining processor affinity set you have an n-bit bitfield (one bit per supported processor), with the bit set if the thread is allowed to run on that processor. At 256 bits (the current limit for Windows), these are already pretty large to scan (especially since the kernel isn't allowed to use SSE instructions, meaning that it's potentially got to be 4 64-bit lsb-tests to find the next core to use).
I am TheRaven on Soylent News
Well that would be true, but the really complex x86 instructions are rarely used, so you're not really adding much in the way of code density, and you have to add a lot of hardware complexity to decode it. Not only that, more complex instructions mean bigger pipelines which mean bigger branch penalties.
The paper referenced in the arcticle can be found here.
Fascinating that MPI works that well unmodified.
"Lazy CPU designers" hah!
GPUs are severely limited on their types of tasks. Instead of a 1600 core GPU, pretend your CPU has as single large SIMD register that can hold 1600 floats. Now, it would be great at crunching large matrices of floats and utterly suck at everything else. That's a GPU in a nutshell. GPUs are absolutely horrible at branches. If one core takes a branch, every core in that core's group must stall and wait for the branch to finish. All cores must be working on the same instruction at the same time and branches mess with that. GPUs != CPUs
When Intel last talked about their 80 core CPUs, they talked about getting rid of cache-coherency in order to scale, AMD also recognizes this. This would mean OSs and most Apps would NOT be backwards compatible even if still using x86 instructions. Although, it would be possible to run apps in a VM that emulated cache-coherency.