Intel Talks 1000-Core Processors

Jeez... by Joe+Snipe · 2010-11-21 18:54 · Score: 5, Funny

I hope he never works for Gillette.

--
Sometimes, life itself is sarcasm...

Re:Jeez... by Monkey-Man2000 · 2010-11-21 19:10 · Score: 3, Funny

I hope he never works for Gillette.
Obligatory Onion

--
This post was generated by a Cadre of Uber Monkeys for Monkey-Man2000 (603495).
Re:Jeez... by monkeySauce · 2010-11-21 19:15 · Score: 3, Funny

Other way around; he used to work for Gillette. He left after they cancelled his 1000-blade razor project.
Re:Jeez... by Slashcrunch · 2010-11-21 19:43 · Score: 4, Funny

Other way around; he used to work for Gillette. He left after they cancelled his 1000-blade razor project.
Yes, I also heard about the 1000-blade project getting cut...
Re:Jeez... by Bill+Dog · 2010-11-21 21:15 · Score: 4, Funny

...in the nick of time.

--
Attention zealots and haters: 00100 00100

Message passing between cores? Hmm... by PaulBu · 2010-11-21 19:00 · Score: 3, Interesting

Are they trying to reinvent Transputer? :)

But yes, I am happy to see Intel pushing it forward!

Paul B.

Re:Message passing between cores? Hmm... by TinkersDamn · 2010-11-21 20:37 · Score: 2, Interesting

Yes, I've been wondering the same thing. Transputers contained key ideas that seem to be coming around again...
But a more crucial thing might be how much heat you can handle on one chip. These guys are already at 25-125 watts, likely depending on how many cores are actually turned on. After all, they're playing pretty hefty heat management tricks on current i7s and Phenoms.
http://techreport.com/articles.x/15818/2
What use are 48 cores, let alone 1000 if they're all being slowed down to 50% or whatever by heat and power juggling?
Re:Message passing between cores? Hmm... by Anne+Thwacks · 2010-11-21 23:35 · Score: 3, Informative

They were, apart from the comms protocol, which was a pile of poo.
IF YOU GOT A COMMS ERROR, THE ONLY RECOVERY MECHANISM WAS A TOTAL SYSTEM REBOOT.
That is as crap as you can get! TCP/IP might be an improvement, but HDLC would have cracked the Transputer's problems, and it was already over 15 years old when the Transputer was invented.
Yes I did build a Transputer based system, and yes it did work. (but...)

--
Sent from my ASR33 using ASCII
Re:Message passing between cores? Hmm... by Joce640k · 2010-11-22 00:16 · Score: 2, Informative

I had a board with 4 T800s in my 286 PC, I wrote a raytracer for it.
The chips were OK but the compilers and development kit were terrible.

--
No sig today...

Could be good for games using raytracing by mentil · 2010-11-21 19:01 · Score: 4, Insightful

This is for server/enterprise usage, not consumer usage. That said, it could scale to the number of cores necessary to make realtime raytracing work at 60fps for computer games. Raytracing could be the killer app for cloud gaming services like OnLive, where the power to do it is unavailable for consumer computers, or prohibitively expensive. The only way Microsoft etc. would be able to have comparable graphics in a console in the next few years is if it were rental-only like the Neo-Geo originally was.

--
Corruption is convincing someone that the selfless ideal is the same as their selfish ideal.

Obligatory XKCD by Anonymous Coward · 2010-11-21 19:02 · Score: 2, Funny

http://xkcd.com/619/

Bring out your Memes! by SixDimensionalArray · 2010-11-21 19:07 · Score: 4, Funny

Imagine a Beowulf cluster of th^H^H^H

Ah, forget it, the darn thing practically is one already! :/

"Imagine exactly ONE of those" just doesn't sound the same.

Re:Bring out your Memes! by rrohbeck · 2010-11-21 20:55 · Score: 3, Funny

I've said it for years: 640K cores ought to be enough for anybody.

--
thegodmovie.com - watch it

accurate representation by pyronordicman · 2010-11-21 19:12 · Score: 5, Interesting

Having attended this presentation at Supercomputing 2010, for once I can say without a doubt that the article captured the essence of reality. The only part it left out is that the interconnect between all the processing elements uses significantly less energy than that of the previous 80-core chip; I think the figure was around 10% of chip power for the 48-core, and 30% for the 80-core. Oh, and MPI over TCP/IP was faster than the native message passing scheme for large messages.

You wanna impress me? by Anonymous Coward · 2010-11-21 19:30 · Score: 2, Funny

Make a processor with four asses.

Future of Programming by igreaterthanu · 2010-11-21 19:31 · Score: 4, Interesting

This just goes to show that if you care about having a future career (or even just continuing with your existing one) in programming, Learn a functional language NOW!

--
I dream of a nation where a man is not judged by his skin color but by an number assigned by a credit rating agency.

Re:Future of Programming by jamesswift · 2010-11-21 20:37 · Score: 2, Interesting

It's quite something isn't it, how so few people on even slashdot seem to get this. Old habits die hard I guess.
Years ago a clever friend of mine clued me into how functional was going to be important.
He was so right and the real solutions to concurrency (note, not parallelism which is easy enough in imperative) are in the world of FP or at least mostly FP.
My personal favourite so far is Clojure which has the most comprehensive and realistic approach to concurrency I've seen yet in a language ready for real world work.
The key thing to learn from it is how differently you need to approach your problem to take advantage of a multi-core world.
Clojure itself may never become a top-5 language but the way it approaches the problem surely will be seen in other future FP langs.
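
To give a rough idea of what "approaching the problem differently" can look like, here is a minimal sketch in Python rather than Clojure. The work is expressed as a pure function over immutable input, so it can be handed to a pool of workers without any locks. The names `word_count` and `parallel_map` are made up for this illustration and do not come from any library.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(line: str) -> int:
    """Pure function: the result depends only on the argument, no shared state."""
    return len(line.split())

def parallel_map(func, items, workers=4):
    """Apply a pure function to each item using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, whatever order the workers finish in
        return list(pool.map(func, items))

lines = ("a b c", "d e", "f")  # immutable input: a tuple of strings
counts = parallel_map(word_count, lines)
total = sum(counts)
```

In standard CPython, threads will not speed up CPU-bound work like this; the point is only the shape of the code: because nothing is mutated, the same function could be moved to processes or other machines without changes.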

--
i wish i could stop
Re:Future of Programming by Anonymous Coward · 2010-11-21 20:43 · Score: 5, Insightful

Learn a functional language. Learn it not for some practical reason. Learn it because having another view will give you interesting choices even when writing imperative languages. Every serious programmer should try to look at the important paradigms so that he can freely choose to use them where appropriate.

Re:Imagine by JWSmythe · 2010-11-21 19:33 · Score: 2, Interesting

Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.

I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available. Linux can only go to 256 cores. Windows 2008 tops out at 64. But hey, if they did come to market, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).

--
Serious? Seriousness is well above my pay grade.

Re:One question? by JWSmythe · 2010-11-21 19:37 · Score: 2, Insightful

The only thing I'd be compensating for is the fact I can't do calculations at Exaflop rates in my head.

Just like my car only compensates for the fact I can't run at 165mph. :)

--
Serious? Seriousness is well above my pay grade.

Re:Biggest Hurdle Not Cores by Anonymous Coward · 2010-11-21 19:38 · Score: 3, Insightful

Basically, we are going to need compilers that automatically take advantage of all that parallelism without making you think about it too much, and programming languages that are designed to make your programs parallel-friendly. Even Microsoft is finally starting to edge in this direction with F# and some new features of .NET 4.0. Look at Haskell and Erlang for examples of languages that take such things more seriously, even if the world takes them less seriously.

I don't know about AI, but almost certainly we will end up with both compilers and virtual machines that are aware of parallelism and try to take advantage of it whenever possible.

But still, certain algorithms just aren't very friendly to parallelism no matter what technology you apply to them.
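
One common way to put a number on that last point is Amdahl's law, which the post does not name but which describes the same limit: if only a fraction of a program can run in parallel, the serial remainder caps the speedup. A small sketch, where the 95% figure is purely an assumed example:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a program can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Hypothetical program in which 95% of the work parallelizes perfectly
for n in (4, 48, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Under that assumption the bound stays below 20x no matter how many cores are added, since the 5% serial part never shrinks.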

Re:does it run Linux - yea but it is "boring" by RAMMS+EIN · 2010-11-21 19:45 · Score: 4, Interesting

Running Linux on a 48-core system is boring, because it has already been run on a 64-core system in 2007 (at the time, Tilera said they would be up to 1000 cores in 2014; they're up to 100 cores per CPU now).

As far as I know, Linux currently supports up to 256 CPUs. I assume that means logical CPUs, so that, for example, this would support one CPU with 256 cores, or one CPU with 128 cores with two CPU threads per core, etc.

--
Please correct me if I got my facts wrong.

1000 cores is easy! by Jason+Kimball · 2010-11-21 19:46 · Score: 5, Funny

1000 cores on a chip isn't too bad. I already have one with 110 cores.

That's only 10 more cores!

Re:1000 cores is easy! by prider · 2010-11-21 21:52 · Score: 2, Funny

Only 10 types of people caught that..
Re:1000 cores is easy! by roman_mir · 2010-11-22 00:26 · Score: 3, Funny

That has got to be the funniest thing I've read here in a month.
- jesus, that must have been one sad month.

--
You can't handle the truth.

Instruction set... by KonoWatakushi · 2010-11-21 19:47 · Score: 3, Insightful

"Performance on this chip is not interesting," Mattson said. It uses a standard x86 instruction set.

How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel.

There is no point in tying these massively parallel architectures to some ancient ISA.

Re:Instruction set... by kohaku · 2010-11-21 21:47 · Score: 4, Insightful

There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.
Uh, scalable? Not really... The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.

The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.
Whoa nelly. Compact? I'm not sure where you got that idea, but it's called CISC and not RISC for a reason! If you think x86 is compact, you might be interested to find out that you can have a fifteen byte instruction. In fact, on the i7 line, the instructions are so complex it's not even worth writing a "real" decoder- they're translated in real-time into a RISC instruction set! If Intel would just abandon x86, they could reduce their cores by something like 50%!
The low number of registers _IS_ a problem. The only reason there are only four is because of backwards compatibility. It definitely is a problem for scalability; one cannot simply rely on a shared memory architecture to scale vertically indefinitely, you just use too much power as die size increases, and memory just doesn't scale up as fast as the number of transistors on a CPU.
A far better approach is to have a decent model of parallelism (CSP, Pi-calculus, Ambient calculus) underlying the architecture and to provide a simple architecture with primitives supporting features of these calculi, such as channel communication. There are plenty of startups doing things like this, not just Intel, and they've already products in the market- though not desktop processors. Picochip and Icera to name just a couple, not to mention things like GPGPU (Fermi, etc.)
Really, the way to go is small, simple, low power cores with on-chip networks which can scale up MUCH better than just the old intel method of "More transistors, increase clock speed, bigger cache".
Re:Instruction set... by Arlet · 2010-11-21 22:35 · Score: 3, Insightful

The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.
That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.

you might be interested to find out that you can have a fifteen byte instruction
So ? It's not the maximum instruction length that counts, but the average. In typical programs that's closer to three. Frequently used opcodes like push/pop only take a single byte. Compare to a DEC Alpha architecture, where nearly every single instruction uses 15 bits just to tell which registers are used, no matter whether a function needs that many registers.
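
The 15-bit figure is simple arithmetic: Alpha has 32 integer registers, so naming one takes 5 bits, and its three-operand instructions name three of them. A quick sketch (the function name is made up for illustration):

```python
def register_field_bits(num_registers: int, operands: int) -> int:
    """Bits an instruction spends naming `operands` registers out of `num_registers`."""
    bits_per_register = (num_registers - 1).bit_length()  # e.g. 32 registers -> 5 bits
    return bits_per_register * operands

alpha_bits = register_field_bits(32, 3)   # three-operand format, 32 registers
share_of_word = alpha_bits / 32           # fraction of a fixed 32-bit instruction
```

So close to half of every fixed-width 32-bit instruction goes to register fields, whether or not the code at hand needs that many registers.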

If Intel would just abandon x86, they could reduce their cores by something like 50%!
Even if that's true (I doubt it), who cares? The problem is not that Intel has too many transistors for a given area. The problem is just the opposite. They have the capability to put more transistors in a core than they know what to do with. Also, typically half the chip is for the cache memories, and the compact instruction set helps to use that cache memory more effectively.

one cannot simply rely on a shared memory architecture to scale vertically indefinitely
Sure you can. Shared memory architectures can do everything explicit channel communication architectures can do, plus you have the benefit that the communication details are hidden from the programmer, allowing improvements to the implementation without having to rewrite your software. Sure, the hardware is more complex, but transistors are dirt cheap, so I'd rather put the complexity in the hardware.
Re:Instruction set... by kohaku · 2010-11-21 23:24 · Score: 3, Interesting

That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.
It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed" which isn't even a viable approach anymore.
When you said "Compact" I assumed you meant the instruction set itself was compact rather than the average length- I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially considered against things like ARM's Thumb-2.
If you take a look at Nehalem's pipeline, there's a significant chunk of it simply dedicated to translating x86 instructions into RISC uops, which is only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
Sure you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of on-chip networks- they provide a NUMA on Nehalem and whatever AMD have these days. Indeed, in the article, it says

Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, and Clearspeed's 2-year-old CSX700 had 192.
Re:Instruction set... by Arlet · 2010-11-21 23:49 · Score: 4, Interesting

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores
Nobody wants to put more cores on a die, but they're forced to do so because they reach the limits of a single core. I'd rather have as few cores as possible, but have each one be really powerful. Once multiple cores are required, I'd want them to stretch the coherent shared memory concept as far as it will go. When that concept doesn't scale anymore, use something like NUMA.
Small, message passing cores have been tried multiple times, and they've always failed. The problem is that the requirement of distributed state coherency doesn't go away. The burden only gets shifted from the hardware to the software, where it is just as hard to accomplish, but much slower. In addition, if you try to tackle the coherency problem in software, you don't get to benefit from hardware improvements.
Re:Instruction set... by kohaku · 2010-11-22 00:33 · Score: 3, Interesting

they're forced to do so because they reach the limits of a single core
Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger, you have to go parallel. Even the intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
Essentially there are two approaches to the problem of performance now. Both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few, high performance cores that won't scale horizontally (memory bottleneck)
The second is to have explicit hardware-supported parallelism with many many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message passing cores have failed. I've already given examples of ones currently on the market (Clearspeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable, in fact it was done with the Transputer in the late 80s/early 90s. The only reason it's taking off now is because it's the only way forward as we hit the power wall, and shared memory/superscalar can't scale as fast to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is because they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatibility. In fact they did rather well in I/O control for things such as robotics, and XMOS continues to do well in that space.
The "coherency problem" isn't even part of a message passing architecture because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture in the same way as a shared memory one.
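
As a toy illustration of that style (not Intel's API or any real chip's primitives), here is a sketch in which each worker keeps its state private and the only interaction is sending messages over channels. Python threads and `queue.Queue` objects stand in for cores and hardware channels; `worker` and `run` are hypothetical names.

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Owns its running total privately; talks to others only via messages."""
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel: no more work
            outbox.put(total)    # report the private state as a message
            return
        total += msg

def run(values, n_workers=4):
    """Distribute values round-robin over workers and combine their replies."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q, results)) for q in inboxes]
    for t in threads:
        t.start()
    for i, v in enumerate(values):
        inboxes[i % n_workers].put(v)    # each value goes to exactly one worker
    for q in inboxes:
        q.put(None)
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))

grand_total = run(range(1, 101))
```

No worker ever reads another worker's `total`, so there is nothing to keep coherent; the cost is that the program has to be decomposed this way up front.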
Re:Instruction set... by kohaku · 2010-11-22 01:05 · Score: 2, Interesting

There's a third option: combine the best of both worlds. Use powerful, superscalar cores with shared memory, as powerful as you can reasonably make them, and then run clusters of those in parallel.
Which is of course what is already being done, but whether that's the best approach remains to be seen. Communication is always the bottleneck in HPC systems, and many processors on chip with a fast interconnect seems to do very well, at least for Picochip (though it is a DSP chip, I think it's a valid comparison).

Well, there's your problem. Many real world applications can only be programmed that way.
Examples? It's just a different model, it doesn't prevent you solving any problem.

The ClearSpeed 192-core CSX700 is on the market, but nobody is buying it
Yeah, that was a shame. The trouble is that HPC-specific chips are just going to get steamrolled on the price point by commodity (x86) hardware. But what about the other three that are selling like hotcakes?
Re:Instruction set... by Arlet · 2010-11-22 02:26 · Score: 2, Insightful

Examples? It's just a different model, it doesn't prevent you solving any problem.
A typical consumer desktop machine, running typical programs for instance. In order to use these cores effectively, all these programs need to rewritten. Imagine your word processor reformatting a 500 page document on 1000 cores. It's just not going to work very well.
How about the operating system ? 1000 different cores all trying to access a file system on a single physical drive. How are you going to run that efficiently ?
Re:Instruction set... by Arlet · 2010-11-22 03:32 · Score: 2, Interesting

The way we do it now is a single filesystem layer which is, at all times, in a single coherent state. With today's shared memory systems, and cache coherency guaranteed by the hardware, that's reasonably easy to accomplish.
The current filesystem concept just doesn't map onto 1000 non-coherent cores.

Re:Temperature? by TapeCutter · 2010-11-21 19:50 · Score: 2, Funny

1.3 billion transistors!!! When I was a kid we had 9 and you could open the box and count 'em.

--
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.

1000 cores is nothing by Anonymous Coward · 2010-11-21 19:55 · Score: 5, Interesting

In the future, 1 million cores will probably be the minimum requirement for applications. We will then laugh at these stupid comments...

Image and audio recognition, true artificial intelligence, handling data from a huge number of different kinds of sensors, movement of motors (robots), data connections to everything around the computer, virtual worlds with thousands of AI characters with true 3D presentation... etc...etc... will consume all processing power available.

1000 cores is nothing... We need much more.

Re:1000 cores is nothing by Electricity+Likes+Me · 2010-11-21 20:48 · Score: 2, Insightful

1000 cores at 1 GHz on a single chip, networked to 1000 other chips, would probably just about make a non-real-time simulation of a full human brain possible (going off something I read about this somewhere). Although if it is possible to arbitrarily scale the number of cores, then we might be able to seriously consider building a system of very simple processors acting as electronic neurons.
Re:1000 cores is nothing by pitchpipe · 2010-11-22 07:03 · Score: 2, Funny

Yes, an while we are at it gat a working EMH/ECH(star-trek voyager) and a mobuile emitor :)
Or a working spell check program ;-)

--
Look where all this talking got us, baby.

Re:Workaround, yeah by wierd_w · 2010-11-21 20:19 · Score: 5, Informative

You've obviously never worked in Aerospace.

I can bring a quad core Xeon system to its knees running Catia. (I mean, 100% saturation, all 4 cores, with IO contention.) I do it fairly regularly too.

Might have something to do with the NP-Hard problem of resolving tangencies on extremely complex NURBS surfaces (aircraft skins).

Granted, that is not a "normal" workstation; But I would be VERY happy indeed to have a 1000 core workstation at my disposal. Maybe then I could actually work with Gulfstream's horrible part models where they include literally the whole god-damn aircraft's surface geometry in the digital part model for a fucking bolt. (Guess what happens when you load several such models, and digitally assemble them. I have seen a 64 bit workstation allocate over 8gb of swap because of them and their dumbassery.)

Now, if I could get one with over 1TB of RAM installed too, then I'd be in business.

"Build it and they will come" - NOT by Animats · 2010-11-21 20:34 · Score: 4, Informative

It's an interesting machine. It's a shared-memory multiprocessor without cache coherency. So one way to use it is to allocate disjoint memory to each CPU and run it as a cluster. As the article points out, that is "uninteresting", but at least it's something that's known to work.
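
A minimal sketch of that "run it as a cluster" approach: carve the address space into non-overlapping ranges, one per core, so no two cores ever touch the same memory and coherency never comes up. The function name and the 8 GiB figure are assumptions for illustration, not details of the actual chip.

```python
def disjoint_regions(total_bytes: int, n_cores: int):
    """Split [0, total_bytes) into n_cores non-overlapping (start, end) ranges."""
    base, extra = divmod(total_bytes, n_cores)
    regions = []
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)  # spread the remainder evenly
        regions.append((start, start + size))
        start += size
    return regions

regions = disjoint_regions(total_bytes=8 * 2**30, n_cores=48)  # hypothetical 8 GiB
```

Each core would then boot its own OS image inside its range and talk to the others only by explicit messages.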

Doing something fancier requires a new OS, one that manages clusters, not individual machines. One of the major hypervisors, like Xen, might be a good base for that. Xen already knows how to manage a large number of virtual machines. Managing a large number of real machines with semi-shared memory isn't that big a leap. But that just manages the thing as a cluster. It doesn't exploit the intercommunication.

Intel calls this "A Platform for Software Innovation". What that means is "we have no clue how to program this thing effectively. Maybe academia can figure it out". The last time they tried that, the result was the Itanium.

Historically, there have been far too many supercomputer architectures roughly like this, and they've all been duds. The NCube Hypercube, the Transputer, and the BBN Butterfly come to mind. The Cell machines almost fall into this category. There's no problem building the hardware. It's just not very useful, really tough to program, and the software is too closely tied to a very specific hardware architecture.

Shared-memory multiprocessors with cache coherency have already reached 256 CPUs. You can even run Windows Server or Linux on them. The headaches of dealing with non-cache-coherent memory may not be worth it.

Re:Imagine by seifried · 2010-11-21 20:37 · Score: 5, Informative

Linux can only go to 256 cores.

Uhmm no.

./arch/ia64/Kconfig: int "Maximum number of CPUs (2-4096)"
./arch/powerpc/platforms/Kconfig.cputype: int "Maximum number of CPUs (2-8192)"

In x86 we have:

config MAXSMP
	bool "Enable Maximum number of SMP Processors and NUMA Nodes"
	depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL

And I believe you can crank that dial all the way up

Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).
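
The 7-14 year estimate can be checked with a couple of lines, assuming a steady doubling starting from 8 cores (the function name is made up for this sketch):

```python
import math

def years_to_reach(target_cores: int, current_cores: int, years_per_doubling: float) -> float:
    """Years until the core count reaches the target, assuming steady doubling."""
    doublings = math.log2(target_cores / current_cores)
    return doublings * years_per_doubling

fast = years_to_reach(1024, 8, 1.0)   # doubling every year
slow = years_to_reach(1024, 8, 2.0)   # doubling every two years
```

Going from 8 to 1024 cores is seven doublings, which gives 7 years at one doubling per year and 14 at one per two years.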

Re:Temperature? by c0lo · 2010-11-21 20:40 · Score: 4, Interesting

Dude, what the fuck, that's only 48 cores. How does that get you anywhere close to 1000?

Well, Watson, that's elementary...

The correct question should have been: "How many watts one needs to dissipate"... because the temperature is given by "How high and still have the transistors working".
With regard to power dissipation: the architecture would have a common component (event passing, RAM fetches, etc.) and N cores. Assuming each core needs to dissipate the same power (say, at peak utilization) and assuming the 25-125 Watt range runs from "1 core used" to "all 48 cores used", some simple linear algebra gives: power dissipated/core approx 2 watts (a bit more actually) with the "common component" eating approx 23 Watts.
Therefore, on top of the computation benefits derived from fully utilizing 1000 cores, one would have a pretty good heat source: 2150 Watts or so. One's choice what to do with it, but it's far too high for a domestic-sized slow cooker (the dishes would come out with a weird burned taste).

Satisfied, now?

If not, to put things in perspective: assuming our ancestors (who could use only horses as a source of power) had wanted to use this computer, they'd need approx. 2.68 horses... but hey, wow... what a delight to play the MMORPG so smooth... especially in "farming/grinding" phases.
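
For anyone who wants to redo the back-of-the-envelope numbers, here is a sketch of the same linear model, P(n) = common + n * per_core, fitted to the two assumed data points (25 W with 1 core, 125 W with 48 cores). The function name and the horsepower constant are additions for illustration.

```python
def fit_power_model(p_one_core: float, p_all_cores: float, n_cores: int):
    """Solve P(n) = common + n * per_core from the 1-core and n-core readings."""
    per_core = (p_all_cores - p_one_core) / (n_cores - 1)
    common = p_one_core - per_core
    return common, per_core

common, per_core = fit_power_model(25.0, 125.0, 48)
watts_1000 = common + 1000 * per_core   # extrapolate to 1000 cores

WATTS_PER_HP = 745.7                    # one mechanical horsepower, in watts
horses = watts_1000 / WATTS_PER_HP
```

Under these assumptions the model gives roughly 2.13 W per core, about 23 W of common overhead and about 2150 W at 1000 cores. That works out closer to 2.9 mechanical horsepower; the 2.68 figure corresponds to a round 2000 W.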

PS. the above computations are meant to be funny and/or an exercise of approximating based on insufficient data and/or vent some frustration caused by "all work and no play", definitely a wasted time... Ah, yes, some karma would be nice, but not mandatory.

--
Questions raise, answers kill. Raise questions to stay alive.

I/O and memory bandwidth by francium+de+neobie · 2010-11-21 20:44 · Score: 3, Insightful

Ok, you can cram 1000 cores into one CPU chip - but feeding all 1000 CPU cores with enough data for them to process and transferring all the data they spit out is gonna be a big problem. Things like OpenCL work now because the high end GPUs these days have 100GB/s+ bandwidth to the local video memory chips, and you're only pulling out the result back into system memory after the GPU did all the hard work. But doing the same thing on a system level - you're gonna have problems with your usual DDR3 modules, your SSD hard disk (even PCI-E based) and your 10GE network interface.
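
To make the scale of the problem concrete, here is a rough sketch that divides a few peak link bandwidths evenly across 1000 cores. The 100 GB/s figure is from the post; the DDR3 and 10GE numbers are theoretical peaks for one DDR3-1333 channel and one 10 Gbit/s link, used purely as illustrative assumptions.

```python
def per_core_bandwidth(total_gb_per_s: float, cores: int) -> float:
    """Bandwidth each core gets if a shared link is divided evenly, in GB/s."""
    return total_gb_per_s / cores

# Illustrative peak figures, not measurements
links = {
    "GPU local memory": 100.0,      # GB/s, figure quoted in the post
    "one DDR3-1333 channel": 10.667,  # GB/s theoretical peak
    "10GE network": 1.25,           # GB/s (10 Gbit/s)
}
shares = {name: per_core_bandwidth(bw, 1000) for name, bw in links.items()}

for name, share in shares.items():
    print(f"{name}: {share * 1000:.2f} MB/s per core")
```

Even the GPU-class figure leaves only about 100 MB/s per core when split 1000 ways, and a single DDR3 channel or a 10GE link leaves far less.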

Deja Vu from a decade ago by Baldrson · 2010-11-21 20:55 · Score: 2, Informative

It seems like I've been here before.

A little while ago you asked Forth (and now colorForth) originator Chuck Moore about his languages, the multi-core chips he's been designing, and the future of computer languages -- now he's gotten back with answers well worth reading, from how to allocate computing resources on chips and in programs, to what sort of (color) vision it takes to program effectively. Thanks, Chuck!

--
Seastead this.

Re:Imagine by visualight · 2010-11-21 20:58 · Score: 2, Interesting

http://www.sgi.com/products/servers/altix/uv/

2,048 cores (256 sockets) and 16TB of memory, one OS image.

--
Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.

This is NOT a cache-coherent/SMP machine! by Terje+Mathisen · 2010-11-21 20:58 · Score: 2, Insightful

The key difference between this research chip and the other Multicore chips Intel have worked on, like Larrabee, is that it is explicitly NOT cache coherent, i.e. it is a cluster on chip instead of a single-image multi-processor.

This means, among many other things, that you cannot load a single Linux OS across all the cores, you need a separate executive on every core.

Compare this with the 7-8 Cell cores in a PS3.

Terje

--
"almost all programming can be viewed as an exercise in caching"

Remember the last couple of times this happened? by Required+Snark · 2010-11-21 21:02 · Score: 5, Informative

This is at least the third time that Intel has said that it is going to change the way computing is done.

The first time was the i432 http://en.wikipedia.org/wiki/Intel_iAPX_432 Anyone remember that hype? Got to love the first line of the Wikipedia article "The Intel iAPX 432 was a commercially unsuccessful 32-bit microprocessor architecture, introduced in 1981."

The second time was the Itanium (aka Itanic) that was going to bring VLIW to the masses. Check out some of the juicy parts of the timeline also over on Wikipedia http://en.wikipedia.org/wiki/Itanium#Timeline

1997 June: IDC predicts IA-64 systems sales will reach $38bn/yr by 2001

1998 June: IDC predicts IA-64 systems sales will reach $30bn/yr by 2001

1999 October: the term Itanic is first used in The Register

2000 June: IDC predicts Itanium systems sales will reach $25bn/yr by 2003

2001 June: IDC predicts Itanium systems sales will reach $15bn/yr by 2004

2001 October: IDC predicts Itanium systems sales will reach $12bn/yr by the end of 2004

2002 IDC predicts Itanium systems sales will reach $5bn/yr by end 2004

2003 IDC predicts Itanium systems sales will reach $9bn/yr by end 2007

2003 April: AMD releases Opteron, the first processor with x86-64 extensions

2004 June: Intel releases its first processor with x86-64 extensions, a Xeon processor codenamed "Nocona"

2004 December: Itanium system sales for 2004 reach $1.4bn

2005 February: IBM server design drops Itanium support

2005 September: Dell exits the Itanium business

2005 October: Itanium server sales reach $619M/quarter in the third quarter.

2006 February: IDC predicts Itanium systems sales will reach $6.6bn/yr by 2009

2007 November: Intel renames the family from Itanium 2 back to Itanium.

2009 December: Red Hat announces that it is dropping support for Itanium in the next release of its enterprise OS

2010 April: Microsoft announces phase-out of support for Itanium.

So how do you think it will go this time?

--
Why is Snark Required?

cue kilocore debates by bingoUV · 2010-11-21 21:31 · Score: 2, Interesting

Do 1024 cores constitute a kilocore? Or 1000? I'd love to see that debate move from hard disks to processors.

--
Bingo Dictionary - Pragmatist, n. A myopic idealist.

Re:Imagine by pyalot · 2010-11-21 21:33 · Score: 4, Informative

You have a supercomputer on your desk right now. It's called a "GPU", and it most likely sports several hundred cores. Oh, and the killer app you mean? That's whichever latest DX11/OpenGL 4 game you prefer.

--
Experiments and other stuff

Re:does it run Linux - yea but it is "boring" by vojtech · 2010-11-21 21:34 · Score: 4, Informative

The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86. And there are even a number of non-x86 machines today that reach these sizes in a cache-coherent (ccNUMA) manner that Linux works well on. You still have to be careful with application design, though, because it's fairly easy to hit bottlenecks either in the application or in the kernel that will limit scalability. Most common workloads are already seeing

Re:Imagine by Siffy · 2010-11-21 21:38 · Score: 2, Interesting

Why not? He/she built a cluster for no use at all other than learning and fun. I can easily see the "use" for 1k cores with Intel's apparent interest to get into the 3d market or at least destroy Nvidia and ATI (something AMD has already done in name but that's beside the point). For clusters it's a no-brainer to keep adding cores if you can increase performance per watt ratio with each additional core. For desktops there likely will be a point where enough is enough, but I disagree that we've passed it. Software designers are still keeping up quite quickly with any headroom new hardware creates.

Re:Imagine by bloodhawk · 2010-11-21 22:14 · Score: 2, Informative

Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.

I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available. Linux can only go to 256 cores. Windows 2008 tops out at 64. But hey, if they did come to market, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).

Ummm, no. Windows 2008 can handle 64 SOCKETS; it currently scales to 256 cores.

Re:Workaround, yeah by MichaelSmith · 2010-11-21 22:34 · Score: 4, Interesting

In my field it would be real time conflict detection between aircraft. The better your conflict detection, the more aircraft you can pack in to small volumes of space. There is a lot of money in that.

--
http://michaelsmith.id.au

Real time raytracing of course! by Joce640k · 2010-11-21 22:34 · Score: 2, Informative

Isn't that Intel's pet project for the last decade?

--
No sig today...

Re:Imagine by nikanth · 2010-11-21 22:50 · Score: 2, Informative

Linux can only go to 256 cores. Windows 2008 tops out at 64.

Linux supports more than 256 cores.

MAINLINE:

Maximum number of CPUs / CONFIG_NR_CPUS:

This allows you to specify the maximum number of CPUs which this kernel will support. The maximum supported value is 512 and the minimum value which makes sense is 2. This is purely to save memory - each supported CPU adds approximately eight kilobytes to the kernel image.
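As a rough worked example of the memory cost described in that help text, here is a small Python sketch. The figure of eight kilobytes per CPU comes from the text above; the function name is made up for illustration, and the real per-CPU cost will vary with kernel version and configuration.

```python
KB_PER_CPU = 8  # "approximately eight kilobytes" per supported CPU, per the help text


def image_overhead_kb(nr_cpus):
    """Approximate extra kernel image size, in KB, for a given CONFIG_NR_CPUS."""
    return nr_cpus * KB_PER_CPU


for nr_cpus in (2, 512, 4096):
    print(nr_cpus, "CPUs ->", image_overhead_kb(nr_cpus), "KB")
```

At the quoted maximum of 512 this is about 4 MB, which is why the option exists at all on small machines.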

I know SGI has systems running 4096 CPUs with SUSE Linux.

Re:Imagine by bertok · 2010-11-21 23:10 · Score: 4, Interesting

depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL

And I believe you can crank that dial all the way up

Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).

Don't you worry, the GHz war is not done!

There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.

Re:Imagine by TheRaven64 · 2010-11-21 23:55 · Score: 4, Interesting

Pretty much anything that I've written in Erlang uses (at least) a few thousand concurrent processes. I've never tried running it on more than a 64-core machine, but when I moved stuff from my single-core laptop to a 64-core SGI machine the load was pretty evenly distributed.

It's pretty easy to write concurrent code that scales as long as you respect one rule: No data may be both mutable and aliased. You can do this in object-oriented languages with the actor model, but languages like Erlang enforce it for you (at the cost of a few redundant copies).
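A minimal sketch of that rule in Python (not Erlang), using threads and queues to stand in for actors. The names counter_actor and run are hypothetical; the point is only that the mutable total is owned by exactly one thread, and everything that crosses between threads is an immutable message.

```python
import queue
import threading


def counter_actor(inbox, outbox):
    """Owns `total`; no other thread ever holds a reference to it."""
    total = 0  # mutable, but not aliased: private to this actor
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel message: report the result and stop
            outbox.put(total)
            return
        total += msg  # messages are immutable ints


def run(n_senders=4, per_sender=1000):
    inbox, outbox = queue.Queue(), queue.Queue()
    actor = threading.Thread(target=counter_actor, args=(inbox, outbox))
    actor.start()

    def sender():
        # Senders share only the queue, an explicitly synchronised object.
        for _ in range(per_sender):
            inbox.put(1)

    senders = [threading.Thread(target=sender) for _ in range(n_senders)]
    for t in senders:
        t.start()
    for t in senders:
        t.join()
    inbox.put(None)
    actor.join()
    return outbox.get()
```

No locks appear in the user code, and the count comes out the same however the senders are scheduled, because nothing mutable is ever shared.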

--
I am TheRaven on Soylent News

In the near future... by rebelwarlock · 2010-11-21 23:59 · Score: 4, Funny

I will need to buy a pair of sunglasses, and crush them when I find that the new Intel processor has over 9000 cores.

Re:does it run Linux - yea but it is "boring" by TheRaven64 · 2010-11-22 00:02 · Score: 2, Informative

The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86

That's kind-of true, but quite misleading. 8192 is the hard limit, but scheduler and related overhead means that the performance gets pretty poor long before then. Please don't cite the big SGI and IBM machines as counter examples. The SGI machines effectively run a cluster OS, but with hardware distributed shared memory. They are 'single system image' in that they appear to be one OS to the user, but each board has its own kernel, I/O peripherals and memory and works largely independently except when accessing data from a remote node (handled by the hardware) or migrating processes to another node (kernel initiates this when it's too heavily loaded on a single node).

The big IBM machines have a similar design, although their big supercomputers don't actually run Linux at all in a meaningful sense. They run no OS on the processors that do the real work - the big compute jobs run without anything interrupting them or competing for CPU time - and run Linux on the coprocessors that handle I/O.

--
I am TheRaven on Soylent News

Re:Why is 8192 a hard limit? by TheRaven64 · 2010-11-22 00:36 · Score: 3, Informative

The kernel needs some data structures per processor. 8192 means it needs a 13-bit index for them (8192 = 2^13). I'm not certain about the Linux kernel, but in other kernels it's quite common for this to be squeezed in to other values for various reasons, so adding more processors requires you to increase the size of other data structures (often ones designed to be exactly one word long). Not impossible, but more effort than just changing a constant.

The reason for the limit in the Windows NT kernel is that various things use bit masks with processor IDs as the indexes. For example, when defining a processor affinity set you have an n-bit bitfield (one bit per supported processor), with the bit set if the thread is allowed to run on that processor. At 256 bits (the current limit for Windows), these are already pretty large to scan (especially since the kernel isn't allowed to use SSE instructions, meaning that it potentially takes four 64-bit lsb-tests to find the next core to use).
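To make the scan concrete, here is a sketch in Python of a 256-bit affinity mask stored as four 64-bit words, with a lowest-set-bit search over them. This is only an illustration of the idea, not how the NT kernel actually lays out or walks its masks; make_mask and next_allowed_cpu are invented names.

```python
WORD_BITS = 64
WORDS = 4  # 4 x 64 = 256 processors


def make_mask(allowed_cpus):
    """Build the mask: bit n is set if the thread may run on processor n."""
    words = [0] * WORDS
    for cpu in allowed_cpus:
        words[cpu // WORD_BITS] |= 1 << (cpu % WORD_BITS)
    return words


def next_allowed_cpu(words, start=0):
    """Return the lowest allowed processor id >= start, or None if there is none."""
    first = start // WORD_BITS
    for w in range(first, WORDS):
        word = words[w]
        if w == first:
            word &= ~((1 << (start % WORD_BITS)) - 1)  # ignore bits below start
        if word:
            lsb = word & -word  # isolate the lowest set bit
            return w * WORD_BITS + lsb.bit_length() - 1
    return None


mask = make_mask([5, 70, 200])
print(next_allowed_cpu(mask), next_allowed_cpu(mask, 6), next_allowed_cpu(mask, 71))
```

In the worst case the loop touches all four words, which is the "4 lsb-tests" cost; every doubling of the processor limit doubles that.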

--
I am TheRaven on Soylent News

Re:RISC has downsides... by kohaku · 2010-11-22 00:41 · Score: 2, Informative

It's more efficient to have instructions which take maybe a bit more bits, but on average they don't really take that much more and have microcode on-die to handle them

Well that would be true, but the really complex x86 instructions are rarely used, so you're not really adding much in the way of code density, and you have to add a lot of hardware complexity to decode it. Not only that, more complex instructions mean bigger pipelines which mean bigger branch penalties.

Re:does it run Linux - yea but it is "boring" by vojtech · 2010-11-22 01:06 · Score: 2, Interesting

Well, well, I hit the Submit button too soon. Anyway, most common workloads are already seeing decreasing benefits around 32 parallel threads.

For us not at SC10 by Eladith · 2010-11-22 01:10 · Score: 3, Informative

The paper referenced in the article can be found here.

Fascinating that MPI works that well unmodified.

Re:Imagine by chrysrobyn · 2010-11-22 01:37 · Score: 5, Interesting

Don't you worry, the GHz war is not done! There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum [wikipedia.org] digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.

The GHz war is over. The speed of light won. A long time ago, it stopped being "all about the transistor" and started being "all about the wires". IBM won the race to copper in 180nm (back when it was 0.18um), and that helped make those technologies even better, but about the time we hit 90nm, semiconductors were "fast enough", or even by some measurements stopped being able to speed up. Since then, almost all speed increases have been largely (but not exclusively) due to the transistors getting smaller, reducing the distance wires need to go.

The RC delay of wires is the major problem. R isn't going to be getting much better than copper. Silver has a lower resistance by a little bit, but it's too reactive to be used anywhere real. In these geometries, any alloy would be insufficiently mixable to be reliable, to say nothing about more exotic materials (like ceramics). There's some room for improvement in the dielectric (the "C"), but by the time you make a box with corners covering water permeability, thermal coefficient of expansion close to the wires, mechanical properties friendly to sub micron manufacturing, you have to concede you're not going to be able to get more than 20% faster there (and that we could dispute separately).
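A back-of-envelope sketch of why wire length hurts so much, in Python. It uses the commonly quoted approximation that an unbuffered distributed RC line has a delay of roughly 0.38 * r * c * L^2, where r and c are resistance and capacitance per unit length. The per-micron values below are made-up placeholders, not figures for any real process, and the function names are invented.

```python
def wire_delay_s(length_um, r_ohm_per_um=1.0, c_f_per_um=0.2e-15):
    """Approximate delay of an unbuffered wire (distributed RC model).

    r_ohm_per_um and c_f_per_um are illustrative assumptions only.
    """
    return 0.38 * r_ohm_per_um * c_f_per_um * length_um ** 2


def fraction_of_cycle(length_um, clock_hz=3e9):
    """How much of one clock period the wire alone consumes."""
    return wire_delay_s(length_um) * clock_hz


for length in (100, 1000, 2000):
    print(length, "um:", wire_delay_s(length), "s,",
          fraction_of_cycle(length), "of a 3 GHz cycle")
```

The absolute numbers depend entirely on the assumed r and c; the thing to notice is the L squared term, so doubling the distance quadruples the delay unless you spend area and power on rebuffering.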

Take a cache. The slowest path is having a memory cell read. That tiny little device needs to have a measurable change in voltage on the bitlines, and be sensed by a sensing structure. That sensing structure has nothing to do with storage, so it's pure overhead and thus you want as few of them as possible. Can you have it 16 bits away? 32? The days are gone that it was 64 bits away for any meaningful performance. There's nothing you can do to the characteristics of that little device (which needs to be minimum feature size to maximize the density of the cache) to dominate over the characteristics of the bitline it's trying to affect.

Take a data path. Even if 95% of your data is highly predictable, easily pipelined stuff with local signals, your critical path is going to involve signals from other areas of the chip, and they're going to have to be rebuffered and trucked from hundreds of microns away. No giant buffer in the history of man can dominate over a long distance wire. The signal will show up "eventually".

3GHz is a good place to stop. We make it to 4GHz with compromises in power, but beyond that you're dedicating so much of your chip to rebuffering that you're blowing a lot of power on it. At that point, your pipeline is so many stages that branch mispredicts are very painful. You're devoting so much of your cycle time to setup and holds for your latches that you're going to be embarrassed at how little work you can do in each cycle.

1 THz clock speeds are on their way, and maybe even higher. But they're not useful to CPUs or GPUs. They're useful for more exotic applications, primarily technology demonstrations.

Re:gpu's have been doing this for years... by Bengie · 2010-11-22 01:45 · Score: 2, Informative

"Lazy CPU designers" hah!

GPUs are severely limited on their types of tasks. Instead of a 1600 core GPU, pretend your CPU has a single large SIMD register that can hold 1600 floats. Now, it would be great at crunching large matrices of floats and utterly suck at everything else. That's a GPU in a nutshell. GPUs are absolutely horrible at branches. If one core takes a branch, every core in that core's group must stall and wait for the branch to finish. All cores must be working on the same instruction at the same time and branches mess with that. GPUs != CPUs
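A toy model of the branch problem, in Python. It simulates one lock-step group running "if v < 0: negate, else: double": if the lanes disagree, the group pays for both sides, with the inactive lanes masked off. The function name and the step costs are invented for illustration and do not correspond to any real GPU.

```python
def run_group(values, cost_then=4, cost_else=4):
    """Simulate one lock-step group; returns (results, steps spent)."""
    take_then = [v < 0 for v in values]
    results = list(values)
    steps = 0
    if any(take_then):
        # The whole group steps through the 'then' side; other lanes idle.
        steps += cost_then
        results = [-v if m else r
                   for v, m, r in zip(values, take_then, results)]
    if not all(take_then):
        # The whole group steps through the 'else' side; other lanes idle.
        steps += cost_else
        results = [r if m else v * 2
                   for v, m, r in zip(values, take_then, results)]
    return results, steps


print(run_group([1, 2, 3, 4]))    # lanes agree: one side only
print(run_group([-1, 2, -3, 4]))  # lanes diverge: both sides are paid for
```

A single stray negative value in a large group is enough to double the cost for every lane in it, which is the stall described above.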

When Intel last talked about their 80 core CPUs, they talked about getting rid of cache-coherency in order to scale, AMD also recognizes this. This would mean OSs and most Apps would NOT be backwards compatible even if still using x86 instructions. Although, it would be possible to run apps in a VM that emulated cache-coherency.

Re:Nobody wants to put more cores on a chip? by Arlet · 2010-11-22 03:43 · Score: 2, Interesting

The '93 era Pentium they're talking about only has 3 million transistors, and only a fraction are needed to handle the x86 instruction set. Current transistor count goes into the billions, so as far as real estate goes, you can put 1000 Pentium class cores on a single die, despite the x86 translations.
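The arithmetic behind that claim, as a quick Python sketch. The 3 million figure is the one quoted above; the sketch counts core logic only and deliberately ignores caches, interconnect and I/O, which would take a large share of a real die. The names are made up for illustration.

```python
PENTIUM_TRANSISTORS = 3_000_000  # per-core figure quoted in the comment above


def transistors_for(cores):
    """Transistors needed for this many Pentium-class cores, logic only."""
    return cores * PENTIUM_TRANSISTORS


def cores_that_fit(transistor_budget):
    """Pentium-class cores that fit in a budget, ignoring caches and interconnect."""
    return transistor_budget // PENTIUM_TRANSISTORS


print(transistors_for(1000))          # 3000000000
print(cores_that_fit(1_000_000_000))  # 333
```

So 1000 such cores need about 3 billion transistors before any cache or network is added, which is within reach of a die whose count is "in the billions".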

Of course, the whole concept of a 1000 cores running on a single die is only going to serve a small niche of applications.

Biggest problem and a fix... by Panaflex · 2010-11-22 03:44 · Score: 2, Interesting

IMHO the biggest problem with these multi-core chips is the lock latency. Locking in heap all works great, but a shared hw register of locks would save a lot of cache coherency and MMU copies.

A 1024 slot register with instruction support for mutex and read-write locks would be fantastic.

I'm developing 20+Gbps applications - we need fast locks and low latency. Snap snap!!!

--
I said no... but I missed and it came out yes.

Paraphrasing Torvalds... by menkhaura · 2010-11-22 03:47 · Score: 2, Insightful

Talk is cheap, show me the cores.

--
Stupidity is an equal opportunity striker.
Fellow slashdotter Bill Dog

Re:Imagine by TheRaven64 · 2010-11-22 04:22 · Score: 2, Interesting

I don't know what extra detail you need - the rule should be pretty self explanatory. If something is shared between two or more threads, it should be immutable. If something is mutable, only one thread / process should hold references to it.

The only exception to this rule is explicitly synchronised communication objects (message queues, process handles, and suchlike). If you follow this rule, then the only concurrency problems that you will have are caused by high-level design problems, rather than by low-level implementation problems.

Erlang enforces this by only having one mutable object: the process dictionary, which is only accessible by the process that owns it. Everything else is immutable.

--
I am TheRaven on Soylent News

Whats the point if Photoshop is only 2 processors by cdpage · 2010-11-22 04:42 · Score: 2, Insightful

Photoshop has been stuck at 2 processors for way too long. Software companies have been lagging behind hardware far too long. Until I see more software taking advantage of more than 1 or 2 cores... I'm not wasting money on them.

Benchmarks by Chemisor · 2010-11-22 04:47 · Score: 2, Insightful

According to benchmarks, a functional language like Erlang is slower than C++ by an order of magnitude. Sure, it can distribute processing over more cores, which is the only thing that enabled it to win one of the benchmarks. I suspect that was only because it used a core library function that was written in C. So no, if you want to write code with acceptable performance, DON'T use a functional language. All CPU intensive programs, like games, are written in C or C++; think about that.

Re:Imagine by PingPongBoy · 2010-11-22 05:39 · Score: 2, Funny

Why would you care to see one on your desktop? Do you have any use for one?

You got that right. I've never used more than 639 K of RAM either.

--
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.

Re:does it run Linux - yea but it is "boring" by Zed+Pobre · 2010-11-22 08:59 · Score: 2, Funny

Most common workloads are already seeing

What? Tell me. WHAT ARE THEY SEEING?

... problems with data truncation.

Re:Imagine by Vegemeister · 2010-11-22 13:44 · Score: 2, Funny

Dammit dude. Blow the dust out of your case.
