Intel Talks 1000-Core Processors

Jeez... by Joe Snipe · 2010-11-21 18:54 · Score: 5, Funny

I hope he never works for Gillette.

--
Sometimes, life itself is sarcasm...

Re:Jeez... by Monkey-Man2000 · 2010-11-21 19:10 · Score: 3, Funny

I hope he never works for Gillette.
Obligatory Onion

--
This post was generated by a Cadre of Uber Monkeys for Monkey-Man2000 (603495).
Re:Jeez... by monkeySauce · 2010-11-21 19:15 · Score: 3, Funny

Other way around; he used to work for Gillette. He left after they cancelled his 1000-blade razor project.
Re:Jeez... by NewbieProgrammerMan · 2010-11-21 19:22 · Score: 1

No worries, I think he works for Spishak now.

--
[b.belong('us') for b in bases if b.owner() == 'you']
Re:Jeez... by Slashcrunch · 2010-11-21 19:43 · Score: 4, Funny

Other way around; he used to work for Gillette. He left after they cancelled his 1000-blade razor project.
Yes, I also heard about the 1000-blade project getting cut...
Re:Jeez... by Bill Dog · 2010-11-21 21:15 · Score: 4, Funny

...in the nick of time.

--
Attention zealots and haters: 00100 00100
Re:Jeez... by Aceticon · 2010-11-22 00:28 · Score: 1

Somebody should've told him that with 1000 cores per die you need fewer blades, not more.
Re:Jeez... by kostmo · 2010-11-22 00:55 · Score: 1

Insiders say the project was a victim of 'death by 1000 cuts'...
Re:Jeez... by DoofusOfDeath · 2010-11-22 01:49 · Score: 1

He heard Al Qaeda's threat of "death by a thousand paper cuts" and thought, "amateurs."
Re:Jeez... by skiman1979 · 2010-11-22 05:30 · Score: 1

I remember back in the late '90s I was watching Saturday Night Live with my girlfriend. They cut to a 'commercial break' for Gillette where they introduced the Mach 20. They talked about how having 20 blades would make shaving so much better, giving a much closer shave with the layers of skin being removed, etc. The skit itself was funny enough, but I really had a laugh when my girlfriend thought it was real.
"How can they sell a product like THAT?!?!"
She was fine once she realized it wasn't a real commercial.

--
Having a smoking section in a public restaurant is like having a peeing section in a public swimming pool.
Re:Jeez... by An ominous Cow art · 2010-11-22 08:31 · Score: 1

That happens when you're working on the bleeding edge.
Re:Jeez... by ruthless reader · 2010-11-22 08:52 · Score: 1

that would have been a sleek shiny razor with bloody smooth results

does it run Linux - yea but it is "boring" by G3ckoG33k · 2010-11-21 18:56 · Score: 1

From the article: "By installing the TCP/IP protocol on the data link layer, the team was able to run a separate Linux-based operating system on each core. Mattson noted that while it would be possible to run a 48-node Linux cluster on the chip, it 'would be boring.'"

Huh?! Boring?! It would have been a nice first post on Slashdot on the eternal topic - does it run Linux? - to begin with.

Then we have all the programming goodies to follow up with.

Re:does it run Linux - yea but it is "boring" by c0lo · 2010-11-21 19:03 · Score: 1

From the article: "By installing the TCP/IP protocol on the data link layer, the team was able to run a separate Linux-based operating system on each core. Mattson noted that while it would be possible to run a 48-node Linux cluster on the chip, it 'would be boring.'"
Huh?! Boring?! It would have been a nice first post on Slashdot on the eternal topic - does it run Linux? - to begin with.
Then we have all the programming goodies to follow up with.
;) To make things interesting, each of the cores would have to use a public Inet IPv4 address.

--
Questions raise, answers kill. Raise questions to stay alive.
Re:does it run Linux - yea but it is "boring" by RAMMS+EIN · 2010-11-21 19:45 · Score: 4, Interesting

Running Linux on a 48-core system is boring, because it has already been run on a 64-core system in 2007 (at the time, Tilera said they would be up to 1000 cores in 2014; they're up to 100 cores per CPU now).
As far as I know, Linux currently supports up to 256 CPUs. I assume that means logical CPUs, so that, for example, this would support one CPU with 256 cores, or one CPU with 128 cores with two CPU threads per core, etc.

--
Please correct me if I got my facts wrong.
Re:does it run Linux - yea but it is "boring" by davester666 · 2010-11-21 20:00 · Score: 1

Of course, it will support PPP, namely pay-per-processor. You can have the first one cheap, the rest, not so much...

--
Sleep your way to a whiter smile...date a dentist!
Re:does it run Linux - yea but it is "boring" by Lumbre · 2010-11-21 20:01 · Score: 1

Yeah, it's boring with the same architecture we have now. But imagine if someone came up with a creative solution besides the current memory model. Memory management is probably hideous on a 1000-core system; they seemed to pose that lightly in the article.
This might even be a solution for a particular type of dedicated computer, not a personal computer.
Re:does it run Linux - yea but it is "boring" by Metabolife · 2010-11-21 20:58 · Score: 1

The most interesting part to me is how they're actually making a built in router for the chips. The cores communicate through TCP/IP. That's incredible.
Re:does it run Linux - yea but it is "boring" by fractoid · 2010-11-21 21:03 · Score: 1

So what you're getting at here is... you want MORE than one instance of the OS for each core? :P

--
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
Re:does it run Linux - yea but it is "boring" by vojtech · 2010-11-21 21:34 · Score: 4, Informative

The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86. And there are even a number of non-x86 machines today that reach these sizes in a cache-coherent (ccNUMA) manner that Linux works well on. You still have to be careful with application design, though, because it's fairly easy to hit bottlenecks either in the application or in the kernel that will limit scalability. Most common workloads are already seeing
Re:does it run Linux - yea but it is "boring" by zdzichu · 2010-11-21 21:52 · Score: 1

Mainline Linux has supported 4096 CPU cores for almost two years now.

--
:wq
Re:does it run Linux - yea but it is "boring" by houghi · 2010-11-21 21:54 · Score: 1

Boring? Just booting it with all the Tux images (one per core) would be sooo coool.

--
Don't fight for your country, if your country does not fight for you.
Re:does it run Linux - yea but it is "boring" by Bert64 · 2010-11-21 23:55 · Score: 1

Linux supports many more CPUs/cores than that...
If you can afford one, SGI will happily sell you an Altix UV1000 which has 512 sockets and up to 2048 cores, running Linux.
http://www.sgi.com/products/servers/altix/uv/
It also looks like they're planning to make even bigger versions of this system in the future.
I believe SGI has a few of these systems, or even bigger versions of this in the top500 list.

--
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
Re:does it run Linux - yea but it is "boring" by TheRaven64 · 2010-11-22 00:02 · Score: 2, Informative

The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86

That's kind-of true, but quite misleading. 8192 is the hard limit, but scheduler and related overhead means that the performance gets pretty poor long before then. Please don't cite the big SGI and IBM machines as counter examples. The SGI machines effectively run a cluster OS, but with hardware distributed shared memory. They are 'single system image' in that they appear to be one OS to the user, but each board has its own kernel, I/O peripherals and memory and works largely independently except when accessing data from a remote node (handled by the hardware) or migrating processes to another node (kernel initiates this when it's too heavily loaded on a single node).
The big IBM machines have a similar design, although their big supercomputers don't actually run Linux at all in a meaningful sense. They run no OS on the processors that do the real work - the big compute jobs run without anything interrupting them or competing for CPU time - and run Linux on the coprocessors that handle I/O.

--
I am TheRaven on Soylent News
Re:does it run Linux - yea but it is "boring" by __aayejd672 · 2010-11-22 00:51 · Score: 1

I think he hit one of those bottlenecks in the kernel
Re:does it run Linux - yea but it is "boring" by vojtech · 2010-11-22 01:06 · Score: 2, Interesting

Well, well, I hit the Submit button too soon. Anyway, most common workloads are already seeing decreasing benefits around 32 parallel threads.
Re:does it run Linux - yea but it is "boring" by pitchpipe · 2010-11-22 07:10 · Score: 1

...because it's fairly easy to hit bottlenecks either in the application or in the kernel that will limit scalability. Most common workloads are already seeing... [CPU limit reached. Reboot? (Y/N)]

--
Look where all this talking got us, baby.
Re:does it run Linux - yea but it is "boring" by Zed Pobre · 2010-11-22 08:59 · Score: 2, Funny

Most common workloads are already seeing
What? Tell me. WHAT ARE THEY SEEING?
... problems with data truncation.
Re:does it run Linux - yea but it is "boring" by RAMMS+EIN · 2010-11-22 09:10 · Score: 1

``As Seymour Cray said: fast CPUs are easy, it's making fast /systems/ that's hard. You need good I/O to keep the CPUs fed.''
Absolutely. In almost everything I do that is slow, either I/O is the bottleneck, or the software is inefficient (i.e. the same operations could be implemented with much faster algorithms). This is why I rarely buy the fastest CPU I can get: the CPU isn't the bottleneck, so it's better to save there and spend where it's needed.

--
Please correct me if I got my facts wrong.

Message passing between cores? Hmm... by PaulBu · 2010-11-21 19:00 · Score: 3, Interesting

Are they trying to reinvent Transputer? :)

But yes, I am happy to see Intel pushing it forward!

Paul B.

Re:Message passing between cores? Hmm... by TinkersDamn · 2010-11-21 20:37 · Score: 2, Interesting

Yes, I've been wondering the same thing. Transputers contained key ideas that seem to be coming around again...
But a more crucial thing might be how much heat can you handle on one chip? These guys are already at 25-125 watts, likely depending on how many cores are actually turned on. After all they're playing pretty hefty heat management tricks on current i7's and Phenom's.
http://techreport.com/articles.x/15818/2
What use are 48 cores, let alone 1000 if they're all being slowed down to 50% or whatever by heat and power juggling?
Re:Message passing between cores? Hmm... by Anne Thwacks · 2010-11-21 23:35 · Score: 3, Informative

They were, apart from the comms protocol, which was a pile of poo.
IF YOU GOT A COMMS ERROR, THE ONLY RECOVERY MECHANISM WAS A TOTAL SYSTEM REBOOT.
That is as crap as you can get! TCP/IP might be an improvement, but HDLC would have cracked the Transputer's problems, and it was already over 15 years old when the Transputer was invented.
Yes I did build a Transputer based system, and yes it did work. (but...)

--
Sent from my ASR33 using ASCII
Re:Message passing between cores? Hmm... by Joce640k · 2010-11-22 00:16 · Score: 2, Informative

I had a board with 4 T800s in my 286 PC, I wrote a raytracer for it.
The chips were OK but the compilers and development kit were terrible.

--
No sig today...
Re:Message passing between cores? Hmm... by vlm · 2010-11-22 00:46 · Score: 1

What use are 48 cores, let alone 1000 if they're all being slowed down to 50% or whatever by heat and power juggling?
No problemo, they'll be limited by external bandwidth and bulk storage bandwidth.
No matter if you use 1 core to decode all the frames or 600 cores each decoding one frame of video for the next ten seconds, you'll still only output one frame every 1/60 of a second, so heat load should remain constant.
In a similar way, you can only push so many requests thru an ethernet port, or so many HD drive accesses, etc.
The idea of being able to run a processor at 100% capacity forever might simply die. Each core will be thermally designed to sleep 99% of the time, although I guarantee marketing will focus on what all 1000 cores could theoretically do at magical 100% utilization for a couple milliseconds.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Re:Message passing between cores? Hmm... by ultranova · 2010-11-22 09:45 · Score: 1

No problemo, they'll be limited by external bandwidth and bulk storage bandwidth.

They'll be limited by memory bandwidth. Even current computers already are, and have been for quite some time. Because of that, and the heat problem, I suspect a future PC will be a NUMA cluster.

No matter if you use 1 core to decode all the frames or 600 cores each decoding one frame of video for the next ten seconds, you'll still only output one frame every 1/60 of a second, so heat load should remain constant.

Of course, that means that even current computers are plenty fast enough to do so, so it doesn't really make a good example.

The idea of being able to run a processor at 100% capacity forever might simply die. Each core will be thermally designed to sleep 99% of the time, although I guarantee marketing will focus on what all 1000 cores could theoretically do at magical 100% utilization for a couple milliseconds.

Unfortunately, that means that this processor is no more powerful than a properly-cooled 10-core machine. Also, having neighbouring cores activate and deactivate constantly will cause thermal stresses on the die, which distort it and play havoc on the fine-scale structures needed to fit all those cores on it.
Of course, we could simply develop more efficient transistors, or at least ones that can take more heat. Forget LEDs, the future CPU will glow on its own :).

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

Could be good for games using raytracing by mentil · 2010-11-21 19:01 · Score: 4, Insightful

This is for server/enterprise usage, not consumer usage. That said, it could scale to the number of cores necessary to make realtime raytracing work at 60fps for computer games. Raytracing could be the killer app for cloud gaming services like OnLive, where the power to do it is unavailable for consumer computers, or prohibitively expensive. The only way Microsoft etc. would be able to have comparable graphics in a console in the next few years is if it were rental-only like the Neo-Geo originally was.

--
Corruption is convincing someone that the selfless ideal is the same as their selfish ideal.

Re:Could be good for games using raytracing by xtal · 2010-11-21 22:56 · Score: 1

Yay! You've just guaranteed me 1000 core video cards for 2011. :D

--
..don't panic
Re:Could be good for games using raytracing by Narishma · 2010-11-22 01:23 · Score: 1

I've never understood what benefits raytracing has for gaming. Every demo of it I've seen has looked very poor compared to how it's done currently. It pretty much always involves way too many mirrors and shiny surfaces which look too synthetic and aren't of much use in an actual game anyway.

--
Mada mada dane.
Re:Could be good for games using raytracing by level_headed_midwest · 2010-11-22 03:32 · Score: 1

I'd say that's accurate since NVIDIA's latest behemoth, the GTX580, has 512 SIMD cores. The next die shrink ought to bring the counts to near 1000 SIMD cores.

--
Just "gittin-r-done," day after day.
Re:Could be good for games using raytracing by LUH 3418 · 2010-11-22 05:21 · Score: 1

The benefit is that raytracing is a more natural way to do 3D rendering (by simulating light). It basically gives you a unified model of rendering. For programmers, this means almost every effect can be done very simply, and in a more physically realistic manner. Shiny surfaces, shadows, mirrors, are actually trivial to do with raytracing, but with current hardware (which performs rasterization), those effects are all hard to do, usually involving multiple rendering passes and dirty hacks. The end result is also less realistic. One problem that is *very hard* to tackle (I find) using rasterization, is real-time lighting, and again, it's very easy to do with raytracing.

Current efforts at real-time raytracing are limited because there isn't that much research into the topic, compared to the billions of R&D that went into rasterizing 3D accelerators. Lots of shiny surfaces is also a combination of poor artistic choices and trying too hard to demo the technical capabilities available. What I've seen involved either doing all the work on the CPU, or designing custom FPGAs to do hardware acceleration with a fraction of the computational power that a modern GPU has.

Still, in the long run, raytracing is *the way* to get the most realistic graphics possible. Raytracing is what's used by the the most realistic software rendering packages available today. Not saying you'll have graphics like these in real-time super soon, but it's worth looking forward to:

Rendered using VRay
Mental Ray
Mental Ray
VRay
Re:Could be good for games using raytracing by fiannaFailMan · 2010-11-22 06:30 · Score: 1

This is for server/enterprise usage, not consumer usage.
640K should be enough for everybody.

--
Drill baby drill - on Mars
Re:Could be good for games using raytracing by iinlane · 2010-11-22 07:32 · Score: 1

The cores are overkill for raytracing; the stream processors found in current-gen video cards are more suitable for the task. 1000 cores won't be enough, as current video cards already support that many cores on a single card, but it's not nearly enough for anything but technology demos.
Re:Could be good for games using raytracing by ultranova · 2010-11-22 09:51 · Score: 1

Yay! You've just guaranteed me 1000 core video cards for 2011. :D

Radeon 5870 already has 1600 shader cores.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

Obligatory XKCD by Anonymous Coward · 2010-11-21 19:02 · Score: 2, Funny

http://xkcd.com/619/

Temperature? by garompeta · 2010-11-21 19:03 · Score: 1

Would the temperature rise 1000 times more than now?
(Would we need cryogenic coolers?)

Re:Temperature? by c0lo · 2010-11-21 19:13 · Score: 1

TFA:

The chip, first fabricated with a 45-nanometer process at Intel facilities about a year ago, is actually a six-by-four array of tiles, each tile containing two cores. It has more than 1.3 billion transistors and consumes from 25 to 125 watts.

--
Questions raise, answers kill. Raise questions to stay alive.
Re:Temperature? by TapeCutter · 2010-11-21 19:50 · Score: 2, Funny

1.3 billion transistors!!! When I was a kid we had 9 and you could open the box and count 'em.

--
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
Re:Temperature? by c0lo · 2010-11-21 20:40 · Score: 4, Interesting

Dude, what the fuck, that's only 48 cores. How does that get you anywhere close to 1000?
Well, Watson, that's elementary...
- The correct question should have been: "How many watts one needs to dissipate"... because the temperature is given by "How high and still have the transistors working".
- With regard to the power dissipation: the architecture would have a common component (event passing, RAM fetches, etc) and N cores. Assuming each core needs to dissipate the same power (say, at peak utilization) and assuming the 25-125 Watts being the range defined by "1 core used" to "all 48 cores used", some simple linear algebra gives: power dissipated/core approx 2 watts (a bit more actually) with the "common component" eating approx 23 Watts.
  Therefore, on top of the computation benefits derived from fully utilizing 1000 cores, one would have a pretty good heat source: 2150 Watts or so. One's choice what to do with it, but it's far too high for a domestic-sized slow cooker (the dishes would come with a weird burned taste).
Satisfied, now?
If not, to put the things in perspective, assuming our ancestors (that could use only horses as a source of power) would have wanted to use this computer, they'd need approx. 2.9 horses... but hey, wow... what a delight to play the MMORPG so smooth... especially in "farming/grinding" phases.
PS. the above computations are meant to be funny and/or an exercise of approximating based on insufficient data and/or vent some frustration caused by "all work and no play", definitely a wasted time... Ah, yes, some karma would be nice, but not mandatory.
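
The back-of-the-envelope estimate above can be reproduced with a short sketch. It assumes, as the comment does, that 25 W corresponds to one active core and 125 W to all 48 cores, and that power grows linearly with core count. The function names are made up for illustration; none of this comes from Intel.

```python
# Linear power model from the comment: P(n) = common + n * per_core.
# Assumption (the commenter's, not Intel's): 25 W means 1 active core,
# 125 W means all 48 cores active.

def fit_power_model(n_low, p_low, n_high, p_high):
    """Solve the two-point linear system for (common, per_core) in watts."""
    per_core = (p_high - p_low) / (n_high - n_low)
    common = p_low - n_low * per_core
    return common, per_core


def chip_power(n_cores, common, per_core):
    """Estimated dissipation in watts for n_cores active cores."""
    return common + n_cores * per_core


common, per_core = fit_power_model(1, 25.0, 48, 125.0)
estimate_1000 = chip_power(1000, common, per_core)

print(f"per core:   {per_core:.2f} W")
print(f"common:     {common:.2f} W")
print(f"1000 cores: {estimate_1000:.1f} W")
```

The result lands on "a bit more than 2 W" per core and roughly 23 W of shared overhead, which is where the figure of about 2150 W for 1000 cores comes from. A real chip would not scale this linearly, so treat it as the same joke with the arithmetic spelled out.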
--
Questions raise, answers kill. Raise questions to stay alive.
Re:Temperature? by level_headed_midwest · 2010-11-22 03:34 · Score: 1

1.3 billion transistors isn't even all that impressive any more. GPUs crossed that line years ago and a current top-line GPU has about 3 billion transistors. CPUs are also well above that, with Intel's 8-core Xeon 7500s clocking in at 2.3 billion transistors and AMD's 12-core Opteron 6100s having 1.81 billion transistors.

--
Just "gittin-r-done," day after day.
Re:Temperature? by iinlane · 2010-11-22 07:37 · Score: 1

Therefore, on top of the computation benefits derived from fully utilizing 1000 cores, one would have a pretty good heat source: 2150 Watts or so.
Install 10 radiators and call it Veyron because it's soo fast!
Re:Temperature? by TapeCutter · 2010-11-22 20:08 · Score: 1

It was a joke. In the '60s people advertised "transistor radios" to distinguish them from the old valve technology and often told you how many transistors there were inside. A "9 transistor radio" was a common selling point, similar to how people still sometimes advertise how many "jewels" are inside a clockwork wrist watch. The practice died out when ICs became common in the '70s.

--
And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.

Bring out your Memes! by SixDimensionalArray · 2010-11-21 19:07 · Score: 4, Funny

Imagine a Beowulf cluster of th^H^H^H

Ah, forget it, the darn thing practically is one already! :/

"Imagine exactly ONE of those" just doesn't sound the same.

Re:Bring out your Memes! by rrohbeck · 2010-11-21 20:55 · Score: 3, Funny

I've said it for years: 640K cores ought to be enough for anybody.

--
thegodmovie.com - watch it
Re:Bring out your Memes! by rrohbeck · 2010-11-22 19:45 · Score: 1

Who cares as long as my core count is bigger than yours.

--
thegodmovie.com - watch it

1,000,000 cores! by EricX2 · 2010-11-21 19:10 · Score: 1

Why have 1000 cores when you can have 1 MILLION CORES, (all running applications that can barely take advantage of 1 or 2)

Re:1,000,000 cores! by Macrat · 2010-11-21 19:26 · Score: 1

Why have 1000 cores when you can have 1 MILLION CORES, (all running applications that can barely take advantage of 1 or 2)
Your computer only runs 1 application at a time?
Re:1,000,000 cores! by wvmarle · 2010-11-21 19:31 · Score: 1

While scalable from a computing pov (data exchange, addressing, whatnot) I can imagine that it's not scalable from a physical pov: power supply, size, heat dissipation, and getting your signals to and from the chip over longer and longer distances.
The last part is already becoming an issue due to the long cable problem: at 3 GHz, a signal travels only about 10 cm before the next signal is produced. One core communicating with another over a distance of just 5 cm would have the problem that the data from one core arrives only halfway through the next core's cycle.
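
The 10 cm figure can be checked with a few lines. This sketch assumes signals propagate at the vacuum speed of light, which is an upper bound; real on-chip and board-level signals are slower, so the actual distances are shorter. The `velocity_factor` parameter is a hypothetical knob for that, not a measured value.

```python
# Distance a signal can cover in one clock cycle.
# Assumption: propagation at the vacuum speed of light (an upper bound).

SPEED_OF_LIGHT_M_PER_S = 299_792_458


def distance_per_cycle_cm(clock_hz, velocity_factor=1.0):
    """Centimetres travelled during one clock period."""
    return SPEED_OF_LIGHT_M_PER_S * velocity_factor / clock_hz * 100.0


def cycles_to_cross(distance_cm, clock_hz, velocity_factor=1.0):
    """Clock cycles that elapse while a signal covers distance_cm."""
    return distance_cm / distance_per_cycle_cm(clock_hz, velocity_factor)


per_cycle = distance_per_cycle_cm(3e9)
half = cycles_to_cross(5.0, 3e9)

print(f"distance per cycle at 3 GHz: {per_cycle:.2f} cm")
print(f"cycles needed to cross 5 cm: {half:.2f}")
```

At 3 GHz the best case is just under 10 cm per cycle, so a 5 cm hop costs about half a cycle before any slower-than-light propagation is accounted for.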
Re:1,000,000 cores! by jimicus · 2010-11-21 22:26 · Score: 1

Why have 1000 cores when you can have 1 MILLION CORES, (all running applications that can barely take advantage of 1 or 2)
Your computer only runs 1 application at a time?
No, but it certainly doesn't run 1000 applications at a time. And even if it did, I want my application to finish more quickly. If it's using 98% of the available CPU time on one core, it won't get dramatically quicker if the other 2% is farmed out to other processors.
As we all know, the solution is to rewrite my application so it lends itself to multi-processing more efficiently. Great if you're working with a problem which has a multi-processor-friendly solution (which may not be the solution currently implemented, for whatever reason), not so great if it doesn't.
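
The point about the 98% / 2% split is what Amdahl's law describes, although the comment does not name it. A minimal sketch, assuming the 2% is perfectly parallelisable and the 98% is strictly serial:

```python
# Amdahl's law: speedup is capped by the fraction of work that stays serial.
# Assumption: the parallel part scales perfectly and has no overhead.

def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup when only parallel_fraction of the work can be spread out."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# The case from the comment: 98% stuck on one core, 2% farmed out.
mostly_serial = amdahl_speedup(0.02, 1000)

# The opposite case: 98% of the work is parallelisable.
mostly_parallel = amdahl_speedup(0.98, 1000)

print(f"2% parallel on 1000 cores:  {mostly_serial:.3f}x")
print(f"98% parallel on 1000 cores: {mostly_parallel:.1f}x")
```

With only 2% of the work spread out, 1000 cores buy a speedup of about 2%. Even with 98% parallel work, the remaining serial 2% keeps the speedup below 50x, which is the rewrite problem described above.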
Re:1,000,000 cores! by TheRaven64 · 2010-11-22 00:06 · Score: 1

Most of the time, yes. Of course, it also has a few dozen others blocked waiting for I/O...

--
I am TheRaven on Soylent News
Re:1,000,000 cores! by mwvdlee · 2010-11-22 00:36 · Score: 1

It may not run 1000 applications at a time, but is that because the OS was designed to combine multiple different tasks into single applications?
An OS could easily split out daemon tasks into smaller units and you'd go a long way to using up the 1000 cores. May not be an efficient strategy, but it may be worth considering if you've got 1000 cores hanging around.

--
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?

accurate representation by pyronordicman · 2010-11-21 19:12 · Score: 5, Interesting

Having attended this presentation at Supercomputing 2010, for once I can say without a doubt that the article captured the essence of reality. The only part it left out is that the interconnect between all the processing elements uses significantly less energy than that of the previous 80-core chip; I think the figure was around 10% of chip power for the 48-core, and 30% for the 80-core. Oh, and MPI over TCP/IP was faster than the native message passing scheme for large messages.

How many... what's next? by c0lo · 2010-11-21 19:12 · Score: 1

"It's a lot harder than you'd think to look at your program and think 'how many volts do I really need?'" he [Mattson] said.

First it was RAM (640 KB should be... doh), then M/GHz, then watts, now it's volts... so, what's next?
(my bet... returning to RAM and the advent of x128)

--
Questions raise, answers kill. Raise questions to stay alive.

Re:How many... what's next? by NoZart · 2010-11-22 00:17 · Score: 1

kilocore, megacore, gigacore?
Re:How many... what's next? by swilly · 2010-11-22 04:55 · Score: 1

kibicore, mebicore, gibicore?
Fixed that for you.

Workaround. by miffo.swe · 2010-11-21 19:13 · Score: 1

Am I the only one feeling this is just a foray into multicore chips because they hit a brick wall when it comes to faster single-core CPUs? While I like the thought of, say, 8 cores or something, I'd much rather have those 8 cores being faster than having a frigging supercomputer under my desk.

--
HTTP/1.1 400

Re:Workaround. by miffo.swe · 2010-11-21 20:06 · Score: 1

Not really a supercomputer sitting under my desk by any means, unless you talk about a supercomputer from over 20 years ago, and even then it wouldn't be as fast at I/O.
The supercomputers of 2000 are pretty impressive if you ask me, and we are nowhere near getting performance even remotely like theirs, nor will we be for a long time.
http://www.top500.org/list/2000/11/

--
HTTP/1.1 400

Re:I can put a thousand cores on a chip... by smash · 2010-11-21 19:14 · Score: 1

just throw more cache at it :D

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Workaround, yeah by fnj · 2010-11-21 19:21 · Score: 1

Er, yeah, pretty much everyone knows they have no practical way to make the clock speed much faster. The only thing they can do is proliferate cores beyond all reason. Nobody has the slightest idea how to take advantage of that many cores in normal household use and even most workstation use.

Re:Workaround, yeah by miffo.swe · 2010-11-21 19:55 · Score: 1

Let's hope some more work is put into light-based computers. Those hold the promise of much faster single-core CPUs. Current efforts seem wasted when you go beyond 10-15 cores. Mostly you'll only have a bunch of dead cores waiting for something useful to do, while not being able to help much when you really need a spike in CPU power.

--
HTTP/1.1 400
Re:Workaround, yeah by wierd_w · 2010-11-21 20:19 · Score: 5, Informative

You've obviously never worked in Aerospace.
I can bring a quad core Xeon system to its knees running Catia. (I mean, 100% saturation, all 4 cores, with IO contention.) I do it fairly regularly too.
Might have something to do with the NP-hard problem of resolving tangencies on extremely complex NURBS surfaces (aircraft skins).
Granted, that is not a "normal" workstation, but I would be VERY happy indeed to have a 1000-core workstation at my disposal. Maybe then I could actually work with Gulfstream's horrible part models where they include literally the whole god-damn aircraft's surface geometry in the digital part model for a fucking bolt. (Guess what happens when you load several such models, and digitally assemble them. I have seen a 64-bit workstation allocate over 8 GB of swap because of them and their dumbassery.)
Now, if I could get one with over 1TB of RAM installed too, then I'd be in business.
Re:Workaround, yeah by MichaelSmith · 2010-11-21 22:34 · Score: 4, Interesting

In my field it would be real time conflict detection between aircraft. The better your conflict detection, the more aircraft you can pack in to small volumes of space. There is a lot of money in that.

--
http://michaelsmith.id.au
Re:Workaround, yeah by Vectormatic · 2010-11-22 00:44 · Score: 1

You've obviously never worked in Aerospace.
or any kind of real-time data processing. My current project deals with routing, processing, aggregating and archiving traffic measurement data from nearly 100,000 sensors each minute, in XML form. Granted, the data processing itself isn't that much of an issue, but the amount of database access we generate doing this takes about 5 Xeon cores going full bore just to keep up with the live data feed for aggregating, never mind the 9 months of data backlog we have to work through.
Not using the most brain-dead scripting language ever invented would have helped though, but as luck has it, the built-in database somewhat compensates. I just wish we could do a full rebuild in Java+Oracle or something.

--
People, what a bunch of bastards
Re:Workaround, yeah by theGreater · 2010-11-22 06:47 · Score: 1

Now, if I could get one with over 1TB of RAM installed too, then I'd be in business.
If you're serious about this and your time is worth enough money, try a Dell r910 with 48 threads, 1TB of RAM, and 16 SSDs. More than enough PCIe I/O to add in a bunch of GPU's.
Re:Workaround, yeah by iinlane · 2010-11-22 07:42 · Score: 1

Maybe then I could actually work with Gulfstream's horrible part models where they include literally the whole god-damn aircraft's surface geometry in the digital part model for a fucking bolt.
1000 core processor is not a cure for human ignorance.
Re:Workaround, yeah by wierd_w · 2010-11-22 11:50 · Score: 1

The problem is that the geometry "kinda" is needed by the bolt's geometry through inheritance.
The way Catia works is that you can generate a surface, then use that surface to do boolean subtractions/additions from other surfaces/volumes. They do this to make parts that interface perfectly with another skin.
The problem is that the bolt is modeled in place, around a hole, modeled in a part that mates to a surface, which is derived from a split created from the skin of the main body, as well as surfaces from about a dozen other parts. This means that the hole's inside edges are included in the bolt's model, which gets geometry imported from the part it interfaces to, on and on up the hierarchy.
As such, the bolt has external dependencies that are all over the damn place.
Catia has measures to help you "Clean" your models, by isolating geometry and replacing this long chain of inheritance with a 'datum' object, which GREATLY reduces overhead, but Gulfstream simply chooses not to do this, because the people in their engineering departments are incompetent/morons. (Don't even get me started on their lack of understanding of basic GD&T concepts.)
The software is not really at fault. BOEING uses the same software, and their models are sane and useful, BECAUSE they have a sanitizing process they go through before releasing digital part models that cleans out all these down-stream dependencies. (both for optimization purposes, as well as security ones. Imagine the amount of harm that could be done if I were to "leak" a part model for a bolt, that also happens to contain the entire outer skin definition of a government aircraft. It's a NURBS surface that is the exact mathematically ideal condition for the skin, and thus could be used to derive its radar profile, among other not so good things.) It is part of their "Publication" process. Gulfstream? Not so much.
Personally, I prefer to avoid using whole sections of the outer skin to do splits, and create localized datum features of just the minimum contact areas in question, greatly reducing the overhead of using the models I create. I try to make use of isolated "datum" geometry whenever possible, since it reduces the number of things that can break down the line anyway, and I don't like making work for myself to deal with later.
When it comes to getting Gulfstream to stop being stupid though, I am merely an office jockey, and while I have raised issues with their engineering department, this particular issue ALWAYS falls on deaf ears, because it would require them to change their release policies in a non-trivial manner. In short, the MBAs up the tree would never go for it, despite the obvious benefits it has downstream. It's like herding cats; it's a futile effort to complain. They'll just find another manufacturing partner to do their outsourced work that won't complain, because that's cheaper than fixing the problem.
As such, I have to settle for the "severely sub-optimal" solution of "Dealing" with the shit they send me.

Yep! except for... by PaulBu · 2010-11-21 19:22 · Score: 1

Intel would not have (presumably!) to re-invent *Intel* Paragon! :)

We can throw a Connection Machine in there, and really date ourselves -- but it's still nice to know that finally CMOS tech has caught up with late 80s comp. arch. advances!

And then, do not get me started on the original Tera, with its multithreading it seemed to be much better bang for the buck of chip real estate than currently accepted multicore solutions. But what would I know...

Paul B.

And here goes RAM bandwidth by snikulin · 2010-11-21 19:22 · Score: 1

Again...
Alternatively, NUMA on a single CPU (different memory channels connected to different cores).
It would be a bitch to program (but fun nevertheless).

gpu's have been doing this for years... by Anonymous Coward · 2010-11-21 19:23 · Score: 1, Interesting

given that for years GPUs have had hundreds of processors (the power of CUDA is awesome!), this is long overdue by lazy CPU designers like Intel....

Re:gpu's have been doing this for years... by lennier1 · 2010-11-21 19:39 · Score: 1

True. CPUs like that would be a godsend for tasks like 3D rendering (entertainment industry, architectural visualization, ...).
Re:gpu's have been doing this for years... by wierd_w · 2010-11-21 20:22 · Score: 1

Aerospace mockup and simulation... (Imagine, a "down to the last bolt" NURBS model, with dynamic stress simulation... Can theoretically be done now, but I have never seen a workstation in any production engineering department get closer than perhaps a wing segment before crashing the workstation.)
I greedily await such a future.
Re:gpu's have been doing this for years... by lennier1 · 2010-11-21 20:55 · Score: 1

Here's an example of how it's used in the entertainment sector:
http://www.youtube.com/watch?v=JhJauu_vB2A
Basically linking a 3D suite up to a camera to get motion data, reference point positioning and other information to allow more seamless integration and even low-quality previews during the shoot. It's the technique from "Avatar", which was combined with two-stage motion capturing to make those shots possible ( http://www.wired.com/magazine/2009/11/ff_avatar_5steps/ ).
Even with twin-hexacore workstations on the market and using GPU-based processing as well, they're still in dire need of more.
Re:gpu's have been doing this for years... by Bengie · 2010-11-22 01:45 · Score: 2, Informative

"Lazy CPU designers" hah!
GPUs are severely limited on their types of tasks. Instead of a 1600 core GPU, pretend your CPU has a single large SIMD register that can hold 1600 floats. Now, it would be great at crunching large matrices of floats and utterly suck at everything else. That's a GPU in a nutshell. GPUs are absolutely horrible at branches. If one core takes a branch, every core in that core's group must stall and wait for the branch to finish. All cores must be working on the same instruction at the same time and branches mess with that. GPUs != CPUs
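One way to picture the branch problem described above is a group of lanes that must all step through the same instruction. The Python sketch below is only an illustration under assumed numbers: `run_group`, the cycle costs, and the even/odd branch are hypothetical and not taken from any real GPU. When the lanes disagree on a branch, the group pays for both sides, with the inactive lanes masked off.

```python
def run_group(values, cost_then=4, cost_else=6):
    """Simulate one lockstep group of lanes.

    Returns (results, cycles). The cycle costs are made-up numbers
    chosen only to show the effect of divergence.
    """
    # Each lane evaluates the branch condition on its own value.
    mask = [v % 2 == 0 for v in values]
    results = list(values)
    cycles = 0

    # "then" side: executed if at least one lane wants it.
    # Lanes with mask False sit idle but still spend the cycles.
    if any(mask):
        cycles += cost_then
        results = [v * 2 if m else r for v, m, r in zip(values, mask, results)]

    # "else" side: executed if at least one lane wants it.
    if not all(mask):
        cycles += cost_else
        results = [r if m else v + 1 for v, m, r in zip(values, mask, results)]

    return results, cycles


print(run_group([2, 4, 6, 8]))  # all lanes agree: ([4, 8, 12, 16], 4)
print(run_group([1, 2, 3, 4]))  # lanes diverge:   ([2, 4, 4, 8], 10)
```

With uniform input the group pays for one side only; with mixed input it pays for both, which is the stall being complained about.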
When Intel last talked about their 80 core CPUs, they talked about getting rid of cache-coherency in order to scale, AMD also recognizes this. This would mean OSs and most Apps would NOT be backwards compatible even if still using x86 instructions. Although, it would be possible to run apps in a VM that emulated cache-coherency.

Biggest Hurdle Not Cores by Lokeh · 2010-11-21 19:25 · Score: 1

I took an intro to ECE class last fall that was basically just a parade of people coming in and talking about the kinds of things that they do as an engineer. One of the speakers talked about how one could have all of these cores, but that coding to take advantage of all of them was such a difficult task that it's hard to find any software that takes advantage of the few cores we're shipping today, let alone a hundred cores or a thousand cores. Apparently he was working on a project - a sort of wrapper? I think he mentioned AI but I don't know if he was just blowing smoke up our ass at that point - to help streamline writing for thousands of cores. I don't know how much truth is in that but I found it interesting, and would love to hear from someone who actually codes these kinds of things.

Re:Biggest Hurdle Not Cores by pyronordicman · 2010-11-21 19:36 · Score: 1

It's true that many desktop/server applications don't have the parallelism available to make use of many cores (i.e. > 2). However, this chip was designed with scientific applications in mind, where thousands, if not millions, of calculations can be executed simultaneously. Many of these problems are readily mapped to programming models that take advantage of many cores, such as message passing or SIMD/vector processing. For those programs that don't have available parallelism, there's not a whole lot to do with extra processing power. You can sometimes speculatively execute code, but that's a tricky problem for the compiler and runtime to figure out.
Re:Biggest Hurdle Not Cores by Anonymous Coward · 2010-11-21 19:38 · Score: 3, Insightful

Basically, we are going to need compilers that automatically take advantage of all that parallelism without making you think about it too much, and programming languages that are designed to make your programs parallel-friendly. Even Microsoft is finally starting to edge in this direction with F# and some new features of .NET 4.0. Look at Haskell and Erlang for examples of languages that take such things more seriously, even if the world takes them less seriously.
I don't know about AI, but almost certainly we will end up with both compilers and virtual machines that are aware of parallelism and try to take advantage of it whenever possible.
But still, certain algorithms just aren't very friendly to parallelism no matter what technology you apply to them.

I hope he works for Mr. Coffee or GE. by Anonymous Coward · 2010-11-21 19:26 · Score: 1

With all that heat, it would be nice to have a skillet that could cook a samwitch or eggs or brew coffee. I lived on a Mr. Coffee machine for over 3 years, boiling vegetables or tea, and my only regret is that, while it kept the room warm and provided the occasional hot towel bath, it would have been nice if its heat source had been an embedded computer rather than a wasteful heating element. I know some people used a self-throttling Pentium 4 to boil food from their waterblock and such. Why not?

Re:One question? by Macrat · 2010-11-21 19:28 · Score: 1

Just how small does your penis need to be to need a 1,000 cores?

That's what it takes to run Flash these days.

You wanna impress me? by Anonymous Coward · 2010-11-21 19:30 · Score: 2, Funny

Make a processor with four asses.

Future of Programming by igreaterthanu · 2010-11-21 19:31 · Score: 4, Interesting

This just goes to show that if you care about having a future career (or even just continuing with your existing one) in programming, Learn a functional language NOW!

--
I dream of a nation where a man is not judged by his skin color but by an number assigned by a credit rating agency.

Re:Future of Programming by jamesswift · 2010-11-21 20:37 · Score: 2, Interesting

It's quite something isn't it, how so few people on even slashdot seem to get this. Old habits die hard I guess.
Years ago a clever friend of mine clued me into how functional was going to be important.
He was so right and the real solutions to concurrency (note, not parallelism which is easy enough in imperative) are in the world of FP or at least mostly FP.
My personal favourite so far is Clojure which has the most comprehensive and realistic approach to concurrency I've seen yet in a language ready for real world work.
The key thing to learn from it is how differently you need to approach your problem to take advantage of a multi-core world.
Clojure itself may never become a top-5 language but the way it approaches the problem surely will be seen in other future FP langs.

--
i wish i could stop
Re:Future of Programming by Anonymous Coward · 2010-11-21 20:43 · Score: 5, Insightful

Learn a functional language. Learn it not for some practical reason. Learn it because having another view will give you interesting choices even when writing imperative languages. Every serious programmer should try to look at the important paradigms so that he can freely choose to use them where appropriate.
Re:Future of Programming by rrohbeck · 2010-11-21 20:59 · Score: 1

All you need is a library that gives you worker threads, queues and synchronization primitives. We've all learned that stuff at some point (and forgot most of it.)
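For reference, the "worker threads, queues and synchronization primitives" recipe looks roughly like this in Python's standard library. This is a minimal sketch, not a recommendation: `run_workers` and the squaring job are hypothetical placeholders, and the lock around the shared list is exactly the kind of shared state the replies further down argue about.

```python
import queue
import threading


def run_workers(jobs, num_workers=4):
    """Square every job using a pool of worker threads fed from a queue."""
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()  # guards the shared results list

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: no more work for this thread
                tasks.task_done()
                break
            with results_lock:        # synchronize access to shared state
                results.append(item * item)
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        tasks.put(job)
    for _ in threads:                 # one sentinel per worker
        tasks.put(None)
    tasks.join()                      # wait until every item is processed
    for t in threads:
        t.join()
    return sorted(results)            # completion order is not deterministic


print(run_workers(range(5)))  # [0, 1, 4, 9, 16]
```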

--
thegodmovie.com - watch it
Re:Future of Programming by loufoque · 2010-11-21 21:16 · Score: 1

Sorry, but while functional programming style is indeed the future of HPC (with C++), functional languages themselves aren't. Read the research papers of the field and see for yourself.
Re:Future of Programming by igreaterthanu · 2010-11-21 21:34 · Score: 1

Learn it because having another view will give you interesting choices even when writing imperative languages.
Amen! Especially if you are going to use parallelism libraries in imperative languages, the ability to think differently is crucial and gives a lot of insight.

--
I dream of a nation where a man is not judged by his skin color but by an number assigned by a credit rating agency.
Re:Future of Programming by JAlexoi · 2010-11-21 22:13 · Score: 1

Bollocks! And donkeycock!
It's one of those fallacies that your code has to scale to sizes of Facebook. Read less high scalability blogs because it's affecting you....
Your code will be rewritten 100 times in different languages before it will be the centrepiece of anything at Google's scale...
Re:Future of Programming by hughperkins · 2010-11-21 23:39 · Score: 1

I spent some time looking at a few just in case, on the basis that it's probably much easier to learn now, whilst younger, rather than in 10 or 20 years, or whenever they happen to become important.
Haskell monads seem to me to be pretty tricky to get one's head around.
I suspect that if and when fp becomes mainstream, in the way that Java and C# are right now for example, they will be much easier to understand; but I imagine many of the concepts from Haskell et al will stay the same.
Note that a whole bunch of fps use only a single core for now. eg Erlang uses only a single-core out of the box at the moment, unless something has changed since I last checked. Lots of threads sure, but they all run on the same core...
Re:Future of Programming by Twinbee · 2010-11-21 23:47 · Score: 1

Can you give me the most well known app/apps that have been programmed with a functional language? As far as I know, they still go the imperative route.

--
Why OpalCalc is the best Windows calc
Re:Future of Programming by TheRaven64 · 2010-11-22 00:11 · Score: 1

ZeroMQ is not a programming model, it's a library. Pi-calculus, communicating sequential processes, actor model, and transactional memory are all examples of models for concurrent programming. ZeroMQ is not.

--
I am TheRaven on Soylent News
Re:Future of Programming by TheRaven64 · 2010-11-22 00:12 · Score: 1

Incidentally, the second group is responsible for most of the bad press surrounding C++, Java, and C# (C++ especially).
I'm fairly sure that the C++ standards committee is responsible for most of the bad press surrounding C++. It doesn't matter how many other languages you learn, C++ still doesn't look any better.

--
I am TheRaven on Soylent News
Re:Future of Programming by TheRaven64 · 2010-11-22 00:13 · Score: 1

Haskell monads seem to me to be pretty tricky to get one's head around.
Until you start using them for transactional memory, then you wonder how you ever wrote scalable code without them.

--
I am TheRaven on Soylent News
Re:Future of Programming by TheRaven64 · 2010-11-22 00:15 · Score: 1

Threads are a really bad way of writing scalable applications. They are a shared-everything model, meaning that the number of interactions that you have to reason about is the number of potentially-shared objects in your system, multiplied by the number of threads, multiplied by the number of states per thread.

--
I am TheRaven on Soylent News
Re:Future of Programming by TheRaven64 · 2010-11-22 00:18 · Score: 1

Functional programming style isn't the important bit. The advantage that functional programming has is the ability for the compiler to automatically parallelise it. If your code has no side effects, the compiler can profile each call, then run the expensive ones in parallel and use monads wrapping STM to add implicit ordering at the end.
If you write the same code in an imperative programming language, then the compiler does not know that your functions have no side effects, so you need to provide various annotations (OpenMP / HMPP), which gives another place for bugs to hide.
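A small Python sketch of the point about side effects, assuming a hypothetical pure function `collatz_steps` and helper `parallel_map`. Because the function touches no shared state, the calls can be handed to a pool in any order and the result matches the sequential one. Note that nothing in Python checks the purity; the programmer is asserting it, which is the gap described above. (In CPython a thread pool will not speed up CPU-bound work; the sketch only shows that the results are order-independent.)

```python
from concurrent.futures import ThreadPoolExecutor


def collatz_steps(n):
    """Pure function: the result depends only on n, no shared state touched."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def parallel_map(fn, items, workers=4):
    """Apply fn to every item using a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))


numbers = range(1, 20)
sequential = [collatz_steps(n) for n in numbers]
parallel = parallel_map(collatz_steps, numbers)
print(parallel == sequential)  # True
```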

--
I am TheRaven on Soylent News
Re:Future of Programming by TheRaven64 · 2010-11-22 00:26 · Score: 1

That's a terrible argument. By that logic, no one would have adopted C (most of the well-known apps are written in assembly language), or any high-level language.
That said, the most popular implementation of XMPP (Jabber) is eJabberd, which is written in Erlang. Erlang is a hybrid language, which implements the communicating sequential processes formalism for concurrency and functional programming within each process. It scales incredibly well and has been used for a lot of large-scale deployments.

--
I am TheRaven on Soylent News
Re:Future of Programming by Twinbee · 2010-11-22 01:21 · Score: 1

Not so terrible considering Haskell et al. has been around quite a while now (since 1990), though admittedly, GPGPU isn't nearly as old. Still, I would expect to see at least one I knew. Only barely heard of Jabber.

--
Why OpalCalc is the best Windows calc
Re:Future of Programming by DoofusOfDeath · 2010-11-22 01:52 · Score: 1

The necessity of functional programming when presented with highly parallel computers has been argued for literally decades, and yet it still hasn't taken over the world.
It's not really clear to me why imperative languages have such sticking power in HPC programming, but they do.
Re:Future of Programming by TheRaven64 · 2010-11-22 02:42 · Score: 1

Haskell has been around for a while, but fast implementations (GHC) are still very recent. It's also only been important for code to be scalable for a little while.
I suspect that you can't name a single well-known app that's written in Fortran, but the language still completely dominates HPC. Erlang is used in quite a lot of telecoms switching systems too. You've probably made telephone calls that have been routed by software written in Erlang, but you'd be completely unaware of it.
Using a language that is well suited to parallelism usually comes with a trade-off. Erlang code, for example, tends to perform worse than equivalent C code on a single processor. The advantage is that you can throw more cores at it and watch it get faster. Until about 5 years ago, pretty much the only machines with more than one core were expensive server or workstation class systems, which were a very small market. Even these only had 2-4 cores, and that wasn't really enough for scalability to be more important than single-threaded speed. When you start getting 16 or so cores, a 50% single-threaded performance penalty might be well worthwhile if it lets you scale up to all of the cores easily. Until multicore became common, no one cared about parallelism for mainstream applications, so languages like C were the obvious choice.
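The break-even arithmetic behind that claim can be sketched as follows. This is a back-of-the-envelope model, not a benchmark: `relative_throughput` is a hypothetical helper, the baseline is single-threaded code with speed 1.0, and the `parallel_fraction` parameter (Amdahl's law) is an added assumption that is not part of the argument above.

```python
def relative_throughput(cores, penalty=0.5, parallel_fraction=1.0):
    """Throughput of a scalable-but-slower runtime relative to
    single-threaded code running at speed 1.0.

    penalty:           fraction of single-threaded speed lost (0.5 = 50%)
    parallel_fraction: share of the work that scales across cores
    """
    serial = 1.0 - parallel_fraction
    speedup = 1.0 / (serial + parallel_fraction / cores)  # Amdahl's law
    return (1.0 - penalty) * speedup


print(relative_throughput(2))    # 1.0  (break-even with a 50% penalty)
print(relative_throughput(4))    # 2.0
print(relative_throughput(16))   # 8.0
```

With 2-4 cores the gain is between nothing and 2x, so single-threaded speed still wins the argument; at 16 cores the same 50% penalty buys up to 8x if the work scales well.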
Something like eJabberd can almost make use of as many cores as there are concurrent clients. It is also trivial to deploy on a cluster (rather than a single machine), so if you get more clients than a single machine can handle you can just throw more hardware at it.

--
I am TheRaven on Soylent News
Re:Future of Programming by ultranova · 2010-11-22 10:22 · Score: 1

All you need is a library that gives you worker threads, queues and synchronization primitives.

And then either your application runs very slowly, because you synchronize too much and get lock contention or even deadlocks, or contains lots of bugs that are near-impossible to debug because you started doing "clever" things with synchronization.
A multithreaded program is for all intents and purposes like an OS without memory protection, and inherently unstable for the exact same reason: everything is shared unless specifically declared nonshared, and there's no way to enforce the rules. It's also slow because of lock contention at both application and CPU level. We need a better model, where only the things that need to be shared are shared and errors can be tolerated and recovered from. A transactional memory system, perhaps?

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Future of Programming by NemoinSpace · 2010-11-22 12:33 · Score: 1

funny, i get the opposite impression. It seems to me any point in my career would have been as good a starting point as any other.
Re:Future of Programming by petermgreen · 2010-11-23 07:49 · Score: 1

Until multicore became common, no one cared about parallelism for mainstream applications, so languages like C were the obvious choice.
And even now the slowest machines your users are likely to have are single cores. This matters because for a lot of programmers the aim is to make it run acceptably on the slowest machines their users use.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register

Re:Imagine by JWSmythe · 2010-11-21 19:33 · Score: 2, Interesting

Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.

I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available. Linux can only go to 256 cores. Windows 2008 tops out at 64. But hey, if they did come to market, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).

--
Serious? Seriousness is well above my pay grade.

I wonder what kinda cooling the thing needs by qzhwang · 2010-11-21 19:33 · Score: 1

Would be interesting to know if this helps with performance/power ratio against (potential many core/cpu) ARM servers.

Re:One question? by JWSmythe · 2010-11-21 19:37 · Score: 2, Insightful

The only thing I'd be compensating for is the fact I can't do calculations at Exaflop rates in my head.

Just like my car only compensates for the fact I can't run at 165mph. :)

--
Serious? Seriousness is well above my pay grade.

1000 cores is easy! by Jason+Kimball · 2010-11-21 19:46 · Score: 5, Funny

1000 cores on a chip isn't too bad. I already have one with 110 cores.

That's only 10 more cores!

Re:1000 cores is easy! by prider · 2010-11-21 21:52 · Score: 2, Funny

Only 10 types of people caught that..
Re:1000 cores is easy! by gmhowell · 2010-11-21 22:33 · Score: 1

That has got to be the funniest thing I've read here in a month.

--
Jesus was all right but his disciples were thick and ordinary. -John Lennon
Re:1000 cores is easy! by roman_mir · 2010-11-22 00:26 · Score: 3, Funny

That has got to be the funniest thing I've read here in a month.
- jesus, that must have been one sad month.

--
You can't handle the truth.
Re:1000 cores is easy! by tkjtkj · 2010-11-22 03:25 · Score: 1

yes.

--
"There are 11 kinds of people: those who know binary, those who don't, and those who could not care less!"
Re:1000 cores is easy! by Jason+Kimball · 2010-11-22 10:17 · Score: 1

getting beef for crossposting this same comment on another awesome aggregation website
7-methyltheophylline should be more trusting.
Re:1000 cores is easy! by gmhowell · 2010-11-22 18:40 · Score: 1

That has got to be the funniest thing I've read here in a month.
- jesus, that must have been one sad month.
Sad website.

--
Jesus was all right but his disciples were thick and ordinary. -John Lennon

Inter-core communication? by VincenzoRomano · 2010-11-21 19:46 · Score: 1

I wonder how the inter-core communication will scale without packing 1000+ layers in the die.

--
Maybe Computers will never be as intelligent as Humans.
For sure they won't ever become so stupid. [VR-1988]

Instruction set... by KonoWatakushi · 2010-11-21 19:47 · Score: 3, Insightful

"Performance on this chip is not interesting," Mattson said. It uses a standard x86 instruction set.

How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel.

There is no point in tying these massively parallel architectures to some ancient ISA.

Re:Instruction set... by Arlet · 2010-11-21 20:53 · Score: 1

There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.
The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.
The low number of registers is not a problem. In fact, it may even be an advantage to scalability. A register is nothing more than a programmer-controlled mini cache in front of the memory. I'd rather have few registers, and go directly to memory. The hardware can then scale to include bigger and faster caches, so that memory access is just as fast as a register access, without the software having to deal with register allocation and save/restore.
Re:Instruction set... by loufoque · 2010-11-21 21:18 · Score: 1

Just take a look at tilera. It's not open though.
Re:Instruction set... by Splab · 2010-11-21 21:42 · Score: 1

Err, did you just claim cache is as fast as a register access?
Re:Instruction set... by kohaku · 2010-11-21 21:47 · Score: 4, Insightful

There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.
Uh, scalable? Not really... The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.

The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.
Whoa nelly. Compact? I'm not sure where you got that idea, but it's called CISC and not RISC for a reason! If you think x86 is compact, you might be interested to find out that you can have a fifteen byte instruction. In fact, on the i7 line, the instructions are so complex it's not even worth writing a "real" decoder- they're translated in real-time into a RISC instruction set! If Intel would just abandon x86, they could reduce their cores by something like 50%!
The low number of registers _IS_ a problem. The only reason there are only four is because of backwards compatibility. It definitely is a problem for scalability, one cannot simply rely on a shared memory architecture to scale vertically indefinitely, you just use too much power as a die size increases, and memory just doesn't scale up as fast as the number of transistors on a CPU.
A far better approach is to have a decent model of parallelism (CSP, Pi-calculus, Ambient calculus) underlying the architecture and to provide a simple architecture with primitives supporting features of these calculi, such as channel communication. There are plenty of startups doing things like this, not just Intel, and they've already products in the market- though not desktop processors. Picochip and Icera to name just a couple, not to mention things like GPGPU (Fermi, etc.)
Really, the way to go is small, simple, low power cores with on-chip networks which can scale up MUCH better than just the old intel method of "More transistors, increase clock speed, bigger cache".
Re:Instruction set... by Anonymous Coward · 2010-11-21 22:33 · Score: 1, Insightful

The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.
Wrong and wrong. The x86 instruction is hardly compact with all its redundancy and roundabout inconsistent operations. ARM Thumb coding is superior in just about every way.
Decoding is also not "solved", it is minimised due to intense optimisation and pre-decoding instructions before they are placed in the cache but that still harms the efficiency of the pipeline (branch misprediction is lethal).

The low number of registers is not a problem. In fact, it may even be an advantage to scalability. A register is nothing more than a programmer-controlled mini cache in front of the memory.
You keep using that word, I don't think it means what you think it means.
The fact that you think x86 is scalable is laughable. You may want to do some research into cache coherency, specifically how for every CPU added, the cache protocol becomes less and less efficient until every physical core you add makes the system slower as every existing CPU is too busy arguing with each other over who owns what cache lines to actually do any computation. The only solution is either to throw out cache coherency which is exactly what most competing ISAs have done or partition the memory so that the system resembles a cluster in a single box more than a multi-CPU computer (NUMA). The second option doesn't work unless you have separate CPU chips, a heap of cores on a single chip will have electrical problems connecting enough pins for several fully independent memory buses.
I can't imagine what exactly "not having many registers" and "scalable" have in common. RISC cores like PPC (which has 64 registers) can be packed more densely on a single chip than x86 cores so it isn't size related. Smaller numbers of registers are correlated with lower performance so it isn't that either.

I'd rather have few registers, and go directly to memory. The hardware can then scale to include bigger and faster caches, so that memory access is just as fast as a register access, without the software having to deal with register allocation and save/restore.
A memory-only machine without programmable registers may work with internal control logic that maps addresses on to registers without direct programming but that sort of thing is generally a bad idea. The control logic is rarely optimal as there isn't time to look ahead and perform a full code analysis, current CPUs already try things like that (called register renaming, modern x86 has a lot of registers; it just maps the 8 programmable ones on to the larger set) and it just doesn't work as well as having the compiler allocate registers appropriately in advance.
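For anyone who hasn't met register renaming, here is a toy Python sketch of the idea. It assumes a hypothetical `rename` helper, 8 architectural registers named `r0`..`r7`, and an unlimited supply of physical registers; real hardware has a finite pool, a free list, and much more bookkeeping.

```python
def rename(instructions, num_arch=8):
    """Map architectural registers onto fresh physical registers.

    instructions: list of (dest, (src, ...)) using names like "r0".
    Returns the same list rewritten with physical names like "p8".
    """
    # Initially architectural register rN lives in physical register pN.
    table = {f"r{i}": f"p{i}" for i in range(num_arch)}
    next_phys = num_arch
    renamed = []
    for dest, srcs in instructions:
        # Sources read whatever physical register currently holds the value.
        phys_srcs = tuple(table[s] for s in srcs)
        # Every write gets a brand-new physical register, so a later reuse
        # of the same architectural name no longer conflicts with earlier ones.
        table[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((table[dest], phys_srcs))
    return renamed


program = [
    ("r0", ("r1", "r2")),  # r0 = r1 op r2
    ("r3", ("r0",)),       # r3 = f(r0), depends on the line above
    ("r0", ("r4", "r5")),  # r0 reused for an unrelated value
]
print(rename(program))
# [('p8', ('p1', 'p2')), ('p9', ('p8',)), ('p10', ('p4', 'p5'))]
```

The third instruction no longer shares a register with the first two, so the hardware is free to run it early; the true dependency between the first and second is kept.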
Oh, and cache can never be as fast as registers, that is, after all, the difference between memory and registers (memory is big and slow, registers are tiny and crazy fast). You'll have to wait until we invent Unobtanium based computers for that to be possible (hardware that is not limited by the speed of light).
Going back to the earlier statement about cache coherency. Registers are local state, the contents of registers are not visible to other CPUs so the CPU doesn't need to worry about coherency problems with those values. Direct memory access is always incoherent and forces the CPU to behave defensively (i.e. slowly). [Direct memory access on x86 is convenient, but that's ultimately because it's a solution to a problem of its own design, the lack of registers]
Re:Instruction set... by Arlet · 2010-11-21 22:35 · Score: 3, Insightful

The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.
That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.

you might be interested to find out that you can have a fifteen byte instruction
So ? It's not the maximum instruction length that counts, but the average. In typical programs that's closer to three. Frequently used opcodes like push/pop only take a single byte. Compare to a DEC Alpha architecture, where nearly every single instruction uses 15 bits just to tell which registers are used, no matter whether a function needs that many registers.
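The arithmetic behind that comparison, as a quick Python sketch. The "closer to three" average is the figure claimed above, not a measurement; `register_bits` and `code_bytes` are hypothetical helpers. Alpha instructions are a fixed 32 bits wide, and with 32 registers each register field needs 5 bits.

```python
ALPHA_INSTRUCTION_BYTES = 4     # fixed-width 32-bit instructions
ALPHA_REGISTER_FIELD_BITS = 5   # 2**5 = 32 addressable registers


def register_bits(operands=3):
    """Bits spent naming registers in a three-operand instruction."""
    return operands * ALPHA_REGISTER_FIELD_BITS


def code_bytes(instruction_count, average_bytes):
    """Total code size for a given average instruction length."""
    return instruction_count * average_bytes


print(register_bits())                              # 15
print(code_bytes(1000, 3))                          # 3000 (x86 at ~3 bytes average)
print(code_bytes(1000, ALPHA_INSTRUCTION_BYTES))    # 4000
```

This treats instruction counts as equal on both sides, which is an assumption; a fixed-width ISA may need more or fewer instructions for the same work.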

If Intel would just abandon x86, they could reduce their cores by something like 50%!
Even if that's true (I doubt it), who cares? The problem is not that Intel has too many transistors for a given area. The problem is just the opposite. They have the capability to put more transistors in a core than they know what to do with. Also, typically half the chip is for the cache memories, and the compact instruction set helps to use that cache memory more effectively.

one cannot simply rely on a shared memory architecture to scale vertically indefinitely
Sure you can. Shared memory architectures can do everything explicit channel communication architectures can do, plus you have the benefit that the communication details are hidden from the programmer, allowing improvements to the implementation without having to rewrite your software. Sure, the hardware is more complex, but transistors are dirt cheap, so I'd rather put the complexity in the hardware.
Re:Instruction set... by Arlet · 2010-11-21 22:48 · Score: 1

Err, did you just claim cache is as fast as a register access?
There's no reason why it shouldn't be. Don't forget that register access comes with the overhead of load/store plus the fact that you may have to save the register when calling a function, and during interrupts/context switches. Direct memory access doesn't have all that overhead, and if you throw enough control logic around it, a small cache can be just as fast.
Memory also offers the possibility of storing other things than 32/64 bit integers, such as character strings and local structs, so any optimization done to aggressively cache local stack access will also benefit that kind of code. Try doing an efficient strcpy() on an ARM for instance.
Re:Instruction set... by kohaku · 2010-11-21 23:24 · Score: 3, Interesting

That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.
It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed" which isn't even a viable approach anymore.
When you said "Compact" I assumed you meant the instruction set itself was compact rather than the average length- I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially considered against things like ARM's Thumb-2.
If you take a look at Nehalem's pipeline, there's a significant chunk of it simply dedicated to translating x86 instructions into RISC uops, which is only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
Sure you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of on-chip networks- they provide a NUMA on Nehalem and whatever AMD have these days. Indeed, in the article, it says

Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, and Clearspeed's 2-year-old CSX700 had 192.
Re:Instruction set... by Arlet · 2010-11-21 23:49 · Score: 4, Interesting

The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores
Nobody wants to put more cores on a die, but they're forced to do so because they reach the limits of a single core. I'd rather have as few cores as possible, but have each one be really powerful. Once multiple cores are required, I'd want them to stretch the coherent shared memory concept as far as it will go. When that concept doesn't scale anymore, use something like NUMA.
Small, message passing cores have been tried multiple times, and they've always failed. The problem is that the requirement of distributed state coherency doesn't go away. The burden only gets shifted from the hardware to the software, where it is just as hard to accomplish, but much slower. In addition, if you try to tackle the coherency problem in software, you don't get to benefit from hardware improvements.
Re:Instruction set... by wiredlogic · 2010-11-22 00:27 · Score: 1

Before you keep spouting off, consider that every x86 processor since the Pentium Pro has been a RISC core with a section translating the CISC ISA into an internal representation (with 40 registers at its disposal). The x86 ISA is compressed compared to its RISC equivalent. In fact ARM and MIPS provide options to do exactly the same thing to alleviate the bloat of having 32-bit-everything in the instruction stream. Your atypical worst case example doesn't count for much in the real world.

--
I am becoming gerund, destroyer of verbs.
Re:Instruction set... by kohaku · 2010-11-22 00:33 · Score: 3, Interesting

they're forced to do so because they reach the limits of a single core
Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger, you have to go parallel. Even the intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
Essentially there are two approaches to the problem of performance now. Both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few, high performance cores that won't scale horizontally (memory bottleneck)
The second is to have explicit hardware-supported parallelism with many many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message passing cores have failed. I've already given examples of ones currently on the market (Clearspeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable, in fact it was done with the Transputer in the late 80s/early 90s. The only reason it's taking off now is because it's the only way forward as we hit the power wall, and shared memory/superscalar can't scale as fast to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is because they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatibility. In fact they did rather well in i/o control for things such as robotics, and XMOS continues to do well in that space.
The "coherency problem" isn't even part of a message passing architecture because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture in the same way as a shared memory one.
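As a rough illustration of "state is distributed amongst the parallel processes", here is a minimal sketch in which each worker owns a private counter and only ever changes it in response to messages. The names `worker` and `run` are made up for the example, and Python's `queue.Queue` merely stands in for a hardware channel (it is of course built on shared memory underneath).

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Own a private running total; update it only when a message arrives."""
    total = 0  # private state: no other worker ever reads or writes this
    while True:
        msg = inbox.get()
        if msg is None:        # sentinel message: report the result and stop
            outbox.put(total)
            return
        total += msg

def run(values, n_workers: int = 4) -> int:
    """Deal values out round-robin, then add up the workers' replies."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(ib, results)) for ib in inboxes]
    for t in threads:
        t.start()
    for i, value in enumerate(values):
        inboxes[i % n_workers].put(value)
    for ib in inboxes:
        ib.put(None)
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))

print(run(range(1, 101)))  # 5050
```

Nothing here needs a lock or a coherent view of anyone else's data; the price is that the programmer has to decide up front which worker gets which values.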
Re:Instruction set... by kohaku · 2010-11-22 00:50 · Score: 1

I'm not arguing against the design of the processors- a RISC core is probably the best way to implement them. My point is that if you already have a RISC core, the original (x86) instruction set needs to be ditched because decoding it wastes space. The worst case example was supposed to indicate the complexity of the instruction set and its decoding, not the code density. If you're talking about ARM's Thumb/Thumb2, that's a little different- Thumb2 is trivially easy to decode, whereas x86 definitely isn't. I would argue that x86 isn't particularly more dense than a RISC equivalent (let's take Thumb2), since the very long, complex instructions are very infrequently used (although I can't find any statistics). Also, many x86 instructions take many cycles to complete, meaning even more potential pipeline slowdowns.
Re:Instruction set... by Arlet · 2010-11-22 00:52 · Score: 1

There's a third option: combine the best of both worlds. Use powerful, superscalar cores with shared memory, as powerful as you can reasonably make them, and then run clusters of those in parallel.

You just don't program a massively parallel architecture in the same way as a shared memory one.
Well, there's your problem. Many real world applications can only be programmed that way.
And the fact that multiple simple cores are currently on the market doesn't mean they're not failures. The ClearSpeed 192-core CSX700 is on the market, but nobody is buying it. In my book that's a failure.
Re:Instruction set... by kohaku · 2010-11-22 01:05 · Score: 2, Interesting

There's a third option: combine the best of both worlds. Use powerful, superscalar cores with shared memory, as powerful as you can reasonably make them, and then run clusters of those in parallel.
Which is of course what is already being done, but whether that's the best approach remains to be seen. Communication is always the bottleneck in HPC systems, and many processors on chip with a fast interconnect seems to do very well, at least for Picochip (though it is a DSP chip, I think it's a valid comparison).

Well, there's your problem. Many real world applications can only be programmed that way.
Examples? It's just a different model, it doesn't prevent you solving any problem.

The ClearSpeed 192-core CSX700 is on the market, but nobody is buying it
Yeah, that was a shame. The trouble is that HPC-specific chips are just going to get steamrolled on the price point by commodity (x86) hardware. But what about the other three that are selling like hotcakes?
Re:Instruction set... by Anonymous Coward · 2010-11-22 01:32 · Score: 1, Interesting

Thanks for the links! I'll add GreenArrays and XMOS, although the GA interconnect seems overly primitive. Tilera tends to be mentioned a lot.
Re:Instruction set... by Kjella · 2010-11-22 01:40 · Score: 1

the instructions are so complex it's not even worth writing a "real" decoder- they're translated in real-time into a RISC instruction set! If Intel would just abandon x86, they could reduce their cores by something like 50%!
That layer has been there since some of the earliest Pentiums and takes up almost no space on a modern CPU.

--
Live today, because you never know what tomorrow brings
Re:Instruction set... by kohaku · 2010-11-22 01:58 · Score: 1

What layer? Their decoder is the translation, and although it doesn't take up 50%, it's not a trivial amount of space. Not only space, though, but pipeline: an instruction gets 5-deep into the pipeline just in terms of decoding whereas an equivalent A8 pipeline is only 3 stages. Branch penalties on x86 are nasty, which is why there's so much logic (caching decoded instructions, branch statistics, etc.) dedicated to alleviating the problem.
Re:Instruction set... by Arlet · 2010-11-22 02:26 · Score: 2, Insightful

Examples? It's just a different model, it doesn't prevent you solving any problem.
A typical consumer desktop machine, running typical programs for instance. In order to use these cores effectively, all these programs need to be rewritten. Imagine your word processor reformatting a 500 page document on 1000 cores. It's just not going to work very well.
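One way to put a number on "not going to work very well" is Amdahl's law, which is not mentioned in the post but is the usual back-of-the-envelope tool here. The sketch below assumes, purely for illustration, that 90% of the reformatting job can be parallelized; that fraction is a made-up figure, not a measurement.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a job can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Hypothetical workload: 90% parallelizable, 10% strictly serial
for cores in (2, 4, 1000):
    print(cores, round(amdahl_speedup(0.90, cores), 2))
```

With that assumed 10% serial part, 1000 cores give a speedup of a bit under 10x, so most of the chip would sit idle unless the serial part is rewritten away.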
How about the operating system ? 1000 different cores all trying to access a file system on a single physical drive. How are you going to run that efficiently ?
Re:Instruction set... by kohaku · 2010-11-22 02:34 · Score: 1

Well sure, but that's why x86 is still around. It doesn't mean it's better, and it doesn't mean it's going to scale better.
As for multiple processes accessing the same drive, you'd handle it the same way as we do now: with a filesystem layer serving all those processes. That they're running concurrently has no bearing.
Re:Instruction set... by Arlet · 2010-11-22 03:32 · Score: 2, Interesting

The way we do it now is a single filesystem layer which is, at all times, in a single coherent state. With today's shared memory systems, and cache coherency guaranteed by the hardware, that's reasonably easy to accomplish.
The current filesystem concept just doesn't map onto 1000 non-coherent cores.
Re:Instruction set... by renoX · 2010-11-22 05:10 · Score: 1

Wrong and wrong. The x86 instruction is hardly compact with all its redundancy and roundabout inconsistent operations. ARM Thumb coding is superior in just about every way.
Short of 64 bitness, which it doesn't have (and AFAIK, even the new PAE-like mode for ARM doesn't exist in a real product yet)!
If you want a RISC, why not MIPS? It has "64 bit register" and "16 bit instruction" extensions, though I don't know if there are MIPS CPUs which have both extensions..
Re:Instruction set... by Asm-Coder · 2010-11-22 05:36 · Score: 1

Slow down there. *Cache* comes with the extra overhead of load/store.
Example: I want to add two numbers
In registers:
add $1,$1,$2 # Add reg $1 & $2 and store in $1

In cache:
lw $1,0x8001 # Get the first memory location
lw $2,0x8002 # Get the second memory location
add $1,$1,$2 # Perform the addition
sw $1,0x8001 # Store the result back into memory
As you can see, we have a gain of 3 instructions, and probably even more clock cycles, depending on cache read speed, and whether or not the memory of interest is even in the cache (cache miss).

As for your discussion about DMA, it *could* compare with the speed of registers on some implementations. That's a matter of engineering, and as an electrical engineer, I'm going to make the guess that the highest performance will always come from registers, simply because they are located closer to the ALU. (cache can't be significantly closer because the cache is frequently bigger than the rest of the processor. It's a simple matter of weight^H^H^H^H^H^H geometry.) DMA has other advantages that make it useful, as in your example about the ARM processor, but strcpy() on an ARM processor is executed by a dedicated hardwired circuit. (copying memory doesn't actually require performing calculations on the data, so the ALU isn't needed and the data doesn't even really need to hit the processor at all.) For computational work, registers are still (currently) faster.
As I write this, it occurs to me that the above 1000-core processor with the cores distributed throughout the cache field might not allow all the processors to remain busy, but might allow for the paths from the processor to the cache to be short enough to make computation with DMA as fast as operating from the registers, but I don't know how clock skew across the processor might affect this idea.
Re:Instruction set... by Arlet · 2010-11-22 05:50 · Score: 1

With 'direct memory access' I was referring to a CISC instruction set that could directly access memory as one or more of its operands. Like this:
add 0x8001, 0x8002
Which is just a single instruction.
Re:Instruction set... by Rockoon · 2010-11-22 05:52 · Score: 1

Since x86 is a two-operand instruction set, you generally need more instructions to accomplish the same work.
Since x86 has read/modify/write instructions that go directly to RAM, it's far more compact in practice.

--
"His name was James Damore."
Re:Instruction set... by Rockoon · 2010-11-22 06:08 · Score: 1

Lack of registers does hold the architecture back. Unfortunately the number is defined right into the byte script. AMD was able to extend it a bit (by 4). That has helped a lot (in some workloads up to 30%-50%). This is helping reduce pipeline stalls. Unfortunately to use these new 4 registers you need to use 64 bit. Which in many cases wastes cache space.
There are more than 4 new registers. There are in fact 16 new registers, 8 of which are general purpose integer (named r8, r9, r10, r11, r12, r13, r14, and r15) as well as 8 new SSE registers which can perform both SIMD duties and SISD duties of varying sizes (8-bit, 16-bit, 32-bit, and 64-bit integers, as well as 32-bit and 64-bit floats) named xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, and xmm15.

I don't know where you get your info, but you are god damned clueless about AMD64.

--
"His name was James Damore."
Re:Instruction set... by Rockoon · 2010-11-22 06:30 · Score: 1

This isn't really a valid point.

First, when RISC was cleaning up against Intel, AMD's CISC was also cleaning up against Intel.

RISC was leading the performance war only because they *did* get the clock speeds much higher, but they couldn't keep that advantage up because CISC doesn't rule out RISC-like instructions.. the opposite is true! RISC rules out CISC-like instructions.

So what we have now with x86 is a core set of simple instructions that translate directly to uops .. and a lot of other instructions which translate into many uops. The thing with "a lot of other instructions" is that it only takes 1 more bit to double the size of the instruction set, and only 1 more to double it again...

Could x86/x64 be beautified? Sure! Is it worth it? RISC itself is not a panacea; the optimal design DOES have some complex instructions.. Intel/AMD have proven that supporting complex instructions is not a real design problem, while RISC lost because not supporting complex instructions did become a real design problem.

--
"His name was James Damore."
Re:Instruction set... by blair1q · 2010-11-22 08:49 · Score: 1

If Intel would just abandon x86, they could reduce their cores by something like 50%!

They tried that once, with IA-64, although VLIW made cores bigger, not smaller. AMD stuck with x86 and came close to taking over the CPU market. Intel went back to x86 and AMD went back to being a scavenger.
If you want RISC, you can still get PPC chips from FreeScale. Of course, not even Apple will use them any more.
Re:Instruction set... by ultranova · 2010-11-22 10:38 · Score: 1

Err, did you just claim cache is as fast as a register access?

A register set is nothing more than cache with a special namespace for its memory locations. Why should it be handled differently than all other caches (that is, transparently to the program)?

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Instruction set... by ultranova · 2010-11-22 10:44 · Score: 1

And as an electrical engineer, I'm going to make the guess that the highest performance will always come from registers, simply because they are located closer to the ALU.

Please note that the instruction set and the actual physical implementation of the processor are two different things. Even if the latter includes registers, there's no reason why the former should expose this detail; the processor's instruction decoder and scheduler are far more qualified to worry about such technical details than the compiler (which might have been written before the processor even existed).

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Instruction set... by ultranova · 2010-11-22 11:02 · Score: 1

Imagine your word processor reformatting a 500 page document on 1000 cores. It's just not going to work very well.

Format every paragraph in parallel, then simply paste them into the page list one after another. Now, a more fancy layout's going to require more fancy algorithms, but this actual problem is nonetheless a good candidate for parallelization.
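A minimal sketch of that "format each paragraph independently, then paste in order" idea, using Python's standard `textwrap` and `concurrent.futures`. The function names are hypothetical, and CPython threads will not actually speed up CPU-bound wrapping; the point is only the shape of the decomposition.

```python
import textwrap
from concurrent.futures import ThreadPoolExecutor

def format_paragraph(paragraph: str, width: int = 40) -> str:
    """Re-wrap one paragraph; it needs nothing from any other paragraph."""
    return textwrap.fill(" ".join(paragraph.split()), width=width)

def format_document(paragraphs, width: int = 40, workers: int = 8) -> str:
    """Format all paragraphs concurrently, then join them in document order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so pasting them back is trivial
        formatted = list(pool.map(lambda p: format_paragraph(p, width), paragraphs))
    return "\n\n".join(formatted)

print(format_document(["first   paragraph of text", "second one"], width=12))
```

The step that stays serial is deciding which page each formatted paragraph lands on, since that depends on everything before it.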

How about the operating system ? 1000 different cores all trying to access a file system on a single physical drive. How are you going to run that efficiently ?

If they're actually going to hit the disk, it doesn't matter since it's going to take a virtual eternity for the disk to respond. If they'll stay in cache, it's just the matter of finding the node responsible for the file you want and then getting that.
In any case, you can make a shared-memory system on top of 1000 message-passing cores, it's simply going to suck for programs that have poor locality of reference - but then again, modern CPUs already suck for them.
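"Finding the node responsible for the file" could be as simple as hashing the path, so that every core computes the same answer without asking anyone. This is only a sketch under that assumption; `owner_node` is a made-up name, and a real design would also have to cope with nodes joining, failing, and hot files.

```python
import hashlib

def owner_node(path: str, num_nodes: int) -> int:
    """Map a file path to the node assumed to hold its cached state."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then fold into range
    return int.from_bytes(digest[:8], "big") % num_nodes

for path in ("/etc/passwd", "/home/user/report.doc"):
    print(path, "->", owner_node(path, 1000))
```

Any core can then send its read or write request straight to that node as a message, which is the shared-memory-on-top-of-message-passing arrangement described above.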

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Instruction set... by sjames · 2010-11-22 12:41 · Score: 1

Advances in compilers have a huge role to play as well. Many of the current crop of programmers have enough trouble not screwing up multi-threaded programming. They wouldn't make it at all if everything looked like a Transputer where you have to actually plan what data will be needed where and when. SIMD style parallelism is really easy to screw up and get depressing performance. You see it a lot in DSPs because the problems there lend themselves well to the hardware model and the programmers there tend to be above average. It's still quite a challenge to do GP processing on one.
Re:Instruction set... by sjames · 2010-11-22 15:10 · Score: 1

Actually, there's a continuous spectrum with embarrassingly parallel at one end (think any of the @home apps) and strictly serial at the other. For everything in-between, the problems may be latency sensitive, bandwidth sensitive, or both. Some can run in-cache and others make the cache practically worthless.
The difficulty for the HPC specific hardware is that it is extremely good at a subset of the problems and terrible at the others. Because it is low volume, it costs enough that just throwing GP CPUs at the problem often looks like a better answer, especially if the workload may change later.
As for the other 3, they are embedded DSPs. That is, they're targeted at a very specific set of problems that just happen to be very popular (due to wide usefulness). Notably when they are used in devices that need an actual OS, they are typically paired with an embedded x86, ARM, or MIPS to run the OS.

Re:Imagine by Jeremy+Erwin · 2010-11-21 19:54 · Score: 1

Imagine a Beowulf with all of the overhead, and none of the speed.

For simplicity's sake, the team used an off-the-shelf 1994-era Pentium processor design for the cores themselves. "Performance on this chip is not interesting," Mattson said. It uses a standard x86 instruction set.

1000 cores is nothing by Anonymous Coward · 2010-11-21 19:55 · Score: 5, Interesting

Probably in the future 1 million cores will be the minimum requirement for applications. We will then laugh at these stupid comments...

Image and audio recognition, true artificial intelligence, handling data from a huge number of different kinds of sensors, movement of motors (robots), data connections to everything around the computer, virtual worlds with thousands of AI characters with true 3D presentation... etc...etc... will consume all processing power available.

1000 cores is nothing... We need much more.

Re:1000 cores is nothing by Electricity+Likes+Me · 2010-11-21 20:48 · Score: 2, Insightful

1000 cores at 1 GHz on a single chip, networked to 1000 other chips, would probably just about make a non-real time simulation of a full human brain possible (going off something I read about this somewhere). Although if it is possible to arbitrarily scale the number of cores, then we might be able to seriously consider building a system of very simple processors acting as electronic neurons.
Re:1000 cores is nothing by satan666 · 2010-11-22 05:25 · Score: 1

Now you're talking! That's exactly what we need to run Windows efficiently!
Re:1000 cores is nothing by pitchpipe · 2010-11-22 07:03 · Score: 2, Funny

Yes, an while we are at it gat a working EMH/ECH(star-trek voyager) and a mobuile emitor :)
Or a working spell check program ;-)

--
Look where all this talking got us, baby.

Not for the consumer market by Askmum · 2010-11-21 19:58 · Score: 1

Okay, I'm sure some high-end consumers would benefit from this, but I think the majority of consumers will not. The number of multithreaded programs on my Windows computer can be counted on one hand, I think. Java being the major one, if and only if the programmers want to program multithreaded.

At this point in time I'd rather have a dual core 3 GHz processor than a quad or octa core 2 GHz processor.

"Build it and they will come" - NOT by Animats · 2010-11-21 20:34 · Score: 4, Informative

It's an interesting machine. It's a shared-memory multiprocessor without cache coherency. So one way to use it is to allocate disjoint memory to each CPU and run it as a cluster. As the article points out, that is "uninteresting", but at least it's something that's known to work.

Doing something fancier requires a new OS, one that manages clusters, not individual machines. One of the major hypervisors, like Xen, might be a good base for that. Xen already knows how to manage a large number of virtual machines. Managing a large number of real machines with semi-shared memory isn't that big a leap. But that just manages the thing as a cluster. It doesn't exploit the intercommunication.

Intel calls this "A Platform for Software Innovation". What that means is "we have no clue how to program this thing effectively. Maybe academia can figure it out". The last time they tried that, the result was the Itanium.

Historically, there have been far too many supercomputer architectures roughly like this, and they've all been duds. The NCube Hypercube, the Transputer, and the BBN Butterfly come to mind. The Cell machines almost fall into this category. There's no problem building the hardware. It's just not very useful, really tough to program, and the software is too closely tied to a very specific hardware architecture.

Shared-memory multiprocessors with cache coherency have already reached 256 CPUs. You can even run Windows Server or Linux on them. The headaches of dealing with non-cache-coherent memory may not be worth it.

Re:Imagine by AuMatar · 2010-11-21 20:35 · Score: 1, Insightful

Why would you care to see one on your desktop? Do you have any use for one? There's a point where except for supercomputers enough is enough. We've probably already passed it.

--
I still have more fans than freaks. WTF is wrong with you people?

Re:Imagine by seifried · 2010-11-21 20:37 · Score: 5, Informative

Linux can only go to 256 cores.

Uhmm no.

./arch/ia64/Kconfig: int "Maximum number of CPUs (2-4096)"
./arch/powerpc/platforms/Kconfig.cputype: int "Maximum number of CPUs (2-8192)"

In x86 we have:

config MAXSMP
bool "Enable Maximum number of SMP Processors and NUMA Nodes"
depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL

And I believe you can crank that dial all the way up

Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).
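The 7-14 year estimate is just a count of doublings. A quick sketch of the arithmetic, assuming a start at 8 cores and a steady doubling period of one to two years (the function name is made up):

```python
import math

def years_to_reach(target_cores: int, current_cores: int, years_per_doubling: float) -> float:
    """Years until target_cores, if the core count doubles at a steady rate."""
    doublings = math.log2(target_cores / current_cores)
    return doublings * years_per_doubling

# 8 -> 1024 cores is 7 doublings
print(years_to_reach(1024, 8, 1), years_to_reach(1024, 8, 2))  # 7.0 14.0
```

Starting from 6 cores instead of 8 adds a bit less than half a doubling, so the range shifts only slightly.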

I/O and memory bandwidth by francium+de+neobie · 2010-11-21 20:44 · Score: 3, Insightful

Ok, you can cram 1000 cores into one CPU chip - but feeding all 1000 CPU cores with enough data for them to process and transferring all the data they spit out is gonna be a big problem. Things like OpenCL work now because the high end GPUs these days have 100GB/s+ bandwidth to the local video memory chips, and you're only pulling out the result back into system memory after the GPU did all the hard work. But doing the same thing on a system level - you're gonna have problems with your usual DDR3 modules, your SSD hard disk (even PCI-E based) and your 10GE network interface.
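To see how thin the bandwidth gets, divide it up. The sketch below takes the 100GB/s GPU-class figure mentioned above and assumes, optimistically, that it is shared evenly by 1000 cores; the function name and the decimal GB-to-MB conversion are choices made for the example.

```python
def bandwidth_per_core_mb(total_gb_per_s: float, cores: int) -> float:
    """Even share of memory bandwidth per core, in MB/s (1 GB = 1000 MB)."""
    return total_gb_per_s * 1000 / cores

# 100 GB/s spread evenly over 1000 cores
print(bandwidth_per_core_mb(100, 1000))  # 100.0
```

That is 100 MB/s per core even with high-end GPU memory feeding the chip, and ordinary system memory, disks, and network links would leave each core with far less.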

Re:I can put a thousand cores on a chip... by mr_mischief · 2010-11-21 20:49 · Score: 1

Right. The really interesting chips will arrive when you run between four and sixteen cores with the entirety of main RAM for those cores (in a NUMA configuration with other sockets, starting with maybe a gigabyte or so per die). You could then use SDRAM for both a paging file and for cache between the storage system and the processor/memory die.

You could map registers straight to portions of the on-chip memory if necessary for backwards compatibility. You'd probably be better off, though, compiling nearly everything to just use memory addressing. You'd only hit the SDRAM to load a new entire page into the on-chip RAM. On-chip cache and the circuitry to minimize misses in the cache could mostly go away, and the cores themselves could be simplified. You might even get away with moving the SDRAM controller back off-chip at first to free up some space on the die since the working memory would be so fast once the data was in it.

Unfortunately, this assumes billions of switches just for the main memory and probably quality control nightmares in the first several models.

However, it's the logical conclusion for the way forward. Caches keep taking more die space to deal with the fact that memory is so much slower than processors. Once you get over a certain size cache, you're just wasting circuitry on managing a large block of memory in little chunks that's better treated as a large single block of memory. The virtual to physical mapping already figures out what's in main RAM and what's out in the swap. Just let it do that with the on-die memory and eliminate the extra cache logic to make more on-die memory.

Intel has mentioned putting main memory on the die already. They even mentioned that they could do it with a form of DRAM rather than with SRAM.

Deja Vu from a decade ago by Baldrson · 2010-11-21 20:55 · Score: 2, Informative

It seems like I've been here before.

A little while ago you asked Forth (and now colorForth) originator Chuck Moore about his languages, the multi-core chips he's been designing, and the future of computer languages -- now he's gotten back with answers well worth reading, from how to allocate computing resources on chips and in programs, to what sort of (color) vision it takes to program effectively. Thanks, Chuck!

--
Seastead this.

Re:Imagine by visualight · 2010-11-21 20:58 · Score: 2, Interesting

http://www.sgi.com/products/servers/altix/uv/

2,048 cores (256 sockets) and 16TB of memory, one OS image.

--
Samsung took back my unlocked bootloader because Google wants me to rent movies. They're both evil.

This is NOT a cache-coherent/SMP machine! by Terje+Mathisen · 2010-11-21 20:58 · Score: 2, Insightful

The key difference between this research chip and the other Multicore chips Intel have worked on, like Larrabee, is that it is explicitly NOT cache coherent, i.e. it is a cluster on chip instead of a single-image multi-processor.

This means, among many other things, that you cannot load a single Linux OS across all the cores, you need a separate executive on every core.

Compare this with the 7-8 Cell cores in a PS3.

Terje

--
"almost all programming can be viewed as an exercise in caching"

Performance gains from multithreading not clock by perpenso · 2010-11-21 21:00 · Score: 1

Am i the only one feeling this is just a foray into multicore chips because they hit a brick wall when it comes to faster single core CPUs?

For many years (at least 5, possibly more) Intel has been telling developers that future performance gains will come from multithreading not faster clock speeds. So no, you are not the only one feeling this way. :-)

Remember the last couple of times this happened? by Required+Snark · 2010-11-21 21:02 · Score: 5, Informative

This is at least the third time that Intel has said that it is going to change the way computing is done.

The first time was the i432 http://en.wikipedia.org/wiki/Intel_iAPX_432 Anyone remember that hype? Got to love the first line of the Wikipedia article "The Intel iAPX 432 was a commercially unsuccessful 32-bit microprocessor architecture, introduced in 1981."

The second time was the Itanium (aka Itanic) that was going to bring VLIW to the masses. Check out some of the juicy parts of the timeline also over on Wikipedia http://en.wikipedia.org/wiki/Itanium#Timeline

1997 June: IDC predicts IA-64 systems sales will reach $38bn/yr by 2001

1998 June: IDC predicts IA-64 systems sales will reach $30bn/yr by 2001

1999 October: the term Itanic is first used in The Register

2000 June: IDC predicts Itanium systems sales will reach $25bn/yr by 2003

2001 June: IDC predicts Itanium systems sales will reach $15bn/yr by 2004

2001 October: IDC predicts Itanium systems sales will reach $12bn/yr by the end of 2004

2002 IDC predicts Itanium systems sales will reach $5bn/yr by end 2004

2003 IDC predicts Itanium systems sales will reach $9bn/yr by end 2007

2003 April: AMD releases Opteron, the first processor with x86-64 extensions

2004 June: Intel releases its first processor with x86-64 extensions, a Xeon processor codenamed "Nocona"

2004 December: Itanium system sales for 2004 reach $1.4bn

2005 February: IBM server design drops Itanium support

2005 September: Dell exits the Itanium business

2005 October: Itanium server sales reach $619M/quarter in the third quarter.

2006 February: IDC predicts Itanium systems sales will reach $6.6bn/yr by 2009

2007 November: Intel renames the family from Itanium 2 back to Itanium.

2009 December: Red Hat announces that it is dropping support for Itanium in the next release of its enterprise OS

2010 April: Microsoft announces phase-out of support for Itanium.

So how do you think it will go this time?

--
Why is Snark Required?

Cores are not executing x86 instructions by perpenso · 2010-11-21 21:21 · Score: 1

How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel. There is no point in tying these massively parallel architectures to some ancient ISA.

Technically the cores are not executing x86 instructions. For several architectural generations of Intel chips the x86 instructions have been translated into a small efficient instruction set executed by the cores. Intel refers to these core instructions as micro-operations. An x86 instruction is translated on the fly into some number of micro-ops and these micro-ops are reordered and scheduled for execution. So they have kind of done what you ask; the problem is that they don't give us direct access to the micro-op instruction set.
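
As a rough illustration of that translation step, here is a toy Python sketch. The micro-op names (LOAD, STORE) and the tuple format are made up for illustration; Intel does not publish its real micro-op set, so treat this as a cartoon of the idea rather than how any actual decoder works.

```python
# Toy model: split an instruction with a memory operand into simpler micro-ops.
# Micro-op names and the (op, dst, src) format are hypothetical.

def decode(instr):
    """Translate one CISC-style instruction into a list of micro-ops."""
    op, dst, src = instr
    if dst.startswith("["):              # destination is a memory operand
        addr = dst.strip("[]")
        return [
            ("LOAD", "tmp0", addr),      # read the memory operand into a temporary
            (op.upper(), "tmp0", src),   # do the arithmetic on registers only
            ("STORE", addr, "tmp0"),     # write the result back to memory
        ]
    return [(op.upper(), dst, src)]      # register-only form maps to one micro-op

print(decode(("add", "[rbx]", "rax")))
print(decode(("add", "rcx", "rax")))
```

The point is only that one "complex" instruction can become several simple steps that a scheduler is then free to reorder.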

Intel tried to move beyond x86 with the Itanium and the market said no. The market also said no to Alpha and PowerPC, both of which had consumer oriented Windows NT 4 support. Even Apple had to give up on PowerPC and they were part of the PowerPC consortium. There is no Intel x86 conspiracy, they are trapped too.

Re:Imagine by hairyfeet · 2010-11-21 21:27 · Score: 1

Actually I'd say the thing that is scaring the crap out of Intel is that "good enough" was passed for most folks quite a few miles back. I have several customers as well as my GF on late model P4s and you know what? Most of the time those 2.8GHz+ machines are sitting there twiddling their silicon thumbs. The simple fact is Youtube, FB, email, and surfing just don't take that much juice. And I'm sure the fact that those I've been able to upsell to new multicores only did so because AMD is really cheap now certainly don't help Intel none either.

Which brings me to TFA which I'd say just shows how Intel don't seem to see the real problem: The problem is that parallel programming ain't easy and most apps just don't scale well past a couple of cores. There just hasn't been a "killer app" for pushing the masses to true multicore computing. While I know that TFA is directed towards servers pushing major code that really is a small niche compared to the consumer space. What Intel and AMD need to do is find that "killer app" that will get all those running those late model P4s to drop them like a bad habit for the new hotness. Hell I usually have my family on the fast track to new hotness because I like to game, but my boys have been playing MMOs just fine with my P4 hand me downs so I really don't even see a point to upgrading. There really hasn't been any "killer app" to push adoption like we saw in the MHz race. Hell even hardcore gaming (a pretty tiny but tech heavy niche) hasn't really seen any benefits above going dual, with few games gaining in triple much less quad. If Intel and AMD want to push multicores somebody really needs that "killer app" to come out and stat.

--
ACs don't waste your time replying, your posts are never seen by me.

cue kilocore debates by bingoUV · 2010-11-21 21:31 · Score: 2, Interesting

Do 1024 cores constitute a kilocore? Or 1000? I'd love to see that debate move from hard disks to processors.

--
Bingo Dictionary - Pragmatist, n. A myopic idealist.

Re:cue kilocore debates by arkane1234 · 2010-11-22 02:33 · Score: 1

No, but it appears you need a lesson in computing in the x86 world.
Memory has always been counted in 1024 bytes per kilobyte, 1024 kilobytes equaling a megabyte.
This applies to disk, as well. It's a binary translation.
So, by proxy, the person was assuming (jokingly I'd imagine) that the kilo would equal the same number. Thus, 1024 cores equaling 1 kilocore.
If you knew it but were just being snide, then you're just pretentious.

--
-- This space for lease, low setup fee, inquire within!
Re:cue kilocore debates by swilly · 2010-11-22 05:16 · Score: 1

Since about 2000, standards committees have been pushing kibi, mebi, and gibi prefixes for binary usage instead of overloading the decimal prefixes.
Note that kilo = 1024 has mainly been used for memory, and most other computer components use kilo to mean 1000. This is the reason for large discrepancies when reporting hard drive sizes. 1000 is also used for CPU clock frequencies and network speeds (mostly; there are a few odd exceptions).
I think that kibi and company sounds stupid, but reducing ambiguity is a good thing and they are slowly becoming more popular. We are starting to see operating systems report binary sizes using the Ki prefix instead of the K prefix.
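
To put numbers on the hard drive discrepancy, here is a small Python sketch. The 500 GB drive is just an example size picked for illustration.

```python
# Decimal (SI) versus binary (IEC) prefixes.
KILO, KIBI = 10**3, 2**10
GIGA, GIBI = 10**9, 2**30

def advertised_gb_to_gib(gb):
    """Convert a capacity quoted in decimal gigabytes to binary gibibytes."""
    return gb * GIGA / GIBI

print(KIBI - KILO)                           # 24 extra units per "kilo"
print(round(advertised_gb_to_gib(500), 1))   # 465.7
```

The gap is 2.4% at the kilo level but grows to roughly 7% at the giga level, which is why the mismatch is so visible on disks.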
Re:cue kilocore debates by Twinbee · 2010-11-27 10:42 · Score: 1

Glad to hear they're finally beginning to disambiguate between those two horrors! Ideally of course, humanity would be using a base 8 or 16 (though 12 looks good) number system here instead of 10 and scrap any base 10 stuff, but I doubt that will happen for a while yet!

--
Why OpalCalc is the best Windows calc

Re:Imagine by AlecC · 2010-11-21 21:32 · Score: 1

You need a different programming model. Our current imperative programming languages inherently assume a single thread, with multi-threading as a huge lump on the side. In a multi-programming model, something like (say) a compiler would code-generate for every function in parallel without ever being asked explicitly to do so. Each function is a stand alone unit, so they can be done in parallel; in an appropriately designed system they would be done in parallel. In GUI programs, updating separate elements of the display would be done independently without needing to ask for it.

But we need a completely different programming paradigm to achieve this. Functional programming might be that paradigm - or it might not. The point of the chip in the original article is to allow researchers to work on this problem. As the article says, the performance of each core in the chip is very pedestrian. But if researchers can develop software tools that allow the chip to perform at, say, five times the performance of a single core (on a 48 core machine) without programmers having to partition threads explicitly, they will have achieved what the project is about.

--
Consciousness is an illusion caused by an excess of self consciousness.

Re:Imagine by pyalot · 2010-11-21 21:33 · Score: 4, Informative

You're having a supercomputer on your desk right now. It's called a "GPU", and most likely, it sports many hundred cores. Oh, and the killer app you mean, that's whatever latest DX11/Opengl4 game you prefer.

--
Experiments and other stuff

as many cores as needed by __aatirs3925 · 2010-11-21 21:33 · Score: 1

Get me as many cores as needed so Windows will stop pausing to open a folder even on a freshly formatted computer. Instant, instant instant....

Re:as many cores as needed by Vectormatic · 2010-11-22 01:16 · Score: 1

you don't need more cores for that, just a single programmer at microsoft who has half a clue of how to do proper i/o scheduling

--
People, what a bunch of bastards

Consumer oriented products can use many cores by perpenso · 2010-11-21 21:35 · Score: 1

Okay, I'm sure some high-end consumers would benefit from this, I think the majority of consumers will not.

As a game developer I have to say consumers could benefit. And no I am not necessarily thinking about more graphical eye candy. For example I would like to have hundreds of cores working on AI for computer controlled characters/units.

Re:Imagine by pyalot · 2010-11-21 21:36 · Score: 1

"Enough with this silliness already, we don't need a supercomputer! Now let me get back to play my latest DX11 compute enabled game with the awesome physics and graphics."

--
Experiments and other stuff

Re:Imagine by Siffy · 2010-11-21 21:38 · Score: 2, Interesting

Why not? He/she built a cluster for no use at all other than learning and fun. I can easily see the "use" for 1k cores with Intel's apparent interest to get into the 3d market or at least destroy Nvidia and ATI (something AMD has already done in name but that's beside the point). For clusters it's a no-brainer to keep adding cores if you can increase performance per watt ratio with each additional core. For desktops there likely will be a point where enough is enough, but I disagree that we've passed it. Software designers are still keeping up quite quickly with any headroom new hardware creates.

Re:Imagine by bloodhawk · 2010-11-21 22:14 · Score: 2, Informative

Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.

I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available. Linux can only go to 256 cores. Windows 2008 tops out at 64. But hey, if they did come to market, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).

ummm no. Windows 2008 can handle 64 SOCKETS, it currently scales to 256 cores

Re:Imagine by dbIII · 2010-11-21 22:23 · Score: 1

Video editing can use it, photo editing can come close and games that model 3D environments do some trivially parallel stuff where more processing helps.
I want to see this on the desktop so that it drives down prices for cluster nodes for geophysics, FEA etc :)

Re:Windows testing by pinkushun · 2010-11-21 22:23 · Score: 1

This model does allow for 1000 times the BSOD dosage!

Re:Imagine by donaldm · 2010-11-21 22:32 · Score: 1

Why would you care to see one on your desktop? Do you have any use for one? There's a point where except for supercomputers enough is enough. We've probably already passed it.

It depends on what you want to do with those cores. Just running an Office application will not tax more than one or two cores since these types of applications are effectively real time and are not CPU intensive. In many respects games also fall into that category, with many modern games making more use of the graphics processor than the CPUs.

Having multiple cores is very useful when your application is CPU intensive and can fork processes onto as many cores as are available. A simple example of this is a video format converter, which is very CPU intensive rather than I/O intensive. I run the video converter called HandBrake under Fedora 14, which can easily hammer my Intel i7 processor. This raises my load average to over 9 with each core running at approx 90% and you can really feel the heat (approx 90 deg C on the cores) being extracted by the fan.
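
A minimal Python sketch of that "fork onto as many cores as are available" pattern follows. The function names and the arithmetic loop are stand-ins invented for illustration; this is not how HandBrake itself is structured.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def encode_chunk(chunk_id):
    """Stand-in for CPU-heavy work on one independent chunk of video."""
    total = 0
    for i in range(200_000):
        total = (total + i * i) % 1_000_003
    return chunk_id, total

def encode_all(n_chunks, workers=None):
    """Spread independent chunks over worker processes (default: one per core)."""
    with ProcessPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return list(pool.map(encode_chunk, range(n_chunks)))

if __name__ == "__main__":
    results = encode_all(16)
    print(len(results), "chunks done on up to", os.cpu_count(), "cores")
```

This only scales because the chunks do not depend on each other; work that has to share state between chunks would not spread out this cleanly.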

Actually the biggest problem with multiple cores is heat and how to get rid of it as well as latency between processors and memory although according to the article Intel researcher Timothy Mattson has suggested how to get around that problem in a white paper.

--
There ain't no such thing as proprietary standards only proprietary formats. Standards are by definition open.

Real time raytracing of course! by Joce640k · 2010-11-21 22:34 · Score: 2, Informative

Isn't that Intel's pet project for the last decade?

--
No sig today...

Re:Imagine by nikanth · 2010-11-21 22:50 · Score: 2, Informative

Linux can only go to 256 cores. Windows 2008 tops out at 64.

Linux supports more than 256 cores.

MAINLINE:

Maximum number of CPUs / CONFIG_NR_CPUS:

This allows you to specify the maximum number of CPUs which this kernel will support. The maximum supported value is 512 and the minimum value which makes sense is 2. This is purely to save memory - each supported CPU adds approximately eight kilobytes to the kernel image.

I know SGI has systems running 4096 CPUs with SUSE Linux.
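
For a sense of scale, here is the memory arithmetic from that help text as a Python sketch. The 8 KB per CPU figure is the quote's own approximation, and the 4096 row is only an extrapolation of it, not a measured number for any SGI kernel.

```python
# Rough arithmetic from the quoted help text: about 8 KB of kernel image
# per supported CPU (the help text's approximation, not a measurement).
KB_PER_CPU = 8

def image_overhead_kb(nr_cpus):
    """Approximate kernel image growth for a given CONFIG_NR_CPUS value."""
    return nr_cpus * KB_PER_CPU

for nr_cpus in (2, 512, 4096):
    print(nr_cpus, "CPUs ->", image_overhead_kb(nr_cpus) / 1024, "MB")
```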

Re:One question? by couchslug · 2010-11-21 23:06 · Score: 1

"I mean having a 12' tall Toyota Hilux or a 1,000 core computer has to be BYOV, Bring Your Own Vibrator time."

I, for one, find that combination vaguely arousing.

--
"This post is an artistic work of fiction and falsehood. Only a fool would take anything posted here as fact."

Re:Imagine by bertok · 2010-11-21 23:10 · Score: 4, Interesting

depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL

And I believe you can crank that dial all the way up

Also consider this: the number of cores in my desktop is doubling every year or two (and this is with a single core chip), 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years which makes sense because the GHz war is done and simply making more cores is relatively cheap (once you have the interconnect making a bigger CPU isn't all that hard).
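
The 7-14 year figure can be checked with a couple of lines of Python, assuming a start at 8 cores and a steady doubling pace (both taken from the quoted estimate; the function name is just for illustration).

```python
import math

def years_to_reach(target, current, years_per_doubling):
    """Years until the core count reaches target at a fixed doubling pace."""
    doublings = math.log2(target / current)
    return doublings * years_per_doubling

print(years_to_reach(1024, 8, 1))  # 7.0
print(years_to_reach(1024, 8, 2))  # 14.0
```

Going from 8 to 1024 cores is seven doublings, so one to two years per doubling gives the quoted range.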

Don't you worry, the GHz war is not done!

There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.

Re:Imagine by TheRaven64 · 2010-11-21 23:55 · Score: 4, Interesting

Pretty much anything that I've written in Erlang uses (at least) a few thousand concurrent processes. I've never tried running it on more than a 64-core machine, but when I moved stuff from my single-core laptop to a 64-core SGI machine the load was pretty evenly distributed.

It's pretty easy to write concurrent code that scales as long as you respect one rule: No data may be both mutable and aliased. You can do this in object-oriented languages with the actor model, but languages like Erlang enforce it for you (at the cost of a few redundant copies).
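
Here is a minimal sketch of that rule using Python threads and queues. Python does not enforce anything here the way Erlang does; the discipline is by convention, and the names (counter_actor, run) are invented for the example.

```python
import queue
import threading

def counter_actor(inbox, outbox):
    """Owns its state exclusively; the only way in is an immutable message."""
    total = 0                      # mutable, but never aliased outside this thread
    while True:
        msg = inbox.get()
        if msg is None:            # sentinel: report the result and stop
            outbox.put(total)
            return
        kind, amount = msg         # messages are tuples, so senders cannot mutate them later
        if kind == "add":
            total += amount

def run(n_senders=4, per_sender=1000):
    inbox, outbox = queue.Queue(), queue.Queue()
    actor = threading.Thread(target=counter_actor, args=(inbox, outbox))
    actor.start()

    def sender():
        for _ in range(per_sender):
            inbox.put(("add", 1))

    senders = [threading.Thread(target=sender) for _ in range(n_senders)]
    for s in senders:
        s.start()
    for s in senders:
        s.join()
    inbox.put(None)                # all adds are queued before the sentinel
    actor.join()
    return outbox.get()

print(run())  # 4000
```

No lock appears around the counter because only one thread can ever reach it; the senders share nothing mutable with it.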

--
I am TheRaven on Soylent News

In the near future... by rebelwarlock · 2010-11-21 23:59 · Score: 4, Funny

I will need to buy a pair of sunglasses, and crush them when I find that the new Intel processor has over 9000 cores.

Why is 8192 a hard limit? by Joce640k · 2010-11-22 00:11 · Score: 1

Isn't it just a #define in the source code?

--
No sig today...

Re:Why is 8192 a hard limit? by TheRaven64 · 2010-11-22 00:36 · Score: 3, Informative

The kernel needs some data structures per processor. 8192 means it needs a 15-bit index for them. I'm not certain about the Linux kernel, but in other kernels it's quite common for this to be squeezed in to other values for various reasons, so adding more processors requires you to increase the size of other data structures (often ones designed to be exactly one word long). Not impossible, but more effort than just changing a constant.
The reason for the limit in the Windows NT kernel is that various things use bit masks with processor IDs as the indexes. For example, when defining processor affinity set you have an n-bit bitfield (one bit per supported processor), with the bit set if the thread is allowed to run on that processor. At 256 bits (the current limit for Windows), these are already pretty large to scan (especially since the kernel isn't allowed to use SSE instructions, meaning that it's potentially got to be 4 64-bit lsb-tests to find the next core to use).
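
To make the bit mask scan concrete, here is a Python sketch of a 256-bit affinity mask stored as four 64-bit words. The layout and function names are assumptions for illustration, not the actual Windows kernel data structures.

```python
WORD_BITS = 64
MASK_WORDS = 4          # 4 x 64 = 256 processors, the limit mentioned above

def lowest_set_bit(word):
    """Index of the least significant set bit of a non-zero word."""
    return (word & -word).bit_length() - 1

def next_allowed_cpu(mask_words):
    """Scan the words in order; return the first allowed CPU index, or None."""
    for i, word in enumerate(mask_words):
        if word:
            return i * WORD_BITS + lowest_set_bit(word)
    return None

mask = [0, 0, 1 << 5, 0]       # only CPU 2*64 + 5 = 133 is allowed
print(next_allowed_cpu(mask))  # 133
```

In the worst case every word has to be tested before a set bit turns up, which is the "4 64-bit lsb-tests" cost described above.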

--
I am TheRaven on Soylent News
Re:Why is 8192 a hard limit? by TheRaven64 · 2010-11-22 02:13 · Score: 1

Bah! All of the number keys are close together.

--
I am TheRaven on Soylent News

But does it do full-screen flash video? by jonaskoelker · 2010-11-22 00:20 · Score: 1

Obligatory XKCD ref: http://xkcd.com/619/

RISC has downsides... by Otis_INF · 2010-11-22 00:22 · Score: 1

Because of the limited number of instructions, you have more instructions for a logical operation, e.g. multiply (although many RISC CPUs have that operation), so you have to load more bytes from RAM to do the same thing that a CISC instruction does in fewer bytes. As CPU speed vs. RAM / bus speed is skewed, it's more efficient to have instructions which take maybe a few more bits, but on average they don't really take that much more and have microcode on-die to handle them, instead of having to load a lot of RISC instruction bytes from RAM for doing basic operations a CISC can do through microcode. As long as the memory/bus speed is slower than the CPU speed rather than equal to it (like on the PS3, where memory/bus runs at 3GHz, equal to the CPU), RISC isn't always more optimal.

--
Never underestimate the relief of true separation of Religion and State.

Re:RISC has downsides... by kohaku · 2010-11-22 00:41 · Score: 2, Informative

It's more efficient to have instructions which take maybe a bit more bits, but on average they don't really take that much more and have microcode on-die to handle them
Well that would be true, but the really complex x86 instructions are rarely used, so you're not really adding much in the way of code density, and you have to add a lot of hardware complexity to decode it. Not only that, more complex instructions mean bigger pipelines which mean bigger branch penalties.

Re:Imagine by LWATCDR · 2010-11-22 00:43 · Score: 1

Well the killer app really is video transcoding. One thing holding that back is the DMCA. I should have the option to transcode a DVD or Blu-ray and put it on my mobile device, netbook, or tablet as simply as I do CDs. Yes I can get HandBrake but I am talking about with iTunes, Zune, or any other mainstream software package.
What we want is to have that ripped and transcoded in under five minutes.
But other than that you are correct: most users reached good enough a while ago. What everyone but the manufacturers wants is cheaper and more power efficient.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.

For us not at SC10 by Eladith · 2010-11-22 01:10 · Score: 3, Informative

The paper referenced in the article can be found here.

Fascinating that MPI works that well unmodified.

sounds like a coarse-grained FPGA haha! by aaronpeacock · 2010-11-22 01:14 · Score: 1

sounds like a coarse-grained FPGA haha!

Re:Imagine by chrysrobyn · 2010-11-22 01:37 · Score: 5, Interesting

Don't you worry, the GHz war is not done! There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum [wikipedia.org] digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.

The GHz war is over. The speed of light won. A long time ago, it stopped being "all about the transistor" and started being "all about the wires". IBM won the race to copper in 180nm (back when it was 0.18um), and that helped make those technologies even better, but about the time we hit 90nm, semiconductors were "fast enough", or even by some measurements stopped being able to speed up. Since then, almost all speed increases have been largely (but not exclusively) due to the transistors getting smaller, reducing the distance wires need to go.

The RC delay of wires is the major problem. R isn't going to be getting much better than copper. Silver has a lower resistance by a little bit, but it's too reactive to be used anywhere real. In these geometries, any alloy would be insufficiently mixable to be reliable, to say nothing about more exotic materials (like ceramics). There's some room for improvement in the dielectric (the "C"), but by the time you make a box with corners covering water permeability, thermal coefficient of expansion close to the wires, mechanical properties friendly to sub micron manufacturing, you have to concede you're not going to be able to get more than 20% faster there (and that we could dispute separately).

Take a cache. The slowest path is having a memory cell read. That tiny little device needs to have a measurable change in voltage on the bitlines, and be sensed by a sensing structure. That sensing structure has nothing to do with storage, so it's pure overhead and thusly you want as few of them as possible. Can you have it 16 bits away? 32? The days are gone that it was 64 bits away for any meaningful performance. There's nothing you can do to the characteristics of that little device (which needs to be minimum feature size to maximize the density of the cache) to dominate over the characteristics of the bitline he's trying to affect.

Take a data path. Even if 95% of your data is highly predictable, easily pipelined stuff with local signals, your critical path is going to involve signals from other areas of the chip, and they're going to have to be rebuffered and trucked from hundreds of microns away. No giant buffer in the history of man can dominate over a long distance wire. The signal will show up "eventually".

3GHz is a good place to stop. We make it to 4GHz with compromises in power, but beyond that you're dedicating so much of your chip to rebuffering that you're blowing a lot of power on that. At that point, your pipeline is so many stages that branch mispredicts are very painful. You're devoting so much of your cycle time to setup and holds for your latches that you're going to be embarrassed at how little work you can do in each cycle.

1 THz clock speeds are on their way, and maybe even higher. But they're not useful to CPUs or GPUs. They're useful for more exotic applications, primarily technology demonstrations.
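
For a back-of-the-envelope feel for "the speed of light won", this Python sketch computes how far light in a vacuum travels in one clock period. It is only an upper bound: real on-chip signals are limited by the RC delay described above and move much slower than this.

```python
C = 299_792_458  # speed of light in vacuum, m/s

def mm_per_cycle(freq_hz):
    """Upper bound on how far any signal can travel in one clock period."""
    return C / freq_hz * 1000

for ghz in (3, 10, 1000):
    print(ghz, "GHz:", round(mm_per_cycle(ghz * 1e9), 2), "mm per cycle")
```

At 3 GHz the bound is about 100 mm, comfortably more than a die; at 1 THz it is about 0.3 mm, less than the "hundreds of microns" a cross-chip signal has to cover.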

Oracle by mjwx · 2010-11-22 01:38 · Score: 1

I hear a truckload of Kleenex just got delivered to Ellison's office when he heard this news.

--
Calling someone a "hater" only means you can not rationally rebut their argument.

Re:Remember the last couple of times this happened by hythlodayr · 2010-11-22 01:46 · Score: 1

There's a world of difference between massive # of regular cores--which, if harder to program for is well-understood--and the Itanium, which introduced a whole new concept with its EPIC architecture. The EPIC architecture seemed like a good idea--let the compiler take care of most of the instruction re-ordering, and get rid of branch predictions where at all possible by introducing speculative instructions in its stead. But as it turned out, writing a good compiler for this architecture is hard if not impossible...

Re:Imagine by markov_chain · 2010-11-22 01:50 · Score: 1

I don't know, Intel is making money hand over fist selling Xeons used in data center blades. A 1k-core chip would fit in quite nicely there. As for the desktop good-enough stuff, that's what the Atoms are for ;)

--
Tsunami -- You can't bring a good wave down!

Re:Imagine by Eladith · 2010-11-22 01:54 · Score: 1

Erlang with its CSP-style message passing would seem to fit this chip perfectly, as well as Go for example. At least if thread-like constructs of those languages would run across the separate operating system instances on each SCC core.

Distributed Plan 9 system sounds like a good match also. Communication with pipes from core to another should be quite fast and programs can still be built as serial filters. Work on Barrelfish is somewhat interesting too.

But can you keep adding memory links? and IO links by Joe+The+Dragon · 2010-11-22 02:14 · Score: 1

But can you keep adding memory links? and IO links?

1000 cores may be cool, but to make full use of them you may need 6-12+ RAM channels and maybe 2+ QPI links. RAM is sometimes more needed than IO. But if you are working with a lot of data then you may need 1 QPI link just to the SSD bank / RAID system.

SCC by tjlaxs · 2010-11-22 02:29 · Score: 1

According to Intel it's Single-chip Cloud Computer. :P

--
tlax says: "Lol".

Nobody wants to put more cores on a chip? by reiisi · 2010-11-22 02:30 · Score: 1

The article is talking about targeting 1000 cores per chip (in x86 made efficient by fancy translating filters that consume chip real estate faster than Hummers consume gas).

Man, you're insane.

And I guess you don't believe in dust. Or maybe you don't believe testing processors costs money.

--
Computer memory is just fancy paper, CPUs just fancy pens with fancy erasers; the 'net is just a fancy backyard fence.

Re:Nobody wants to put more cores on a chip? by Arlet · 2010-11-22 03:43 · Score: 2, Interesting

The '93 era Pentium they're talking about only has 3 million transistors, and only a fraction are needed to handle the x86 instruction set. Current transistor count goes into the billions, so as far as real estate goes, you can put 1000 Pentium class cores on a single die, despite the x86 translations.
Of course, the whole concept of a 1000 cores running on a single die is only going to serve a small niche of applications.
Re:Nobody wants to put more cores on a chip? by reiisi · 2010-11-26 12:38 · Score: 1

Uhm, yeah, you can put a thousand cores on a single die. (And, if we are talking about ARM cores, you can go for an order of magnitude more.)
You do understand that semiconductor fabs do not test every part that they produce, because of the costs such activities impose on the production processes? You do understand why? and how they get around the problems that spot-checking would seem to induce?
And you do understand that theory doesn't quite match reality?
You do understand the costs in selling a whole die in a package? Or even, say, a quarter die?
You do understand the additional costs imposed by having to test every core on the die? Or, as an alternative, having to build comprehensive self-test functions into the cores? (You don't want just one test circuit for the entire die, that doesn't buy you anything.)
And the power problems ...
When you start talking about 1000 core processors, you are talking about building supercomputers in silicon, and I don't care how seriously confused you are about what a wonderful engineering tradeoff the x86 architecture was/is, if you are going to that kind of trouble, well, ...
The reason you see quite a few x86 clusters in the top 500 list these days (aside from INTEL cheating at the benchmarks, which, yes, they are doing, and it will come around and bite them sooner or later) is that INTEL has captured all the "best" engineers and put patents (of dubious quality) on enough pieces of all the leading edge tech that they can force the rest of the industry to stay one generation behind them.
Oh, and, by the way, the x86 supercomputers are not really executing traditional x86 code, for the most part, any more, so it kind of doesn't make sense to call them x86 in the top 500, really.
That means that the current INTEL x86 are competing against a pack of other CPUs that are a generation behind them. And still failing.
If the rest of the CPUs were receiving the same amount of engineering resources that the x86 is, there would have been no reason for Apple to switch. (Pick up x86/64 as an option for the Macintosh desktop, yes, along with setting Darwin up as a proper project where we could participate on equal terms with Apple, but that's another issue altogether.) The 3GHz barrier would be broken on the desktop, but not by x86.
One good thing that comes out of all this, Apple didn't try to put any of the trimmed-down PPCs in the iPod/Phone. The ARM architecture is getting the attention it deserves, even if the ColdFire (new68k) architecture is unfairly being mostly set aside.

--
Computer memory is just fancy paper, CPUs just fancy pens with fancy erasers; the 'net is just a fancy backyard fence.

Seriously? by The+Hatchet · 2010-11-22 02:35 · Score: 1

How about we make a balance. I say a roughly logarithmic curve of processors/power of each processor ratio. Take 1 very, very powerful core, then 2 cores half the power of that, then 4 cores half the power of those, then 8, then 16, then 32, then 64, then 128, then 256, then 512. At that point, you have over 1000 cores, and have the ability to do anything you want with ridiculous speed and power, be it rendering thousands of simple tasks, or burning through a single mammoth thread, and everything in between.
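
The tier arithmetic can be checked with a short Python sketch. It assumes "power" means throughput relative to the single big core and that it halves exactly at each tier; both are readings of the proposal, not anything from the article.

```python
def tiers(levels=10):
    """Yield (core_count, relative_power) per tier: 1 x 1, 2 x 1/2, 4 x 1/4, ..."""
    for k in range(levels):
        yield 2**k, 1 / 2**k

total_cores = sum(n for n, _ in tiers())
total_power = sum(n * p for n, p in tiers())
print(total_cores)   # 1023
print(total_power)   # 10.0
```

Under those assumptions each tier contributes the same aggregate throughput as the one big core, so ten tiers give 1023 cores and ten times the big core's throughput for perfectly parallel work.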

--
Where is the mod rating for "scary"? Also, ...

Re:Seriously? by Twinbee · 2010-11-22 04:46 · Score: 1

Yes, I've always thought something like that would be theoretically best. In a similar vein, we could do the same for cache/memory. Have tiny amounts right next to the registers, and double the amount of memory for greater caches, but have half of those, and then 4x, and have only a quarter of those ad infinitum, up to the very largest pool of memory. I think something like that could be infinitely scalable, though I don't know too much about CPU designs, so I could be way off base.

--
Why OpalCalc is the best Windows calc
Re:Seriously? by mpsheppa · 2010-11-22 09:45 · Score: 1

Intel is already doing this to some extent with their Turbo Boost technology. You can run one core at the fastest clock rate allowed by the heat envelope of the CPU, two cores slightly slower or four even slower cores. It certainly isn't taking it to the extent that you suggest, but it is heading in that direction.
With their next generation Sandy Bridge architecture they are taking it a bit further and if the cores are cooler because they have been running idle for a while then it lets them run even faster for a short period of time until they heat up again.
It is all about heat these days and in order for all those slow cores to be helpful, they need to have significantly better computation per watt ratio than the big fast cores AND we need OSs and/or hypervisors that understand how to use a mix of fast and slow cores.

compiling out all the conditional paths ... by reiisi · 2010-11-22 02:36 · Score: 1

Still have compiler writers that don't understand that unrolling code is not usually a real win, overall.

And the Itanium was designed for exactly that kind of optimization, as if a compiler is always supposed to be able to predict execution path in real-time execution.

Kind of like the time I tried to write a user interface in CoBOL.

--
Computer memory is just fancy paper, CPUs just fancy pens with fancy erasers; the 'net is just a fancy backyard fence.

Actually, this time they are barking up the right by reiisi · 2010-11-22 02:38 · Score: 1

err, barking up the right tree.

But they are still barking.

x86!

Marketing magic will always prevail over reality!

(That's what Moore's law really said.)

--
Computer memory is just fancy paper, CPUs just fancy pens with fancy erasers; the 'net is just a fancy backyard fence.

Re:Imagine by BitZtream · 2010-11-22 02:40 · Score: 1

and it won't help out playing games or even building kernels.

Then your clustering software sucks ass. I do distributed builds using xcode/xgrid rather often. Of course, with a 1000 cores ... you'd just make -j 2000 and accomplish the same thing without some silly cluster.

, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).

I hope you realize just because you can't buy a boxed version of Windows from Staples/BestBuy/Whatever that supports more processors, versions supporting FAR more processors already exist from Redmond for custom hardware ... which is what you start talking about when you start talking 256 processors, it's pretty much all 'custom' even when it's a pretty generic version of 'custom'

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager

testament to what? by reiisi · 2010-11-22 02:52 · Score: 1, Insightful

I'd call it more of a testament to how much Intel's fanaticism can induce them to waste all the benefits of Moore's law supporting baggage that was unnecessary when the x86 was "invented".

Just for the marketing department's black magic.

Instruction efficiency? Compact code? There are numerous processors that wax the floor with x86 in those departments, but marketing department's black magic killed the market.

Magic? It's all parlor tricks, you know, pay a researcher here to slip a little excess code in a tight loop on that 68k "benchmark", that sort of thing. The problem with the old saw about magic being indistinguishable from advanced tech is that magic is not about real results. Magic is about illusion. The confusing point is that illusion can be turned into reality with some effort.

In the x86 case, it was a huge lot of effort justified by a huge load of hubris and the needs of the black magic department, a vicious cycle.

x86 is a significant contributor to global warming (which is part of the reason some people want to deny the reality of human impact on the climate changes).

--
Computer memory is just fancy paper, CPUs just fancy pens with fancy erasers; the 'net is just a fancy backyard fence.

Intel always by RachelloAmando · 2010-11-22 03:23 · Score: 1

has interesting ideas and it is difficult to follow them :)

Biggest problem and a fix... by Panaflex · 2010-11-22 03:44 · Score: 2, Interesting

IMHO the biggest problem with these multi-core chips is the lock latency. Locking in heap all works great, but a shared hw register of locks would save a lot of cache coherency and MMU copies.

A 1024 slot register with instruction support for mutex and read-write locks would be fantastic.

I'm developing 20+Gbps applications - we need fast locks and low latency. Snap snap!!!

--
I said no... but I missed and it came out yes.

Paraphrasing Torvalds... by menkhaura · 2010-11-22 03:47 · Score: 2, Insightful

Talk is cheap, show me the cores.

--
Stupidity is an equal opportunity striker.
Fellow slashdotter Bill Dog

Re:Imagine by hannson · 2010-11-22 04:03 · Score: 1

No data may be both mutable and aliased

Perhaps a little off topic but do you know an online article that explains this in detail? (I'm writing my first concurrent server at the moment (in Go) and could use any information on the topic)

Re:Imagine by Twinbee · 2010-11-22 04:08 · Score: 1

Is 1 THz the best we can hope to get then?

--
Why OpalCalc is the best Windows calc

Re:Imagine by TheThiefMaster · 2010-11-22 04:19 · Score: 1

As point of comparison, the "Radeon HD 5970" graphics card has two 1600-core processors.

Re:Imagine by TheRaven64 · 2010-11-22 04:22 · Score: 2, Interesting

I don't know what extra detail you need - the rule should be pretty self explanatory. If something is shared between two or more threads, it should be immutable. If something is mutable, only one thread / process should hold references to it.

The only exception to this rule is explicitly synchronised communication objects (message queues, process handles, and suchlike). If you follow this rule, then the only concurrency problems that you will have are caused by high-level design problems, rather than by low-level implementation problems.

Erlang enforces this by only having one mutable object: the process dictionary, which is only accessible by the process that owns it. Everything else is immutable.
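A minimal sketch of the rule in Go, since that is what the question mentions. Everything here is illustrative and not from any real server: Config stands for data that is shared and therefore never modified after construction, and the counts map is mutable but reachable from exactly one goroutine, which other goroutines talk to over a channel (the "explicitly synchronised communication object").

```go
package main

import "fmt"

// Config is shared between goroutines, so it is treated as immutable:
// it is built once and only read afterwards.
type Config struct {
	Greeting string
}

// request is a message sent to the goroutine that owns the counters.
type request struct {
	key   string
	reply chan int // carries the updated count back to the caller
}

// startCounter launches the single goroutine that owns the mutable map.
// No other goroutine ever holds a reference to counts.
func startCounter() chan<- request {
	reqs := make(chan request)
	go func() {
		counts := make(map[string]int) // mutable, but not aliased
		for r := range reqs {
			counts[r.key]++
			r.reply <- counts[r.key]
		}
	}()
	return reqs
}

// hit asks the owner goroutine to increment key and returns the new count.
func hit(reqs chan<- request, key string) int {
	reply := make(chan int)
	reqs <- request{key: key, reply: reply}
	return <-reply
}

func main() {
	cfg := Config{Greeting: "hello"} // shared, read-only from here on
	reqs := startCounter()
	done := make(chan struct{})
	for i := 0; i < 4; i++ {
		go func() {
			hit(reqs, cfg.Greeting)
			done <- struct{}{}
		}()
	}
	for i := 0; i < 4; i++ {
		<-done
	}
	fmt.Println(hit(reqs, cfg.Greeting)) // prints 5: four earlier hits plus this one
}
```

Note that Go does not enforce any of this the way Erlang does; the discipline is by convention only.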

--
I am TheRaven on Soylent News

Re:Imagine by Culture20 · 2010-11-22 04:24 · Score: 1

diamond's not very exotic. just sayin'

Whats the point if Photoshop is only 2 processors by cdpage · 2010-11-22 04:42 · Score: 2, Insightful

Photoshop has been stuck at 2 processors for way too long. Software companies have been lagging behind hardware far too long. Until I see more software taking advantage of more than 1 or 2 cores... I'm not wasting money on them.

Benchmarks by Chemisor · 2010-11-22 04:47 · Score: 2, Insightful

According to benchmarks, a functional language like Erlang is slower than C++ by an order of magnitude. Sure, it can distribute processing over more cores, which is the only thing that enabled it to win one of the benchmarks. I suspect that was only because it used a core library function that was written in C. So no, if you want to write code with acceptable performance, DON'T use a functional language. All CPU intensive programs, like games, are written in C or C++; think about that.

Re:Benchmarks by Reservoir+Penguin · 2010-11-22 05:26 · Score: 1

Most real CPU-intensive applications are written in Fortran :) And your example is better used to show that C and friends are not well suited for the kind of things this chip is made for, since most games can barely use 2 cores.

--
US-UK-Israel: The real Axis of Evil
Re:Benchmarks by Chemisor · 2010-11-22 05:57 · Score: 1

> Most real CPU-intensive applications are written in Fortran
Nobody outside academia uses Fortran. At least, nobody under retirement age. And that same site also has a comparison of Fortran with C++, and C++ wins hands down. In one case, C++ is 26 times faster.
Re:Benchmarks by Chemisor · 2010-11-22 12:06 · Score: 1

> I don't know what's more stupid, your generalization of all functional languages by using Erlang as an example
And it's stupid because Erlang is not a functional language? Or is it just because the conclusion does not support your prejudices? Will you switch to promoting other languages until you find one that supports your argument?
> Look at Haskell, OCaml, or F# for much faster comparisons
Yes, do look. The benchmark page I linked to has all those too, and they are all just as slow. Get out of your academic ivory tower already; nothing beats C++ in speed.
> or your comparison to video games which isn't even relevant - both because of the types of hardware games target
Oh, please! Games target the very type of hardware that 99.99999% people have. Maybe you'll be able to find some weird LISP-in-hardware computer which runs a functional language faster than C++, but who in hell is going to care about it? For the hardware pretty much everybody has, C++ is the clear choice, as the benchmarks show.
> An Erlang or Haskell program can (and will) make calls to C-linkage libraries.
Oh, so you admit that you know your functional languages are so dead slow you need C for the important parts? And then you have the gall to talk about how fast functional languages are? Gees... If you wrote your program in C++ to begin with, you wouldn't need to link to stuff written in anything else.
And yes, there are reasons other than speed for using something other than C++; you should not be writing shell scripts in it, for instance. But when you need a general purpose language, C++ is the only sane choice.
Re:Benchmarks by Twinbee · 2010-11-27 10:26 · Score: 1

You say not to write shell scripts in it, but surely there's a library (Boost maybe?) that provides a wrapper to make it just as easy as any other language?

--
Why OpalCalc is the best Windows calc

Re:Imagine by mikael · 2010-11-22 05:21 · Score: 1

Problem is, those extremely complex NURBS surfaces have all sorts of trims around the perimeter and in the surface itself to match bolt holes, and other components. It's not just a simple regular square or triangular patch.

Calculating the local tangent space of each point of a regular or N-sided patch isn't too difficult (tangent, normal, binormal); it's all the trimming that takes up the time. Just a single bolt with a spiral thread is going to generate a whole bucketload of triangles per revolution of that thread.

Another complication is that the CATIA file format isn't simply a geometry file; it's more of a relational database entry, where everything is cross-referenced to the manufacturer, specification standard, measurement units and required sub-parts. That way, you just have one file, and it pulls in everything else that you need to view that one part.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads

Re:Remember the last couple of times this happened by mikael · 2010-11-22 05:36 · Score: 1

I always remember the Intel i860. Another attempt to create a graphics processor (or coprocessors as they were called back then). It had special instructions for performing combined Z-buffer and color-buffer tests and writes, as well as vector processor instructions. They made it into early SGI workstations.

The full-page glossy advert pages of BYTE magazine used to have these pictures of really impressive (at the time) systems with transputer/i860/TMS34020 boards. Some with their own network and hard disk drive ports (the PC was too slow at the time to handle the data transfer). But every time a board came out, six months later, CPUs would have caught up and these boards/chips would become known as "graphics deaccelerators".

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads

Re:Imagine by PingPongBoy · 2010-11-22 05:39 · Score: 2, Funny

Why would you care to see one on your desktop? Do you have any use for one?

You got that right. I've never used more than 639 K of RAM either.

--
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.

Re:Imagine by iinlane · 2010-11-22 07:08 · Score: 1

I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available.

After 1k core machine becomes commonplace it will be required to boot windows.

Not just message passing, but synchronous-only by wazzap123 · 2010-11-22 07:12 · Score: 1

There's a good analysis over at the daily circuit that dissects what is being suggested by Intel, who is not just talking about the death of within-node shared memory, but also of intercore asynchronous message passing. I liked the comparison with sending emails with large video attachments, instead of YouTube links, and then requiring the recipient to clear the inbox before a new email can be received. http://www.dailycircuitry.com/2010/11/intel-talks-kilocore-processors.html
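A toy illustration of the synchronous versus asynchronous distinction, using Go channels as a stand-in. This is an analogy only and says nothing about what Intel's hardware actually does; trySend is a made-up helper. An unbuffered channel behaves like synchronous passing (the send only completes if a receiver is there to take it), while a one-slot buffered channel behaves like the inbox that must be cleared before the next message fits.

```go
package main

import "fmt"

// trySend reports whether msg could be handed over without blocking.
func trySend(inbox chan string, msg string) bool {
	select {
	case inbox <- msg:
		return true
	default:
		return false
	}
}

func main() {
	rendezvous := make(chan string) // unbuffered: sender and receiver must meet
	mailbox := make(chan string, 1) // buffered: a one-slot inbox

	fmt.Println(trySend(rendezvous, "a")) // false: nobody is receiving right now
	fmt.Println(trySend(mailbox, "a"))    // true: the slot was empty
	fmt.Println(trySend(mailbox, "b"))    // false: the inbox is full
	<-mailbox                             // the recipient clears the inbox
	fmt.Println(trySend(mailbox, "b"))    // true: there is room again
}
```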

Re:Imagine by iinlane · 2010-11-22 07:15 · Score: 1

Never say never; there's a whole third dimension to explore. I have no doubt in my mind there will be a solution for these problems, but to get there we must think outside the box.

Re:Imagine by Reservoir+Penguin · 2010-11-22 08:11 · Score: 1

Pretty amazing achievement for a Microsoft OS. Does the 64 socket limit include named pipes or is it just TCP/IP sockets?

--
US-UK-Israel: The real Axis of Evil

Re:Imagine by hannson · 2010-11-22 11:00 · Score: 1

Thanks a lot for the clarification.

Re:Imagine by pyalot · 2010-11-22 11:19 · Score: 1

I don't know what you're responding to, but it's certainly an interesting view, I take it you speak of tessellation shading. As it happens I did a bit of tinkering around with that (http://codeflow.org/entries/2010/nov/07/opengl-4-tessellation/). It's true that generic displacement mapping is more expensive, since you need to calculate the smoothed mesh before you can displace it. However, even though it makes things near the point of view more expensive to display (but also better looking), it's a great boon at scaling detail down smoothly, so it actually ends up being faster and better looking (because most further away things have just the detail required, and aren't hugely overdrawn).

--
Experiments and other stuff

Re:Imagine by Vegemeister · 2010-11-22 13:44 · Score: 2, Funny

Dammit dude. Blow the dust out of your case.

Wing Commander by lymond01 · 2010-11-22 17:52 · Score: 1

I remember installing Wing Commander on a Pentium processor. Normally it ran on a 486. It sped up the game. By about 20 times. You launched and you were half a map away from the combat before you could turn around. When you were pointed at it you held down the trigger and flew through microsecond-long explosions. Then you were half the map away again. You got used to it though.

You'd sort of expect that, with all the processor enhancements since, Microsoft Office would open faster than in 1995. But you know what, that speed of opening scaled fairly well -- just a few seconds then, a few seconds now. Not sure what happened with Office XP. I'm thinking 1000 cores won't save my Firefox from taking up 500 MB of memory so I'm still out of luck there.

Yeesh. by gottabeme · 2010-11-22 18:27 · Score: 1

Patience, young whatever-you-are.

No, you're right: Intel should go ahead and start building a one million-core chip now. We need it now to...uh....

--
"Those who consume the bulk of goods are those who make them. We must never forget this secret of our prosperity."

Re:Imagine by hairyfeet · 2010-11-22 20:10 · Score: 1

Nope, sorry, not even close. How many multicore CPUs were sold by Crysis? My guess very damned few, maybe even none, as those that bought Crysis were just like my "Must win teh benchmarkz lol!" ePeen customers.

No, what I'm talking about is something like what Visicalc did in the 80s, or video playback in the 90s. Both of these were jobs that A) large masses of people wanted to do, and B) large masses could see instant benefit from.

The simple fact is the big games right now are NOT ePeen games, but MMOs. And those run just fine on a late model P4 with a $50 AGP card. Thanks to PC gaming development being tied to the consoles, which look like they may go another 5 years without a refresh, gaming simply is no longer the app that drives technology, and as proof see Eyefinity and CUDA. Both techs were cooked up as desperate attempts to push GPUs that simply wouldn't sell otherwise. This is also why the "sweet spot" in terms of sales is no longer the $250 GPU, but the $100 one. There simply isn't enough content requiring the $250 one to make it worth the extra expense for the majority.

So if Intel and AMD want true multicore processing for the masses then they need to be pushing for that next killer app, one which will spur adoption. Because right now there simply isn't anything I've seen that would upsell most folks on it. Just look at how many shitty Pentiums and Atoms Intel sells each year VS their top o' the line Core series. I bet their cheapo shitty chips outsell the good chips by an order of magnitude, simply because on day to day apps most folks won't tell the difference. And GPUs are highly specialized vector processors, and TFA is talking about x86, so the comparison isn't even apt.

--
ACs don't waste your time replying, your posts are never seen by me.
