Intel Says to Prepare For "Thousands of Cores"

Posted by ScuttleMonkey on Wednesday July 2, 2008 @08:42AM from the viva-la-coding-revolucion dept.

Impy the Impiuos Imp writes to tell us that in a recent statement Intel has revealed its plans for the future, and they go well beyond the traditional processor model. Suggesting developers start thinking about tens, hundreds, or even thousands of cores, it seems Intel is pushing for a massive evolution in the way processing is handled. "Now, however, Intel is increasingly 'discussing how to scale performance to core counts that we aren't yet shipping...Dozens, hundreds, and even thousands of cores are not unusual design points around which the conversations meander,' [Anwar Ghuloum, a principal engineer with Intel's Microprocessor Technology Lab] said. He says that the more radical programming path to tap into many processing cores 'presents the "opportunity" for a major refactoring of their code base, including changes in languages, libraries, and engineering methodologies and conventions they've adhered to for (often) most of their software's existence.'"

15 of 638 comments

Re:Not Sure I'm Getting It by Delwin · 2008-07-02 08:46 · Score: 5, Informative

Because each core is no longer task switching. Once you have more cores than tasks you can remove all the context switching logic and optimize the cores to run single processes as fast as possible.

Then you take the tasks that can be broken up over multiple cores (Ray Tracing anyone?) and fill the rest of your cores with that.
Already Happening by sheepweevil · 2008-07-02 08:55 · Score: 3, Informative

Supercomputers already have many more than thousands of cores. The IBM Blue Gene/P can have up to 1,048,576 cores. What Intel is probably talking about is bringing that level of parallel computing to smaller computers.
Re:Ok.. so how do I do that? by Phroggy · 2008-07-02 09:02 · Score: 4, Informative

A year or so ago, I saw a presentation on Threading Building Blocks (TBB), which is basically an API thingie that Intel created to help with this issue. Their big announcement last year was that they've released it as open source and have committed to making it cross-platform. (It's in Intel's best interest to get people using TBB on Athlon, PPC, and other architectures, because the more software is multi-core aware, the more demand there will be for multi-core CPUs in general, which Intel seems pretty excited about.)

--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
Re:That's all well and good..... by pimpimpim · 2008-07-02 09:26 · Score: 3, Informative

Bingo. The problem is there. I've followed an introductory course on parallel programming (not saying I'm an expert, though), and while the idea of multiple processor programming is fairly simple, the implementation is amazingly difficult and painful.
Example: the "race condition". Say processor one is trying to find the optimal value of variable A, and processor two is doing something different but calls some subfunction which changes variable A. Then processor one might keep on running forever.
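A minimal sketch of a simpler cousin of that race, in Python (my own illustration, with made-up names like `unsafe_increment`, `safe_increment` and `run_workers`): several threads do a read-modify-write on the same variable. Without a lock, one thread can overwrite another's update; with a lock, the result is always the same.

```python
import threading

def unsafe_increment(shared, n):
    # Read-modify-write without a lock: another thread can write between
    # the read and the write, so updates may be lost.
    for _ in range(n):
        value = shared["a"]
        shared["a"] = value + 1

def safe_increment(shared, n, lock):
    # Same work, but the lock makes each read-modify-write atomic.
    for _ in range(n):
        with lock:
            shared["a"] += 1

def run_workers(target, args, workers=4):
    # Start `workers` threads running `target(*args)` and wait for all of them.
    threads = [threading.Thread(target=target, args=args) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

unsafe = {"a": 0}
run_workers(unsafe_increment, (unsafe, 10_000))
# unsafe["a"] can end up anywhere up to 40000; it depends on thread timing.

safe = {"a": 0}
run_workers(safe_increment, (safe, 10_000, threading.Lock()))
# safe["a"] is 4 * 10_000 on every run.
```

The nasty part is that the unlocked version often produces the right answer too, which is exactly why these bugs are hard to reproduce.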
The other main problem is the deadlock: processor one needs the final result of variable B to calculate variable A, but processor two needs the final result of variable A to calculate B. Both processors come to a standstill, and the program hangs forever.
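Here is a rough Python sketch of that standstill using two locks instead of two variables (an illustration only; `opposite_order`, `same_order` and the 0.2 second timeout are my own choices). Each thread grabs one lock and then wants the other one. A timeout is used so the demo reports the deadlock instead of hanging, and the second half shows the usual cure: everybody takes the locks in the same order.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
barrier = threading.Barrier(2)
got_second = {}

def opposite_order(name, first, second):
    with first:
        barrier.wait()                     # both threads now hold their first lock
        ok = second.acquire(timeout=0.2)   # a plain acquire() would block forever here
        got_second[name] = ok
        barrier.wait()                     # keep holding `first` until both attempts are over
        if ok:
            second.release()

threads = [
    threading.Thread(target=opposite_order, args=("one", lock_a, lock_b)),
    threading.Thread(target=opposite_order, args=("two", lock_b, lock_a)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Neither thread could get its second lock: got_second has False for both.

finished = []

def same_order(name):
    # The fix: every thread takes lock_a before lock_b, so no cycle can form.
    with lock_a:
        with lock_b:
            finished.append(name)

threads = [threading.Thread(target=same_order, args=(n,)) for n in ("one", "two")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Keeping one global lock order is easy with two locks in one file. It is a lot harder across hundreds of modules.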
For simple programs, these things are relatively easy to troubleshoot. But for your huge program package with hundreds of modules, it is almost impossible to know what is happening.
Actually, it is the duty of Intel and co. to find a way to prevent these situations, but even there: what kind of genius is able to program an automated debugger that can find deadlocks and race conditions?

--
molmod.com - computing tips from a molecular modeling
Re:Not Sure I'm Getting It by Brian+Gordon · 2008-07-02 09:43 · Score: 3, Informative

Are you crazy? Context switches are the slowdown in multitasking OSes.
Re:Not Sure I'm Getting It by k8to · 2008-07-02 09:56 · Score: 5, Informative

True but misleading. The major cost of task switching is a hardware-derived one. It's the cost of blowing caches. The swapping of CPU state and such is fairly small by comparison, and the cost of blowing caches is only going up.

--
-josh
Re:The thing's hollow - it goes on forever by kdemetter · 2008-07-02 10:26 · Score: 3, Informative

2001: A Space Odyssey, by Arthur C. Clarke.
Great book.

--
Slipping shoelaces ?
Re:Not Sure I'm Getting It by blahplusplus · 2008-07-02 10:50 · Score: 4, Informative

"Because each core is no longer task switching. Once you have more cores than tasks you can remove all the context switching logic and optimize the cores to run single processes as fast as possible.
Then you take the tasks that can be broken up over multiple cores (Ray Tracing anyone?) and fill the rest of your cores with that."
Unfortunately all this is going to lead to bus and memory bandwidth contention; you're just shifting the burden from one point to another. Although there is a 'penalty' for task switching, there is an even greater bottleneck at the bus and memory bandwidth level.
IMHO Intel would have to release a CPU on a card with specialized RAM chips and segment the RAM like GPUs do to get anything out of multicore over the long term. RAM is not keeping up, and the current architecture for PC RAM is awful for multicore. CPU speed is far outstripping bus and memory bandwidth. I am quite dubious of multi-core architecture; there are fundamental limits to the geometry of circuits. I'd be sinking my money into materials research, not gluing cores together and praying CS and math guys come up with solutions that take advantage of it.
The whole history of human engineering and tool use is to take something extremely complicated, offload the complexity, and compartmentalize it so that it's manageable. I see the opposite happening with multi-core.
Re:Databases and implementation-neutrality by Shados · 2008-07-02 10:54 · Score: 4, Informative

By "a lot of processing can potentially be converted into DB queries", what you discovered is functional programming :) LINQ in .NET 3.5/C# 3.0 is an example of functional programming that is made to look like DB queries, but it isn't the only way. It is a LOT easier to convert that stuff and optimize it for the environment (like how SQL is processed), since it describes the "what" more than the "how". It is already done, and one (out of many examples) is Parallel LINQ, which smartly executes LINQ queries in parallel, optimized for the number of cores, etc. (And I'm talking about LINQ in the context of in-memory processing, not LINQ to SQL, which simply converts LINQ queries into SQL ones.)
Functional programming, tied with the concept of transactional memory to handle concurrency, is a nice medium-term solution to the multi-core problem.
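To make the "what" versus "how" point concrete, here is a small sketch in Python rather than C# (so this is an analogue, not Parallel LINQ itself; `is_prime` and the input range are made up for the example). The query is written as a map plus a filter, and switching from sequential to parallel only changes who runs the map.

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    # Pure function: no shared state, so calls can run in any order.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

numbers = range(1, 50)

# Sequential version: describes what we want, one element at a time.
sequential = [n for n in numbers if is_prime(n)]

# Parallel version: same description, but the executor decides how to
# spread the calls over its workers. map() returns results in input order.
with ThreadPoolExecutor() as pool:
    flags = list(pool.map(is_prime, numbers))
parallel = [n for n, flag in zip(numbers, flags) if flag]
```

In CPython, threads will not actually speed up CPU-bound work like this; `ProcessPoolExecutor` offers the same `map` interface and uses separate processes. The point here is only that the query itself does not change.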
Re:Not Sure I'm Getting It by skulgnome · 2008-07-02 10:57 · Score: 5, Informative

No. I/O is the slowdown in multitasking OSes.
Re:Not Sure I'm Getting It by kramerd · 2008-07-02 12:23 · Score: 3, Informative

Girls like it when you buy them things. Or when you pretend to listen. And when you shower.
Re:Not Sure I'm Getting It by kesuki · 2008-07-02 12:46 · Score: 3, Informative

"Take, for instance, the huge success of mp3's. There was a time not so long ago when people were limited to playing music off a physical CD. This wasn't because there was no desire amongst computer users to listen to digital files that could be stored locally or streamed off the internet. It was because computer users did not know yet that they had the desire to do it. But technology advanced to the point where a) processors became fast enough to decode mp3's in real time without using the whole CPU"
I started making MP3s with a 486 DX 75 MHz.
I could decode in real time on a 486 DX 75; as I recall, encoding took a bit of time, and I only had a 3 GB HDD that had been an upgrade to the system...
MP3 is an asymmetric codec: encoding takes more CPU than decoding. If your MP3 player doesn't run correctly on a 486, it's because they designed in features not strictly needed to decode an MP3 stream.
Oh hey, I have an RCA Lyra MP3 player that isn't even as fast as a 486, but the decoder was designed for MP3 decoding.
Ogg decoding needs a beefier decoder; that's half the problem with getting Ogg support in devices not made for decoding video streams.

--
https://www.gnu.org/philosophy/free-sw.html
Re:Not Sure I'm Getting It by Salamander · 2008-07-02 13:23 · Score: 5, Informative

"Because each core is no longer task switching. Once you have more cores than tasks you can remove all the context switching logic and optimize the cores to run single processes as fast as possible."
OK, so now the piece that's running on each core runs really really fast . . . until it needs to wait for or communicate with the piece running on some other core. If you can do your piece in ten instructions but you have to wait 1000 for the next input to come in, whether it's because your neighbor is slow or because the pipe between you is, then you'll be sitting and spinning 99% of the time. Unfortunately, the set of programs that decompose nicely into arbitrarily many pieces that each take the same time (for any input) doesn't extend all that far beyond graphics and a few kinds of simulation. Many, many more programs hardly decompose at all, or still have severe imbalances and bottlenecks, so the "slow neighbor" problem is very real.
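The arithmetic behind that 99% figure, as a tiny Python sketch (the function name `busy_fraction` is just for illustration, and it assumes every instruction costs the same):

```python
def busy_fraction(work, wait):
    # Fraction of time a core spends doing useful work when each step
    # costs `work` instruction slots and then stalls for `wait` slots.
    return work / (work + wait)

busy = busy_fraction(10, 1000)   # 10 / 1010, just under 1% busy
idle = 1 - busy                  # 1000 / 1010, about 99% idle
```

To get a core even half busy in this model, the wait has to shrink to the size of the work itself.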
Many people's answer to the "slow pipe" problem, on the other hand, is to do away with the pipes altogether and have the cores communicate via shared memory. Well, guess what? The industry has already been there and done that. Multiple processing units sharing a single memory space used to be called SMP, and it was implemented with multiple physical processors on separate boards. Now it's all on one die, but the fundamental problem remains the same. Cache-line thrashing and memory-bandwidth contention are already rearing their ugly heads again even at N=4. They'll become totally unmanageable somewhere around N=64, just like the old days and for the same reasons. People who lived through the last round learned from the experience, which is why all of the biggest systems nowadays are massively parallel non-shared-memory cluster architectures.
If you want to harness the power of 1000 processors, you have to keep them from killing each other, and they'll kill each other without even meaning to if they're all tossed in one big pool. Giving each processor (or at least each small group of processors) its own memory with its own path to it, and fast but explicit communication with its neighbors, has so far worked a lot better except in a very few specialized and constrained cases. Then you need multi-processing on the nodes, to deal with the processing imbalances. Whether the nodes are connected via InfiniBand or an integrated interconnect or a common die, the architectural principles are likely to remain the same.
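A toy Python sketch of that discipline, under obvious assumptions: threads in one process still share memory physically, so this only shows the programming style (private data plus explicit messages), not the hardware. The names `node`, `chunks` and `results` are invented for the example.

```python
import queue
import threading

def node(node_id, local_data, outbox):
    # Each node touches only its own private data...
    partial = sum(local_data)
    # ...and shares the result by sending an explicit message.
    outbox.put((node_id, partial))

data = list(range(1000))
num_nodes = 4
# Give every node its own slice of the data up front.
chunks = [data[i::num_nodes] for i in range(num_nodes)]

results = queue.Queue()
threads = [
    threading.Thread(target=node, args=(i, chunks[i], results))
    for i in range(num_nodes)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Collect exactly one message per node and combine them.
partials = dict(results.get() for _ in range(num_nodes))
total = sum(partials.values())
```

No node ever reads or writes another node's data, so there is nothing to lock and nothing to thrash. The cost moves into the messages instead.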
Disclosure: I work for a company that makes the sort of systems I've just described (at the "integrated interconnect" design point). I don't say what I do because I work there; I work there because of what I believe.

--
Slashdot - News for Herds. Stuff that Splatters.
Re:Not Sure I'm Getting It by Erich · 2008-07-02 15:17 · Score: 5, Informative
Single Address Space is horrible.
It's a huge kludge for idiotic processors (like ARM9) that don't have physically-tagged caches. On all non-incredibly-sucky processors, we have physically tagged caches, so having every app have its own address space, or having multiple apps share physical pages at different virtual addresses, is fine.
Problems with SAS:
- Everything has to be compiled position-independent, or pre-linked for a specific location
- Virtual memory fragmentation as applications are loaded and unloaded
- Where is the heap? Is there one? Or one per process?
- COW and paging get harder
- People start using it and think it's a good idea.
Most people... even people using ARM... are using processors with physically-tagged caches. Please, Please, Please, don't further the madness of single-address-space environments. There are still people encouraging this crime against humanity.
Maybe I'm a bit bitter, because some folks in my company have drunk the SAS kool-aid. But believe me, unless you have ARM9, it's not worth it!
--
-- Erich
Slashdot reader since 1997
Re:The thing's hollow - it goes on forever by dryeo · 2008-07-02 17:00 · Score: 5, Informative

And before they made it into a movie it was an interesting short story. http://en.wikipedia.org/wiki/The_Sentinel_(short_story)
If you'd like to read it, seems it is this PDF, http://econtent.typepad.com/TheSentinel.pdf

--
https://en.wikipedia.org/wiki/Inverted_totalitarianism