Panic in Multicore Land
MOBE2001 writes "There is widespread disagreement among experts on how best to design and program multicore processors, according to the EE Times. Some, like senior AMD fellow, Chuck Moore, believe that the industry should move to a new model based on a multiplicity of cores optimized for various tasks. Others disagree on the ground that heterogeneous processors would be too hard to program. The only emerging consensus seems to be that multicore computing is facing a major crisis. In a recent EE Times article titled 'Multicore puts screws to parallel-programming models', AMD's Chuck Moore is reported to have said that 'the industry is in a little bit of a panic about how to program multicore processors, especially heterogeneous ones.'"
Well, the most recent research into how the cortext works has some interesting leads on this. If we first assume that the human brain has a pretty interesting organization, then we should try to emulate it.
Recall that the human brain receives a series of pattern streams from each of the senses. These patterns streams are in turn processed in the most global sense--discovering outlines, for example--in the v1 area of the cortext, which receives a steady stream of patterns over time from the senses. Then, having established the broadest outlines of a pattern, the v1 cortext layer passes its assessment of what it saw the outline of to the next higher cortex layer, v2. Notice that v1 does not pass the raw pattern it receives up to v2. Rather, it passes its interpretation of that pattern to v2. Then, v2 makes a slightly more global assessment, saying that the outline it received from v1 is not only a face but a face of a man it recognizes. Then, that information is sent up to v4 and ultimate to the IT cortex layer.
The point here is important. One layer of the cortex is devoted to some range of discovery. Then, after it has assigned some rudimentary meaning to the image, it passes it up the cortex where a slightly finer assignment of meaning is applied.
The takeaway is this: each cortex does not just do more of the same thing. Instead, it does a refinement of the level below it. This type of hierarchical processing is how multicore processors should be built.
Can I have... errr... Two floating point, one generic math with extra cache and two RISC's.
If you mod this up, your slashdot background will turn into a beautiful sunset!
I've been doing some scientific computing on the Cell lately, and heterogeneous cores don't make life very easy. At least with the Cell.
The Cell has one PowerPC core ("PPU"), which is a general purpose PowerPC processor. Nothing exotic at all about programming it. But then you have 6 (for the Playstation 3) or 8 (other computers) "SPE" cores that you can program. Transferring data to/from them is a pain, they have small working memories (256k each), and you can't use all C++ features on them (no C++ exceptions, thus can't use most of the STL). They also have poor speed for double-precision floats.
The SPEs are pretty fast, and they have a very fast interconnect bus, so as a programmer I'm constantly thinking about how to take better advantage of them. Perhaps this is something I'd face with any architecture, but the high potential combined with difficult constraints of SPE programming make this an especially distracting aspect of programming the Cell.
So if this is what heterogeneous-cores programming means, I'd probably prefer the homogeneous version. Even if they have a little less performance potential, it would be nice to have a 90%-shorter learning curve to target the architecture.
The idea of having to use Microsoft APIs to program future computers because the vendors only document how to get DirectX to work doesn't exactly thrill me. I think panic is perhaps too strong a word, but sheesh...
> the speed the user experiences has not improved much.
User experience is not a useful metric for performance, unless you consider media encoding , decoding and rendering. 10 years ago I was running a P166, what kind of framerates would I get with a modern game using a software renderer? What kind of framerates would I get for decoding a HD video stream?
Do you seriously think a 12 year old P166 will provide a comparative user experience to a modern 8 core 3GHz machine? You're putting it down to "the OS's that run on them", which is interesting since user mode x86 emulation with QEMU runs W2K faster on my laptop than on the hardware I ran it on back in 1999.
As I demonstrated in my thesis a parallel application can be shown to have certain critical and less critical parts. An optimal processing platform matches those requirements. The remainder of the platform will remain idle and burn away power for nothing. One should wonder what is better: a 2 GHz processor or 2x 1 GHz processors. My opinion is that, if it has no impact on performance, the latter is better.
There is an advantage to a symmetrical platform: you cannot misschedule your processes. It does not matter which processor takes a certain job. On a heterogeneous system you can make serious errors: scheduling your video process on your communications processor will not be efficient. Not only is the video slow, the communications process has to wait a long time (impacting comm. performance).
nosig today
When we wrote the OpenAMQ messaging software in 2005-6, we used a multithreading design that lets us pump around 100,000 500-byte messages per second through a server. This was for the AMQP project.
Today, we're making a new design - ØMQ, aka "Fastest. Messaging. Ever." - that is built from the ground up to take advantage of multiple cores. We don't need special programming languages, we use C++. The key is architecture, and especially an architecture that reduces the cost of inter-thread synchronization.
From one of the ØMQ whitepapers:
We don't get linear scaling on multiple cores, partly because the data is pumped out onto a single network interface, but we're able to saturate a 10Gb network. BTW ØMQ is GPLd so you can look at the code if you want to know how we do it.
My blog
If you have 80 or more cores, I'd rather have 20 of them support specialty functions and be able to do them very fast (it would have to be a few (1-3) orders of magnitude faster than the general counterpart) and the rest do general processing. This of course needs the support of operating systems, but that isn't very hard to get. With 80 cores caching and threading models have to be rethought, especially caching - the operating system has to be more involved in caching than it currently is, because otherwise cache coherency won't be able to be done.
.NET will be much more popular as it is possible for them to take care a lot of these issues, which isn't or only possible in a limited way for languages like C and friends with static source code inspection.
This also means that programs will need to be written not just by using threads, "which makes it okay for multi-core", but with cpu cache issues and locality in mind. I think VMs like JVM, Parrot and
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
This trend with multiple cores on the CPU is only an intermediate phase,
because it over saturates the memory bus, which is easy to remedy by
putting the cores on the memory chips, of which there are a number
comparable to the number of cores.
In other words, the CPUs will disappear, and there will be lots of smaller
core/memory chips, connected in a network. And they will be cheaper as well,
because they do not need so high a yeld.
Kim0
I'm curious how having specialized multi-core processors is different from having a single-core processor with specialized subunits. Ie, a single core x86 chip has a section of it devoted to implementing MMC, SSE, etc. Isn't having many specialized cores just a sophisticated way of re-stating that you have a really big single-core processor, in some sense?
You see? You see? Your stupid minds! Stupid! Stupid!
Every object (in the general sense, not necessarily the OO sense) may be either aliased or mutable, but not both.
Erlang does this by making sure no objects are mutable. This route favours the compiler writer (since it's easy) and not the programmer. I am a huge fan of the CSP model for large projects, but I'd rather keep something closer to the OO model in the local scope and use something like CSP in the global scope (which is exactly what I am doing with my current research).
I am TheRaven on Soylent News
I have read many times that some algorithms are difficult or impossible to multi-thread. I envisage the next logical step is a two socket motherboard, where one socket could be used for a 8+ core cpu running at low clock rate (e.g. 2-3Ghz) and another socket for a single core running at the greatest frequency achievable to the manufacturing process (e.g. x2 to x4 the clock speed of the multi-core) with whatever cache size compromises are required.
This help get around yield issues of getting all cores to work at a very high frequency and the related thermal issues . This could be a boon to general purpose computer that have a mix of hard to multi-thread and easy to multi-thread programs - assuming the OS could be intelligent on which cores the tasks are scheduled on. The cores could or could not have the same instruction sets, but having the same instruction sets would be the easy first step.
This also goes to the bloat - as programmers have typically stopped optimizing code. Thus there are more lines of code in delivered software - often having more and more abstraction layers in them, which doesn't help either. So the overall effect is that the software takes longer to do the same function.
In the end, despite the increase in processing power, the programs run as slow or slower than before. Numerous reasons for it. The GP of my original post in this thread is still correct.
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
It has been quite obvious to several people in the usenet news:comp.arch newsgroup that the future should give us chips that contain multiple cores with different capabilites:
:-(
As long as all these cores share the same basic architecture (i.e. x86, Power, ARM), it would be possible to allow all general-purpose code to run on any core, while some tasks would be able to ask for a core with special capabilites, or the OS could simply detect (by trapping) that a given task was using a non-uniform resource like vector fp, mark it for the scheduler, and restart it on a core with the required resource.
An OS interrupt handler could run better on a short pipeline in-order core, a graphics driver could use something like Larrabee, while SPECfp (or anything else that needs maximum performance from a single thread would run best on an Out-of-Order core like the current Core 2.
The first requirement is that Intel/AMD must develop the capability to test & verify multiple different cores on the same chip, the second that Microsoft must improve their OS scheduler to the point where it actually understands NUMA principles not just for memory but also cpu cores. (I have no doubt at all that Linux and *BSD will have such a scheduler available well before the time your & I can buy a computer with such a cpu in it!)
So why do I believe that such cpus are inevitable?
Power efficiency!
A relatively simple in-order core like the one that Intel just announced as Atom delivers maybe an order of magnitude better performance/watt than a high-end Core 2 Duo. With 16 or 80 or 256 cores on a single chip, this will become really crucial.
Terje
PS As other posters have noted, keeping tomorrow's multi-core chips fed will require a lot of bandwith, this is neither free nor low-power.
"almost all programming can be viewed as an exercise in caching"
Two is totally doable. I can fill two (or the equivalent of two) of my four cores.
Trouble is, filling four cores is quite a bit more iffy.
I'd like to ask a few related questions from a developer's point of view :
1) Is there a programming language that tries to make programming for multiple cores easier?
2) Is programming for parallel cores the same as parallel programming?
3) Is anybody aware of anything in this direction on the C++ front that does not rely on OS APIs?
Find a job you like and you will never work a day in your life.