Windows and Linux Not Well Prepared For Multicore Chips
Mike Chapman points out this InfoWorld article, according to which you shouldn't immediately expect much in the way of performance gains from Windows 7 (or Linux) from eight-core chips that come out from Intel this year. "For systems going beyond quad-core chips, the performance may actually drop beyond quad-core chips. Why? Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores. Problem? The development tools aren't available and research is only starting."
Give us a year maybe two.
http://www.infoworld.com/archives/emailPrint.jsp?R=printThis&A=/article/09/03/20/Multicore_chips_pose_next_big_challenge_for_industry_1.html
So basically yet another tech writer finds out that a huge number of applications are still single threaded, and that it will be a while before we have applications that can take advantage of the cores that the OS isn't actively using at the moment. Well, assuming you're running a desktop and not a server.
This isn't a performance issue with Windows or Linux; they're quite adept at handling multiple cores. They just don't need that much themselves, and the applications people run these days, individually, don't need much more than that either.
So yes, applications need parallelization. The tools for it are rudimentary at best. We know this. Nothing to see here.
Multiple virtual machines on the same piece of metal, with a workstation hypervisor, and intelligent balancing of apps between backends.
Multiple OSes sharing the same cores. Multiple apps running on the different OSes, and working together.
Which can also be used to provide fault tolerance... if one of the worker apps fails, or even one of the OSes fails, your processor capability is reduced, a worker app in a different OS takes over, use checkpointing procedures, and shared state, so the apps don't even lose data.
You should even be able to shutdown a virtual OS for windows updates without impact, if the apps that arise get designed properly...
...programmers are to blame for that
The development tools aren't available and research is only starting."
Stupid programmers! Not able to develop software without the tools! In my day we wrote our own tools - in the snow, uphill, both ways! We didn't need no stink'n vendor to do it for us - and we liked it that way!
Firstly, it's false on the face of it: Ubuntu is certified on the Sun T2000, a 32-thread machine, and Canonical is supporting it.
Secondly, it's the same FUD as we heard from uniprocessor manufacturers when multiprocessors first came out: this new "symmetrical multiprocessing" stuff will never work, it'll bottleneck on locks.
The real problem is that some programs are indeed badly written. In most cases, you just run lots of individual instances of them. Others, for grid, are well-written, and scale wonderfully.
The ones in the middle are the problem, as they need to coordinate to some degree, and don't do that well. It's a research area in computer science, and one of the interesting areas is in transactional memory.
That's what the folks at the Multicore Expo are worried about: Linux itself is fine, and has been for a while.
--dave
davecb@spamcop.net
I guess you could read it and find out...
seriously?
-Bucky
Languages like PHP/Perl, as a rule, are not designed for threading - at ALL. This makes multi-core performance a non-starter. Sure, you can run more INSTANCES of the language with multiple cores, but you can't get any single instance of a script to run any faster than what a single core can do.
I have, so, so, SOOOO many times wished I could split a PHP script into threads, but it's just not there. The closest you can get is with (heavy, slow, painful) forking and multiprocess communication through sockets or (worse) shared memory.
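To make the "more INSTANCES" workaround concrete, here is a minimal sketch in Python rather than PHP. It assumes the job can be cut into independent ranges; the `WORKER` one-liner and `sum_with_instances` are hypothetical names made up for illustration. Each chunk runs in its own interpreter process, so the OS is free to put each one on a different core, and the parent only collects the partial results.

```python
import subprocess
import sys

# Hypothetical worker: sums the integers in the half-open range [lo, hi).
WORKER = (
    "import sys; lo, hi = int(sys.argv[1]), int(sys.argv[2]); "
    "print(sum(range(lo, hi)))"
)

def sum_with_instances(n, instances=4):
    """Sum 0..n-1 by splitting the range across separate interpreter processes."""
    if n <= 0:
        return 0
    step = -(-n // instances)  # ceiling division, so at most `instances` chunks
    procs = []
    for lo in range(0, n, step):
        hi = min(lo + step, n)
        # Start every instance before waiting on any, so they run side by side.
        procs.append(subprocess.Popen(
            [sys.executable, "-c", WORKER, str(lo), str(hi)],
            stdout=subprocess.PIPE, text=True))
    total = 0
    for proc in procs:
        out, _ = proc.communicate()
        total += int(out)
    return total
```

The cost the comment complains about is visible here: every instance pays for a full interpreter start-up, and the only communication channel is text over a pipe.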
Truth be told, there's a whole rash of security issues through race conditions that we'll soon have crawling out of nearly every pore as the development community slowly digests multi-threaded applications (for real!) in the newly commoditized multi-CPU environment.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
"The development tools aren't available and research is only starting"
Hardly. Erlang's been around 20 years. Newer languages like Scala, Clojure, and F# all have strong concurrency. Haskell has had a lot of recent effort in concurrency (www.haskell.org/~simonmar/papers/multicore-ghc.pdf).
If you prefer books there's: Patterns for Parallel Programming, the Art of Multiprocessor Programming, and Java Concurrency in Practice, to name a few.
All of these are available now, and some have been available for years.
The problem isn't that tools aren't available, it's that the programmers aren't preparing themselves and haven't embraced the right tools.
Too bad BeOS died. One of the axioms the developers had was 'the machine is a multi processor machine', and everything was built to support that.
Seems like they were 15 years ahead of their time. But, on the other hand, too late to establish another OS in a saturated market. Pity, really.
get a mac..
I assume you're talking about Mac OS X 10.6 (Snow Leopard), whose Grand Central framework is supposed to add some tools to make Mac-exclusive multithreaded apps easier to program.
imagine software being developed for imaginary or speculatory hardware.
I think Sun called it "Java". It was run on emulators long before ARM and others came out with hardware-assisted JVMs such as Jazelle.
The quote presented in the summary is nowhere to be found in the linked article. To make matters worse, the summary claims that linux and windows aren't designed for multicore computers but the linked article only claims that some applications are not designed to be multi-threaded or running multiple processes. Well, who said that every application under the sun must be heavily multi-threaded or spawning multiple processes? Where's the need for a email client to spawn 8 or 16 threads? Will my address book be any better if it spans a bunch of processes?
The article is bad and timothy should feel bad. Why is he still responsible for any news being posted on slashdot?
Is TFA talking about Linux or Windows threading and scheduling not being good enough for 4+ cores (so your programs, no matter how well designed, will not benefit from more cores), about it being damn hard to split, thread and join tasks, or both?
I understood the article to refer to the latter. The programming languages that are popular for desktop applications as of the 2000s don't have the proper tools (such as an unordered for-each loop or a rigorous actor model) to make parallel programming easy.
The /. summary of TFA is almost exquisitely bad. It's not Windows or Linux that's not ready for multicore (as both have supported multi-processor machines for on the order of a decade or more), but rather the userspace applications that aren't ready. The reason is simple: Parallel programming is rather hard, and historically most ISVs haven't wanted to invest in it because they could rely on the processors getting faster every year or two... but no longer.
One area where I disagree with TFA is the claimed paucity of programming models and tools. Virtually every OS out there supports some kind of concurrent programming model, and often more than one depending on what language is used -- pthreads, Win32 threads, Java threads, OpenMP, MPI or Global Arrays on the high end, etc. Most debuggers (even gdb) also support debugging threaded programs, and if those don't have enough heft, there's always Totalview. The problem is that most ISVs have studiously avoided using any of these except when given no other choice.
--t
"My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores.
So what? If I had a 32 core system, at least each running process (even if single-threaded) could have a core just for itself. Only a few basic applications (such as a browser) really need to be designed for multiples threads.
Most programs barely use any computational power; in fact there are very few programs that require all that computing power to operate, and those are certainly well designed.
Home users do use some apps that could benefit from multiple cores. Video encoding is one of them, but that one is embarrassingly parallel because the encoder could just split the video into quadrants and have each of four cores work on one quadrant.
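A rough sketch of the quadrant idea, under loud assumptions: the "frame" is a plain list of pixel rows, and `brighten` is a made-up stand-in for the per-quadrant encoding work (a real encoder is far more involved). All function names are hypothetical. Note that in CPython a thread pool will not keep four cores busy on pure-Python arithmetic; passing `ProcessPoolExecutor` as `executor_cls` is the usual way to get real parallelism there.

```python
from concurrent.futures import ThreadPoolExecutor

def brighten(tile, amount=10):
    """Stand-in for per-quadrant work: brighten every pixel, clamped to 255."""
    return [[min(255, p + amount) for p in row] for row in tile]

def split_quadrants(frame):
    """Return [top-left, top-right, bottom-left, bottom-right]."""
    mid_h, mid_w = len(frame) // 2, len(frame[0]) // 2
    top, bottom = frame[:mid_h], frame[mid_h:]
    return [
        [row[:mid_w] for row in top], [row[mid_w:] for row in top],
        [row[:mid_w] for row in bottom], [row[mid_w:] for row in bottom],
    ]

def join_quadrants(tl, tr, bl, br):
    """Stitch the four processed quadrants back into one frame."""
    return [l + r for l, r in zip(tl, tr)] + [l + r for l, r in zip(bl, br)]

def process_frame(frame, executor_cls=ThreadPoolExecutor):
    # The quadrants share no state, so each one can go to its own worker.
    with executor_cls(max_workers=4) as pool:
        tl, tr, bl, br = pool.map(brighten, split_quadrants(frame))
    return join_quadrants(tl, tr, bl, br)
```

The point is only that no quadrant needs anything from the other three, which is what makes the split "embarrassingly" parallel.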
Hey, at least we aren't dealing with the lovely world of Cyrix anymore... those were truly fun times with respect to compiler optimizations (or lack thereof, as it turned out). That and the, um, heat "issues."
512 MB RAM, 20 GB disk, 200 GB transfer, five datacenters. $19.95/month.
What good are multiple cores and threads when you are running event driven GUI application?
Mozilla Firefox is an event-driven GUI application. But if I open a page in a new tab, a big reflow or JavaScript run in that page can freeze the page I'm looking at. You can see this yourself: open this page in multiple tabs, and then try to scroll the foreground page. If Firefox used a thread or process per page like Google Chrome does, the operating system would take care of this. Other applications need to spawn threads when calling an API that blocks, such as gethostbyname() or getaddrinfo(); otherwise, the part of the program that interacts with the user will freeze. But these are the kind of threads that are useful even on a single core, not multicore-specific optimizations.
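For the blocking-API case, a minimal Python sketch (not how any particular browser does it): the lookup is handed to a worker thread and the caller gets a future back straight away, so the event loop can keep servicing the user. `resolve_async` and the module-level `_pool` are hypothetical names; `socket.getaddrinfo` is the real blocking call, and the `resolver` parameter exists only so a different blocking function can be swapped in.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Small shared pool for blocking calls; the size is an arbitrary assumption.
_pool = ThreadPoolExecutor(max_workers=2)

def resolve_async(host, port, resolver=socket.getaddrinfo):
    """Start a blocking lookup on a worker thread and return a Future at once."""
    return _pool.submit(resolver, host, port)

# Typical use from a UI thread (commented out because it needs a resolver):
#   future = resolve_async("example.org", 80)
#   ... keep handling events ...
#   addresses = future.result()   # blocks only here, when the answer is needed
```

As the comment says, this pays off even on one core, because the waiting thread is not using the CPU anyway.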
Seriously, no one has brought up functional programming, LISP, Scala or Erlang? When you use functional programming, no data changes and so each call can happen on another thread, with the main thread blocking when (& not before) it needs the return value. In particular, Erlang and Scala are specifically designed to make the most of multiple cores/processors/machines.
See also map-reduce and multiprocessor database techniques like BSD and CouchDB (http://books.couchdb.org/relax/eventual-consistency).
If you spend more time assigning blame than you do describing the problem, then clearly you don't have anything insightful to say.
"With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea...."
RFC 1925
If you don't believe me, pull out a profiler and run it on one of your programs, it will show you where things can be easily sped up.
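For anyone who hasn't pulled out a profiler before, a small sketch using Python's standard `cProfile` and `pstats` modules. The two toy functions and `profile_report` are hypothetical names invented for the example; the profiler calls themselves are the stock library API.

```python
import cProfile
import io
import pstats

def square(x):
    return x * x

def slow_sum_of_squares(n):
    """Deliberately naive: one function call per element."""
    total = 0
    for i in range(n):
        total += square(i)
    return total

def profile_report(func, *args, top=5):
    """Run func under the profiler; return (result, text report of the hot spots)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()
```

Calling `profile_report(slow_sum_of_squares, 100000)` returns the sum plus a table listing how many times `square` ran and how much time went into it, which is usually enough to see where the cheap wins are.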
Now, given that the performance of most programs is not processor bound
That's a pretty big leap, I think.
Yes, a lot of today's apps are more user bound than anything. But there are plenty of real-world apps that people use that are still pretty processor bound - Photoshop, and image processing in general, is a big one. So can be video, which starts out disk bound but is heavily processor bound as you apply effects.
Even Javascript apps are processor bound, hence Chrome...
So there's still a big need for understanding how to take advantage of more cores - because chips aren't really getting faster these days so much as more cores are being added.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Part of the problem is that tools do very little to help break programs down into parallelizable tasks. That has to be done by the programmer, they have to take a completely different view of the problem and the methods to be used to solve it. Tools can't help them select algorithms and data structures. One good book related to this was one called something like "Zen of Assembly-Language Optimization". One exercise in it went through a long, detailed process of optimizing a program, going all the way down to hand-coding highly-bummed inner loops in assembly. And it then proceeded to show how a simple program written in interpreted BASIC(!) could completely blow away that hand-optimized assembly-language just by using a more efficient algorithm. Something similar applies to multi-threaded programming: all the tools in the world can't help you much if you've selected an essentially single-threaded approach to the problem. They can help you squeeze out fractional improvements, but to really gain anything you need to put the tools down, step back and select a different approach, one that's inherently parallelizable. And by doing that, without using any tools at all, you'll make more gains than any tool could have given you. Then you can start applying the tools to squeeze even more out, but you have to do the hard skull-sweat first.
And the basic problem is that schools don't teach how to parallelize problems. It's hard, and not everybody can wrap their brain around the concept, so teachers leave it as a 1-week "Oh, and you can theoretically do this, now let's move on to the next subject." thing.
I disbelieve this entirely. UNIX/Linux is well designed for multiple core CPUs. Just take the whole single program, single small job approach of a pipeline command and you have your multicore solution ready. Programs that can make use of tasks that are IO bound are frequently written with threading in mind. qmail/apache are both well written for multiple core CPUs. I don't see what the article is trying to say. It's clearly wrong.
Why UNIX?
it is the answer to the question that no one asked...
In a real world application, as others have mentioned, pretty much all of a program's time is spent in an idle loop waiting for something to happen, and in almost all circumstances that something is input from the user in whatever form: mouse, keyboard, etc.
So let's say it is something like Final Cut. To be sure, when someone kicks off a render, that is an operation that can be spun off on its own thread or its own process, freeing up the main process loop to respond to other things that the user might be doing. But where the rubber really hits the road is user input. The user could do something that affects the work that was just spun off, whether it runs as a separate thread or process on the same core or on any number of other cores, so you have to keep track of what the user is doing in the context of things that have been farmed out to other cores/processes/threads.
Enter the OS. Take your pick, since it really does not matter which OS we are talking about; they all do the same basic things, perhaps differently, but they do. How does an OS designer make sure any of, say, 16 cores (dual 8-core processors) are actually well and fairly utilized? Would it be designed to use a core to handle each of the main functions of the OS, let's say Drive Access, Com Stack (pick your protocol here), Video Processing etc., or should it just run a scheduler like those that they now run, which farms out thread processing based on priority? Is there really any priority scheme for multiple cores that could run, say, hundreds of threads/processes each? And what about memory? A single core machine that is, say, truly 64 bit can handle a very large amount of memory, and that single core controls and has access to all that RAM at its whim (DMA notwithstanding). But what do you do now that you have 16 cores all wanting to use that memory? Do we create a scheduler to schedule access from 16 different demanding stand-alone processors, or do we simply give each core a finite memory space and then have to control the movement of data from each memory space to another? After all, a single thread (handling the main UI for a program) has to be aware of when something is finished on one core and then get access to that memory to present results, either as data written to, say, a file or written into video memory for display.
I submit that the current paradigm of SMP is inadequate for these tasks and must be rethought to take advantage of this new hardware. I think a more efficient approach is that each core detected would be fired up with its own monitor stack as a place to start so that the scheduling is based upon the feedback from each core. The monitor program would be able to ensure that the core it is responsible for is optimized for the kind of work that is presented. This concept while complicated could be implemented and serve as a basis for further development in this very complex space.
In terms of "super computers" this has been dealt with, but with a very different methodology that I do not think lends itself to general computing. Deep Blue, Crays and things like that aren't really relevant in this case, since those are mostly very custom designs to handle a single purpose and are optimized for things like Chess, Weather Modeling or Nuclear Weapons study, where the problems are already discretely chunked out with a known set of algorithms and processes. General purpose computing, on the other hand, is like trying to herd cats from the OS point of view, since you never really know what is going to be demanded and how.
OS designers and user space software designers need to really break this down and think it all the way through before we get much further, or all this silicon is not going to be used well or efficiently.
Hey KID! Yeah you, get the fuck off my lawn!
The idea of an OS and/or support tools handling the SMP problem is nothing more than a crutch for bad programming.
In fact, anyone who grew up with a real multithreaded, multitasking OS is already writing code that will scale just dandy to 8 cores and beyond. When you accept that a thread is nothing more or less than a typical programming construct, you simply write better code. This is no more or less an amazing thing than when regular programmers embraced subroutines or structures.
This was S.O.P. back in the late 80s under the AmigaOS, and enhanced in the early/mid 90s under BeOS. This is not new, and not even remotely tied to the advent of multicore CPUs.
The problem here is simple: UNIX and Windows. Windows had fake multitasking for so long, Windows programmers barely knew what you could do when you had "thread" in the same toolkit as "subroutine", rather than it being something exotic. UNIX, as a whole, didn't even have lightweight preemptive threads until fairly recently, and UNIX programmers are only slowly catching up.
However, neither of these is even slightly an OS problem... it's an application-level problem. If programmers continue to code as if they had a 70s-vintage OS, they're going to think in single threads and suck on 8-core CPUs. If programmers update themselves to state-of-the-1980s thinking, they'll scale to 8-cores and well beyond.
-Dave Haynie
Multithreading is a system-level thing, not a language level thing.
Sure, there have been languages that make threading ubiquitous, but they've never caught on, and it's hardly necessary.
You'll notice that internet, graphics, and many other programming necessities are not built into C/C++ either. They are higher level functions, and thousands of programmers have no problem understanding C's role here. People have been writing multithreading code in C/C++ for decades... I've personally done it from the 80s until now, under a dozen or so OSs.
Don't use your chosen language as a crutch for sticking to the level of programming practiced when that language debuted. The whole point of C was not to define much of anything in C itself.. in truth, the language proper doesn't even do I/O... that's handled via a library. So is threading, so is graphics, etc.
-Dave Haynie
That's incorrect, at least in part. Modern MacOS is based on CMU's Mach, which has had lightweight threading support since long before Apple got into the picture. The OS was completely designed for multiple CPUs, down to the very core.
If modern MacOS apps are not heavily multithreaded (I have no idea, I don't run proprietary hardware anymore, regardless of the OS), that's the fault of programmers not advancing past the days of MacOS 9... it has nothing whatsoever to do with the OS.
-Dave Haynie
Apple have no 2 core intel systems. Period.
Even the lowly Mac mini is a dual-core system. Every laptop is a dual-core system. The Mac Pro is either 4-core (with hyperthreading for a virtual 8-core) or 8-core (with hyperthreading for a virtual 16-core) system.
"Better to keep silent and look the fool, rather than speak and remove all doubt"
Simon.
Physicists get Hadrons!
Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores.
How many times do we have to tell that Linux *IS* the fscking kernel??
Given that, including Linux and Windows in the same bag doesn't make sense. Which makes the entire post m00t.
Solutions:
1) s/Windows/Windows NT kernel/
2) s/Linux/GNU\/Linux/
Nice try to get a battle though.
There's not even a way in the C or C++ core language to start a new thread. And with many different third party libraries, there'll never be a reliable standard way to do it.
Never? A standard, reliable way to do it will be part of C++0x - so that's hardly "never"...
Unix has for ages run on multi CPU systems. And it does this well. And with easy tools you can harvest the power of all CPUs: the pipe
Every part of the pipe can run on another CPU.
I recently came across fslint, which is an example of heavily piped shell.
In short (leaving out the parameters and options) it runs
find | sort | tr | sort | bash | merge_hardlinks| uniq | sort | cut | tr | bash | xargs | sort | uniq | cut | sort | tr | xargs | sort | uniq | cut | sort |tr | xargs | sort | uniq | cut | bash | sort
That's a lot of CPUs :-)
OK, it's not a great example of CPU-hungry programs. But the progress of modern programming languages, which tend to be monolithic beasts meant to do everything (perl, php, java), leads to programs not using pipes or other types of inter-process communication because it's just cumbersome.
The pipe concept enables multi CPU programming without even thinking about how to put tasks on different processors.
Unfortunately I have not found a language which sets such a simple concept as the fundamental programming principle.
See the unix shell, without the pipe you can't really do much.
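For what it's worth, the same pipe plumbing can be driven from a general-purpose language, if somewhat clumsily. A minimal Python sketch, assuming a POSIX system with `sort` and `uniq` on the PATH; `run_pipeline` is a hypothetical helper name. Each stage is a separate OS process, so the kernel can schedule the stages on different CPUs just as it does for a shell pipeline.

```python
import subprocess

def run_pipeline(commands, input_text):
    """Chain commands like `a | b | c`; every stage runs as its own process."""
    procs = []
    for i, cmd in enumerate(commands):
        # The first stage reads from us; later stages read the previous stdout.
        stdin = subprocess.PIPE if i == 0 else procs[-1].stdout
        proc = subprocess.Popen(cmd, stdin=stdin, stdout=subprocess.PIPE, text=True)
        if i > 0:
            # Drop our copy so the upstream stage gets SIGPIPE if this one exits.
            procs[-1].stdout.close()
        procs.append(proc)
    # Fine for small inputs; large inputs would need a writer thread to avoid
    # filling the pipe buffer before anything is read back.
    procs[0].stdin.write(input_text)
    procs[0].stdin.close()
    output = procs[-1].stdout.read()
    for proc in procs:
        proc.wait()
    return output
```

`run_pipeline([["sort"], ["uniq"]], "b\na\nb\na\n")` does the job of `sort | uniq` on that input. Compare the amount of ceremony with the one-line shell version, which is rather the point being made above.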
Atari rules... ermm... ruled.
I'm not sure I totally agree that Haskell is the future, although I do think that functional programming right now looks to be the most promising way to deal with multi-cores. Scala has some very strong points that could see its adoption beat the others, specifically being able to run in the Java JVM and make use of existing Java libraries. You can use the functional aspects of Scala when you need to, but still use Java where you do not need parallelism.
Knuth's maxim is sufficiently pithy to have become, over time, self referential, as evidenced by your misunderstanding.
The root of all evil used to be deep and singular, now it is broad and shallow. I guarantee you that Knuth did not include choosing the best fundamental algorithm under the label "premature" unless it involves squabbling over log log N terms or stray digits in the exponent term.
http://www.siam.org/pdf/news/174.pdf
An unpacked (deoptimized) version of Knuth's maxim is that the transition from program structure and notation which maximizes readability, comprehension, and conviction (concerning its correctness and merit) to one which favours performance should be delayed as long as possible. Ideally until performance becomes the sole remaining success factor.
(Taking into account the human mind's special capacity to imprint upon evil, Knuth's formulation remains the better one.)
Originally Knuth meant manually hoisting loop constant expressions (often in ways that later turn out to not be fully general) or manually evaluating constant expressions or manually fusing nested function calls and the kind of rot that a good compiler these days will do on your behalf. Anyone used the "register" keyword lately? Once upon a time it seemed like a good idea.
While the principle remains the same, the temptations have changed. Such as parallelizing a bad implementation of a poor algorithm in the misguided belief that the underlying task is not sequentially bound.
That said, projects which do *no* evil typically fail to impress anyone. The ideal is to wrap large amount of cleanly structured and accessible source code around a nugget of pure, smoldering evil, coked to the last clock cycle.
Perversely, the worst example of this is TeX itself. The smoldering nugget of pure evil is the single pass parsing regime and data packing eight bit character values.
I suspect the literature on parallel programming would roughly equal the literature on electro-chemical storage cells. Sheesh, if only those guys were paying attention, we'd have watch batteries powering small cities by now.
On second thought, how much literature could there really be if you can summon the majority of it onto your screen in 4/10'ths of a second for any combination of keywords?
Parallel programming is a lot like fuel cells. You get some pretty impressive results on selected applications involving pristine apparatus in a controlled setting, dating back to the Apollo program (in both cases).
Reality on the ground is rarely so forgiving.
If we hadn't already achieved a pixel processing speed-up between 1980 and 2008 best approximated by a sideways 8, Javascript wouldn't even have entered the conversation.
It boils down to this: ignoring everything you guys have already accomplished, you've pretty much done nothing. I worked for that kind of company once. The guy in charge put on a Cirque du Soleil of intestinal recursion. That's how I feel about the claim that software developers haven't been paying attention to parallelism for elephant years.
It's really quite frustrating to see posts like this. Posts that don't take into account what is needed and focus on what we are incapable of doing - even when they don't need to.
So let's look at reality for a second. First, most modern OSes scale very, very far past 4 CPUs (not sure what Windows scales to, but Linux certainly has no limitation based on current CPU reality). So the kernels are just dandy for multi-core CPUs, bring it on! 128 cores, we're ready for ya!
The same is not true at the application level, and that is a fair comment. But don't confuse Linux and Windows with their apps, for crying out loud! From an application point of view we are capable of parallel coding, but it's non-trivial. It's also not something we need a lot of the time.
For instance, we now buy servers (our cheapest models) with dual CPUs and quad cores, and we're tending to virtualise each one into several machines with 1 or 2 CPUs each. Whether you do this because you assume the OS will utilise one CPU and the apps will utilise another (as one person told me) is irrelevant. Suffice it to say, having 2 CPUs is usually quite nice.
But what requires more than that in reality? Well, your desktop might - after all, there's a lot of things going on at once, right? In some cases, that's true (there are quite a number of very heavy applications out there, and surprise surprise, they can multitask *GASP*).
Same at the server: not many things require that many CPUs, and even at the application level we've gotten good at spreading heavily loaded applications across multiple servers (we call it load balancing, was that too sarcastic?). Take mail (whether it's Exchange or postfix or sendmail or whatever), or web servers, etc. Those server applications that do require heavy grunt tend to already be coded with "parallel" in mind, even across multiple servers (think Oracle RAC).
As for cache contention - well it sounds like the hardware makers are finally fess'ing up to the fact they have a problem, Houston!
As has already been explained, non-sequential thinking is hard. You postulate double speed, BUT consider the producer thread: the app finishes and hands off the buffer to the OS to send to the GPU, and you say it threads this. Well, fine, so the threaded part can run on another core, but then the hardware DMAs the data and waits for a GPU interrupt/done-queue ack, so how does this speed things up on multicore? Not at all. Someone has to set up the DMA and wait, not run, while it completes, so unless all cores are at 100% you have saved nothing, and you have created additional overhead spawning a new thread.
Duh, Marketing Departments
I think I agree with you, BUT... don't fall into the old trap: If ten machines can do the job in 1 month, 1 machine can do the job in 10 months. But it doesn't necessarily follow that if one machine can do the job in 10 months, 10 machines can do the job in 1 month.
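One common way to put numbers on that trap is Amdahl's law (not named above, so take it as my framing): if only a fraction p of the job can be spread across n machines, the best possible speedup is 1 / ((1 - p) + p / n). A small Python sketch, with hypothetical function names and a made-up 90% parallel fraction:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the job can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

def elapsed_months(single_machine_months, parallel_fraction, workers):
    """Best-case wall-clock time once the job is spread over `workers` machines."""
    return single_machine_months / amdahl_speedup(parallel_fraction, workers)
```

With those assumed numbers, a 10-month job that is 90% parallelizable takes about 1.9 months on 10 machines, not 1: the serial 10% alone still costs a full month.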
Also, the problem with runtime interpreters is not that they don't generate assembly code. The problem is that it is harder to get at the underlying code that is really executing. That code could be optimized if you could see it. But seeing it is just more difficult.
Behold, this dreamer cometh. Come now, and let us slay him... and we shall see what will become of his dreams.
The fact that all we do is sequential tasks on our computer means we are still pretty stupid when it comes to "computing". If you look outside your CPU, you'll see the rest of the computers on this planet are massively parallel and do tons and tons of very complex operations far quicker than the computer running on either one of our desks.
Most of the computers on the planet are organic ones inside of critters of all shapes and sizes. I don't see those guys running around with some context-switching, mega-fast CPU, do you?**. All the critters I see are using parallel computers with each "core" being a rather slow set of neurons.
Basically, evolution of life on earth seems to suggest that the key to success is going parallel. Perhaps we should take the hint from nature.
** unless you count whatever the hell consciousness itself is... "thinking" seems to be single-threaded, but uses a bunch of interrupt hooks triggered by lord knows what running under the hood.
[T]hats why on Mac, Linux or Windows you stick with code that will just work on one core. No problems then.
That, and the much greater reason that (a) 99% of software these days would run just fine on a single core P4 3GHz, and (b) most programmers are really, really bad and it's much harder to screw up a single-threaded app badly enough that I can't fix it, than it is to screw up a multi-threaded app.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
Do you need to dedicate an entire 3ghz CPU core to run your bittorrent, and another to refresh slashdot?
Parallel computing and parallel hardware have been around for decades - not on the desktop, but in the supercomputer area. It's a tough problem to solve efficiently - there are some things which are hard to get around. As an example think of the equation y = SQRT(a*b) - you need two mathematical operations there. It doesn't really help if you have two processors, since you need the result of one operation before you can perform the second. The example isn't very interesting, but essentially you always have this problem - if you rely on the result of the previous steps, then you need to do things in order. You can modify your algorithms so that happens less often, but this is hard work and interferes with your desire to write clean readable code.
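The SQRT(a*b) point in code form, as a sketch with hypothetical function names: inside one evaluation the square root has to wait for the product, so a second processor buys nothing there. The parallelism, if any, comes from having many independent (a, b) pairs. A thread pool is used here only to show the shape; in CPython it would not actually speed up arithmetic this small.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def geometric_mean(a, b):
    product = a * b            # step 1
    return math.sqrt(product)  # step 2 cannot start before step 1 is done

def geometric_means(pairs, max_workers=4):
    """Each pair is independent of the others, so pairs can be farmed out."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: geometric_mean(*pair), pairs))
```

Restructuring an algorithm so that more of its work looks like the second function and less like the first is the "hard work" mentioned above.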
If we are talking about technology... The Linux operating system (the monolithic kernel is the operating system) works great on CPUs that have more than 4 cores. If the article writer did not know, the Linux OS powers almost all supercomputers etc. The problem is that applications aren't developed to use so many threads. The OS works just fine, but if the applications cannot use multiple threads, you do not gain anything unless you run multiple instances of them.
If we are talking about marketing lies and misinformation, the "operating system" (actually a _software system_) does not work at all, because usually this "operating system" cannot use multicore CPUs well. Who should we blame?
Seriously, Linux just works on multicore CPUs, but that is just an operating system. The software systems like Ubuntu, Fedora and Mandriva just aren't working so well.
"The problem my dear programmer, as you so eloquently put, is one of choice.."
Seriously. I have been involved with software development from 8-bit PICs to clusters spanning WANs and everything in between for the past 20 years or so.
Multiprocessing involves coordination between the processes. It doesn't matter (too much) whether it's separate cores or separate silicon. On any given modern OS there are plenty of examples of multiprocessor execution: Hard drives each have a processor, video cards each have a processor, USB controllers have a processor. All of these work because there is a well-defined API between them and the OS - a.k.a device drivers. People that write good device drivers (and kernel code) understand how an OS works. This is not generally true of the broader developer population.
Developers keep blaming the CPU manufacturers, saying it's their fault. It's not. What prevents parallel processing from becoming mainstream is the lack of a standard inter-process communications mechanism (at the language level) that abstracts a lot of the dirty little details that are needed. Once the mechanism is in place, people will start using it. I am not referring to semaphores and mutexes. These are synchronization mechanisms, NOT (directly) communication mechanisms... I am not talking about queues either - too much leeway on their use. Sockets would be closer, but most people think of sockets for "network" applications. They should be thinking of them as "distributed applications". As in distributed across cores. As an example, Microsoft just recently started to demonstrate that they "get it": the next release of VS will have a messaging library.
choice:
At this time there are too many different ways to implement multi-threaded/multi-processor aware software. Each implementation has possible bugs - race conditions, lockups, priority inversion, etc. The choices need to be narrowed
Having a standard (language & OS) API is the key to providing a framework for developers to use, yet still allowing them the freedom to customize for specific needs. So the OS needs an interface for setting CPU/core preferences and the language needs to provide the API. Once there is an API, developers can "wrap their minds" around the concept and then things will "take off". As I stated previously, I prefer the "message box" mechanisms simply because they port easily, are easy to understand and provide for a very loosely coupled interaction. All good tenets of a multi-threaded/multi-processor implementation.
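What exactly "message box" means isn't spelled out above, so the following is only one reading: each worker owns an inbox, other code talks to it solely by posting messages, and no state is shared directly. This Python sketch builds the mailboxes on the standard `queue.Queue`, which admittedly is the kind of raw queue the comment calls too loose on its own; the restriction to "post a message, get a reply" is the convention layered on top. `STOP`, `squaring_worker` and `square_all` are hypothetical names.

```python
import queue
import threading

STOP = object()  # sentinel message telling the worker to shut down

def squaring_worker(inbox, outbox):
    """Owns no shared state: reads a message, posts a reply, repeats."""
    while True:
        message = inbox.get()
        if message is STOP:
            break
        outbox.put(message * message)

def square_all(values):
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=squaring_worker, args=(inbox, outbox))
    worker.start()
    for value in values:
        inbox.put(value)   # communication happens only through the mailbox
    inbox.put(STOP)
    worker.join()
    # A single worker handles messages in arrival order, so replies line up.
    return [outbox.get() for _ in values]
```

The loose coupling is the attraction: the worker could be moved to another process or another machine by swapping the mailbox transport, without touching the caller's logic.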
Danger Will Robinson:
One thing that I fear is that once the concept catches on, it will be overused or abused. People will start writing threads and processes that don't do enough work to justify the overhead. Everyone who starts writing programs will "advertise" that it's "multi-threaded", as if this somehow automatically indicates quality and/or "better" software...Not.