Windows and Linux Not Well Prepared For Multicore Chips
Mike Chapman points out this InfoWorld article, according to which you shouldn't immediately expect much in the way of performance gains from Windows 7 (or Linux) from eight-core chips that come out from Intel this year. "For systems going beyond quad-core chips, the performance may actually drop beyond quad-core chips. Why? Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores. Problem? The development tools aren't available and research is only starting."
Multiple virtual machines on the same piece of metal, with a workstation hypervisor, and intelligent balancing of apps between backends.
Multiple OSes sharing the same cores. Multiple apps running on the different OSes, and working together.
Which can also be used to provide fault tolerance... if one of the worker apps fails, or even one of the OSes fails, your processor capability is reduced, a worker app in a different OS takes over, use checkpointing procedures, and shared state, so the apps don't even lose data.
You should even be able to shutdown a virtual OS for windows updates without impact, if the apps that arise get designed properly...
Languages like PHP/Perl, as a rule, are not designed for threading - at ALL. This makes multi-core performance a non-starter. Sure, you can run more INSTANCES of the language with multiple cores, but you can't get any single instance of a script to run any faster than what a single core can do.
I have, so, so, SOOOO many times wished I could split a PHP script into threads, but it's just not there. The closest you can get is with (heavy, slow, painful) forking and multiprocess communication through sockets or (worse) shared memory.
Truth be told, there's a whole rash of security issues through race conditions that we'll soon have crawling out of nearly every pore as the development community slowly digests multi-threaded applications (for real!) in the newly commoditized multi-CPU environment.
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Too bad BeOS died. One of the axioms the developers had was 'the machine is a multi processor machine', and everything was built to support that.
Seems like they were 15 years ahead of their time. But, on the other hand, too late to establish an other OS in a saturated market. Pity, really.
The onus may ultimately lie with developers to bridge the gap between hardware and software to write better parallel programs......They should open up data sheets and study chip architectures to understand how their code can perform better, he said.
Here's the problem, most programs spend 99% of its time waiting. MOST of that is waiting for user input. Part of it is waiting for disk access (as mentioned in the AnandTech story, the best thing you can do to speed up your computer is get a faster hard drive/SSD). A miniscule part of it is spent in the processor. If you don't believe me, pull out a profiler and run it on one of your programs, it will show you where things can be easily sped up.
Now, given that the performance of most programs is not processor bound, what is there to gain by parallelizing your program? If the performance gain were really that significant, I would already be writing my program with threads, even with the tools we have now. The fact of the matter is in most cases, there is really no point to writing your program in a parallel manner. This is something a lot of the proponents of Haskell don't seem to understand, that even if their program is easily paralellizable, the performance gain is not likely to be noticeable. Speeding up hard drives will make more of a difference to performance in most cases than adding cores.
I for one am certainly not going to be reading chip data sheets unless there's some real performance benefit to be found. If there's enough benefit, I may even write parts in assembly, I can handle any ugliness. But only if there's a benefit from doing so.
Qxe4
No, it's not about adaptation. The whole approach currently taken is completely, outright on-its-head wrong.
To begin with, I don't believe the article about the systems being badly prepared. I can't speak for Windows, but I know for sure that Linux is capable of far heavier SMP operation than 4 CPUs.
But more importantly, many programming tasks simply aren't meaningful to break up into such units of granularity is OS-level threads. Many programs would benefit from being able to run just some small operations (like iterations of a loop) in parallel, but just the synchronization work required to wake up even a thread from a pool to do such a thing would greatly exceed the benefit of it.
People just think about this the wrong way. Let me re-present the problem for you: CPU manufacturers have been finding it harder to scale the clock frequencies of CPUs higher, and therefore they start adding more functional units to CPUs to do more work per cycle instead. Since the normal OoO parallelization mechanisms don't scale well enough (probably for the same reasons people couldn't get data-flow architectures working at large scales back in the 80's), they add more cores instead.
The problem this gives rise to, as I stated above, is that the unit of parallelism gained by more CPUs is to large to divide the very small units of work that exist among. What is needed, I would argue, is a way to parallelize instructions in the instruction set itself. HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limitation on how far it scales (currently four instructions simultaneously).
I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set which is inherently capable of parallel operation, not on adding more cores and pushing the responsibility onto the programmers for multi-threading their programs. This is the kind of the the compiler could do just fine (even the compilers that exist currently -- GCC's SSA representation of programs, for example, is excellent for these kinds of things), by isolating parts of the code in which there are no dependencies in the data-flow, and which could therefore run in parallel, but they need the support in the instruction set to be able to specify such things.
The final solution is that the processor measures and decides which part of which program must be run parallel and which are better off left alone.
What else do we have computers for?
As I mentioned briefly in my post, there was research into dataflow architecures in the 70's and 80's, and it turned out to be exceedingly difficult to do such things efficiently in hardware. It may very well be that they still are the final solution, but until such time as they become viable, I think doing the same thing in the compiler, as I proposed, is more than enough. That's still the computer doing it for you.
I don't entirely agree with you here. A lot of current applications *do* suffer from CPU-induced latency after user interactions, and the problem is simple: they don't differentiate between the things that must get done before control is returned to the user, and the things that need to happen in response to the action but can be allowed to happen whenever resources are free. Even when the problem is resource-access latency, multithreading can be a win because that latency no longer contributes to the latency that the user perceives if it happens on a background thread.
Something as simple as tossing function calls off on a background thread to deal with some of these tasks would do a great deal to improve latency from the user's perspective, and is really quite trivial to implement. Most programmers don't do it, though. Part of that is that in most situations there aren't ready-made solutions- you can't just say "run this function call on a background thread", you've got to go through the pthread creation process, etc. (Apple's Cocoa framework is actually an exception to this with it's NSOperation).
The situation is analogous to that of an interrupt task: Do absolutely as little as possible before returning; everything else should happen on some other thread.
I agree with you regarding optimization, but it's been my experience that many applications *can* benefit from these sorts of simple multithreading techniques- the programmers just don't do them, either from lack of ability or lack of resources.
The ringing of the division bell has begun... -PF
All that which you say is certainly true, but I would still argue that EPIC's greatest problem is its hard parallelism limit. True, it's not as hard as I tried to make it out, since an EPIC instruction bundle has its non-dependence flag, but you cannot, for instance, make an EPIC CPU break off and execute two sub-routines in parallel. Its parallelism lies only in very small spatial window of instructions.
What I'd like to see is, rather, that the CPU can implement a kind of "micro-thread" function, that would allow two larger codepaths simultaneously -- larger than what EPIC could handle, but quite possibly still far smaller than what would be efficient to distribute on OS-level threads, with all the synchronization and scheduler overhead that would mean.
So, we can broadly say that there are three areas where we can parallelise.
First you have the document level. Google Chrome is a good example of this - first we had the concept of multiple documents open in the same program, now we have the concept for a separate thread for each "document" (or tab in this case). Games are also moving ahead in this area, using separate threads for graphics, AI, sound, physics and so on.
Then you have the OS level. Say the user clicks to sort a table of data into a new order, the OS can take care of that. It's a standard part of the GUI system, and can be set off as a separate thread. Of course, some intelligence is required here as it's only worth spawning another thread if the sort is going to take some appreciable amount of time.
At the bottom you have the algorithm level, which is the hard one. So far this level has got a lot of attention, but the others relatively little. The first two are the low hanging fruit, which is where people should be concentrating.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
The fact that all we do is sequential tasks on our computer means we are still pretty stupid when it comes to "computing". If you look outside your CPU, you'll see the rest of the computers on this planet are massively parallel and do tons and tons of very complex operations far quicker than the computer running on either one of our desks.
Most of the computers on the planet are organic ones inside of critters of all shapes and sizes. I dont see those guys running around with some context-switching, mega-fast CPU, do you?**. All the critters I see are using parallel computers with each "core" being a rather slow set of neurons.
Basically, evolution of life on earth seems to suggest that the key to success is going parallel. Perhaps we should take the hint from nature.
** unless you count whatever the hell consciousness itself is... "thinking" seems to be single-threaded, but uses a bunch of interrupt hooks triggered by lord knows what running under the hood.
You may be interested to know that, as far as I can tell from the rather fuzzy documention, the MSM7201A processor used in the G1 smartphone has at least three dissimilar cores, and potentially five:
I gather that it's pretty hard to make them share address spaces, even the two ARMs; so SMP is probably not feasible. Message-passing via specific shared memory segments is the usual approach.