Windows and Linux Not Well Prepared For Multicore Chips
Mike Chapman points out this InfoWorld article, according to which you shouldn't immediately expect much in the way of performance gains from Windows 7 (or Linux) from eight-core chips that come out from Intel this year. "For systems going beyond quad-core chips, the performance may actually drop beyond quad-core chips. Why? Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores. Problem? The development tools aren't available and research is only starting."
Firstly, it's false on the face of it: Ubuntu is certified on the Sun T2000, a 32-thread machine, and Canonical supports it.
Secondly, it's the same FUD we heard from uniprocessor manufacturers when multiprocessors first came out: this new "symmetrical multiprocessing" stuff will never work, it'll bottleneck on locks.
The real problem is that some programs are indeed badly written. In most cases, you just run lots of individual instances of them. Others, for grid, are well-written, and scale wonderfully.
The ones in the middle are the problem, as they need to coordinate to some degree, and don't do that well. It's a research area in computer science, and one of the interesting areas is in transactional memory.
That's what the folks at the Multicore Expo are worried about: Linux itself is fine, and has been for a while.
--dave
davecb@spamcop.net
get a mac..
I assume you're talking about Mac OS X 10.6 (Snow Leopard), whose Grand Central framework is supposed to add some tools to make Mac-exclusive multithreaded apps easier to program.
imagine software being developed for imaginary or speculatory hardware.
I think Sun called it "Java". It was run on emulators long before ARM and others came out with hardware-assisted JVMs such as Jazelle.
The /. summary of TFA is almost exquisitely bad. It's not Windows or Linux that's not ready for multicore (both have supported multi-processor machines for on the order of a decade or more), but rather the userspace applications that aren't ready. The reason is simple: parallel programming is rather hard, and historically most ISVs haven't wanted to invest in it because they could rely on processors getting faster every year or two... but no longer.
One area where I disagree with TFA is the claimed paucity of programming models and tools. Virtually every OS out there supports some kind of concurrent programming model, and often more than one depending on what language is used -- pthreads, Win32 threads, Java threads, OpenMP, MPI or Global Arrays on the high end, etc. Most debuggers (even gdb) also support debugging threaded programs, and if those don't have enough heft, there's always Totalview. The problem is that most ISVs have studiously avoided using any of these except when given no other choice.
--t
"My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
Most programs barely use any computational power. In fact, there are very few programs that require all that computing power to operate, and those are certainly well designed.
Home users do use some apps that could benefit from multiple cores. Video encoding is one of them, but that one is embarrassingly parallel because the encoder could just split the video into quadrants and have each of four cores work on one quadrant.
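The quadrant idea above can be sketched roughly as follows. This is only an illustration: `encode_quadrant` is a hypothetical stand-in that sums pixel values rather than doing any real encoding, and the frame is assumed to be a plain list of equal-length rows. Note also that in CPython a thread pool will not spread pure-Python CPU work across cores; `ProcessPoolExecutor` from the same module is the usual swap-in for that.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_quadrants(frame):
    """Split a 2-D frame (list of equal-length rows) into four quadrants.

    Order of the result: top-left, top-right, bottom-left, bottom-right.
    """
    mid_row = len(frame) // 2
    mid_col = len(frame[0]) // 2
    top, bottom = frame[:mid_row], frame[mid_row:]
    return [
        [row[:mid_col] for row in top],
        [row[mid_col:] for row in top],
        [row[:mid_col] for row in bottom],
        [row[mid_col:] for row in bottom],
    ]


def encode_quadrant(quadrant):
    # Hypothetical placeholder for real encoding work: just sum the pixels.
    return sum(sum(row) for row in quadrant)


def encode_frame(frame, workers=4):
    """Hand each quadrant to its own worker and collect results in order."""
    quadrants = split_into_quadrants(frame)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_quadrant, quadrants))
```

The point is only that the four pieces share no state, so no locking or coordination is needed between workers.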
Since the normal OoO parallelization mechanisms don't scale well enough
It hit me that this probably wasn't obvious to everyone, so just to clarify: "OoO", here, stands not for Object-Oriented Something, but for Out-of-Order, as in how current, superscalar CPUs work. See also Dataflow architecture.
The problem with very long instruction word (VLIW) architectures like EPIC and the Itanium is that the main speed limitations in today's computers are bandwidth and latency. Memory bandwidth and latency can be the dominant performance driver in a modern processor. At a system level, network, I/O (particularly for video), and hard drive bandwidth and latency can dramatically affect system performance.
With a VLIW processor, you are taking many small instruction words, and gathering them together into a smaller number of much larger instruction words. This never pays off. Essentially, it is impossible to always use all of the larger instruction words. Even with a normal super-scalar processor, it is almost impossible to get every functional unit on the chip to do something simultaneously. The same problem applies with VLIW processors. Most of the time, a program is only exercising a specific area of the chip. With VLIW, this means that many bits in the instruction word will go unused much of the time.
In and of itself, wasting bits in an instruction word isn't a big deal. Modern processors can move large amounts of memory simultaneously, and it is handy to be able to link different sections of the instruction word to independent functional blocks inside the processor. The problem is the longer instruction words use memory bandwidth every time they are read. Worse, the longer instruction words take up more space in the processor's cache memory. This either requires a larger cache, increasing the processor cost, or it increases latency, as it translates into fewer cache hits. It is no accident the Itanium is both expensive and has an unusually large on-chip cache.
The other major downfall of the VLIW architecture is that it cannot emulate a short instruction word processor quickly. This is a problem both for interpreters and for 80x86 emulation. Interpreters are a very popular application paradigm. Many applications contain them. Certain platforms, like .NET and Java, use pseudo-interpreters/compilers. 80x86 emulation is a big deal, as the majority of the world's software is written for an 80x86 platform, which features a complex variable-length instruction word. The long VLIW instructions are unable to decode either the short 80x86 instructions, or the Java JIT instruction set, quickly. Realistically, a VLIW instruction processor will be no quicker, on a per-instruction basis, than an 80x86 processor, despite the fact that the VLIW architecture is designed to execute 4 instructions simultaneously.
The memory bandwidth problem, and the fact that VLIW processors don't lend themselves to interpreters, really slows down the usefulness of the platform.
And if you look at a level lower than the profiler, you find your programs are memory-bound, and getting worse. That's a big part of the push toward multithreaded processors.
To paraphrase another commentator, they make process switches infinitely fast, so one can keep on using the ALU while your old thread is twiddling its thumbs waiting for a cache-line fill.
--dave
davecb@spamcop.net
Apple have no 2 core intel systems. Period.
Even the lowly Mac mini is a dual-core system. Every laptop is a dual-core system. The Mac Pro is either a 4-core (with hyperthreading for a virtual 8-core) or an 8-core (with hyperthreading for a virtual 16-core) system.
"Better to keep silent and look the fool, rather than speak and remove all doubt"
Simon.
Physicists get Hadrons!
This is simply not true. Assuming both cores are fully loaded, which is the best possible case for dual core, then they will still be performing context switches at the same rate as a single chip if you are running more than one process per core. Even if you had the perfect theoretical case for two cores, where you have two independent processes and never context switch, you could run them much faster on the single-core machine. A single-core 5GHz CPU would have to waste 20% of its time on context switching to be slower than a dual-core 2GHz CPU, while a real CPU will spend less than 1% (and even on the dual-core CPU, most of the time your kernel will be preempting the process every 10ms, checking if anything else needs to run, and then scheduling it again, so you don't save much).
The only way the dual core processor would be faster in your example would be if it had more cache than the 5GHz CPU and the working set for your programs fitted into the cache on the dual-core 2GHz chip but not on the 5GHz one, but that's completely independent of the number of cores.
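The 20% break-even figure above is simple arithmetic, and can be checked with a small sketch. It assumes throughput scales linearly with clock speed and that the dual-core machine keeps both cores perfectly busy (the best case for it); the function names are made up for illustration.

```python
def break_even_overhead(single_ghz, cores, per_core_ghz):
    """Fraction of time the single-core CPU must waste before its useful
    throughput falls to the multicore chip's ideal aggregate throughput."""
    aggregate_ghz = cores * per_core_ghz
    return 1.0 - aggregate_ghz / single_ghz


def effective_ghz(clock_ghz, overhead_fraction):
    """Useful clock left after losing a fraction of time to context switches."""
    return clock_ghz * (1.0 - overhead_fraction)


# 2 cores x 2 GHz = 4 GHz aggregate; 4 / 5 = 0.8, so the break-even is about 0.2.
overhead = break_even_overhead(5.0, 2, 2.0)

# With roughly 1% overhead, the 5 GHz part still delivers about 4.95 GHz.
realistic = effective_ghz(5.0, 0.01)
```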
I am TheRaven on Soylent News
Unix didn't for a long time have lightweight preemptive threads because it had, from the very beginning, lightweight preemptive processes. I spent a lot of time wondering why Windows programmers were harping on the need for threads to do what I'd been doing for a decade with a simple fork() call. And in fact if you look at the Linux implementation, there are no threads. A thread is simply a process that happens to share memory, file descriptors and such with its parent, and that has some games played with the process ID so it appears to have the same PID as its parent. Nothing new there, I was doing that on BSD Unix back in '85 or so (minus the PID games).
That was, in fact, one of the things that distinguished Unix from VAX/VMS (which was in a real sense the predecessor to Windows NT; the principal architect of VMS had a big hand in the architecture and internals of NT): on VMS, process creation was a massive, time-consuming thing you didn't want to do often, while on Unix process creation was fast and fairly trivial. Unix people scratched their heads at the amount of work VMS people put into keeping everything in a single process, while VMS people boggled at the idea of a program forking off 20 processes to handle things in parallel.
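For anyone who hasn't seen the fork() pattern described above, here is a minimal sketch using Python's `os.fork` wrapper (Unix only). The helper name `run_in_children` and the convention that each task returns a small integer exit status are assumptions made for the example, not anything from the systems discussed.

```python
import os


def run_in_children(tasks):
    """Fork one child process per task and return their exit statuses.

    Each task is a zero-argument callable returning an int in 0..255.
    Statuses come back in the same order as the tasks.
    """
    pids = []
    for task in tasks:
        pid = os.fork()
        if pid == 0:
            # Child: run the task, then exit immediately without returning
            # into the parent's code. A failing task exits with status 1.
            status = 1
            try:
                status = task()
            finally:
                os._exit(status)
        # Parent: remember the child and go on to fork the next one.
        pids.append(pid)

    statuses = []
    for pid in pids:
        _, wait_status = os.waitpid(pid, 0)
        statuses.append(os.WEXITSTATUS(wait_status))
    return statuses
```

All the children run concurrently and the kernel schedules them across whatever cores are available; the parent only blocks when it collects them with `waitpid`.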
Short answer: only one thing I mentioned involved disk I/O, RAM is cheap.
Not in modern architectures and it depends. Registers are faster than L1 caches. L1 caches are faster than L2 caches, etc.
See: http://lwn.net/Articles/250967/ for an excellent discussion about how one can dramatically speed up applications by optimizing memory access.
And I disagree with the title of this thread - Linux (the kernel at least) is quite well prepared for multicore chips.