Windows and Linux Not Well Prepared For Multicore Chips
Mike Chapman points out this InfoWorld article, according to which you shouldn't immediately expect much in the way of performance gains from Windows 7 (or Linux) from eight-core chips that come out from Intel this year. "For systems going beyond quad-core chips, the performance may actually drop beyond quad-core chips. Why? Windows and Linux aren't designed for PCs beyond quad-core chips, and programmers are to blame for that. Developers still write programs for single-core chips and need the tools necessary to break up tasks over multiple cores. Problem? The development tools aren't available and research is only starting."
So basically yet another tech writer finds out that a huge number of applications are still single threaded, and that it will be a while before we have applications that can take advantage of the cores that the OS isn't actively using at the moment. Well, assuming you're running a desktop and not a server.
This isn't a performance issue with regards to Windows or Linux, they're quite adept at handling multiple cores. They just don't need that much themselves and the applications run these days, individually, don't need much more than that either.
So yes, applications need parallelization. The tools for it are rudimentary at best. We know this. Nothing to see here.
Multiple virtual machines on the same piece of metal, with a workstation hypervisor, and intelligent balancing of apps between backends.
Multiple OSes sharing the same cores. Multiple apps running on the different OSes, and working together.
Which can also be used to provide fault tolerance... if one of the worker apps fails, or even one of the OSes fails, your processor capability is reduced, a worker app in a different OS takes over, use checkpointing procedures, and shared state, so the apps don't even lose data.
You should even be able to shutdown a virtual OS for windows updates without impact, if the apps that arise get designed properly...
...programmers are to blame for that
The development tools aren't available and research is only starting."
Stupid programmers! Not able to develop software without the tools! In my day we wrote our own tools - in the snow, uphill, both ways! We didn't need no stink'n vendor to do it for us - and we liked it that way!
Firstly, it's false on the face of it: Ubuntu is certified on Sun T2000, a 32-thread and Canonical is supporting it.
Secondly. it's the same FUD as we heard from uniprocessor manufacturers when multiprocessors first came out: this new "symmetrical multiprocessing" stuff will never work, it'll bottleneck on locks.
The real problem is that some programs are indeed badly written. In most cases, you just run lots of individual instances of them. Others, for grid, are well-written, and scale wonderfully.
The ones in the middle are the problem, as they need to coordinate to some degree, and don't do that well. It's a research area in computer science, and one of the interesting areas is in transactional memory.
That's what the folks at the Multicore Expo are worried about: Linux itself is fine, and has been for a while.
--dave
davecb@spamcop.net
"The development tools aren't available and research is only starting"
Hardly. Erlang's been around 20 years. Newer languages like Scala, Clojure, and F# all have strong concurrency. Haskell has had a lot of recent effort in concurrency (www.haskell.org/~simonmar/papers/multicore-ghc.pdf).
If you prefer books there's: Patterns for Parallel Programming, the Art of Multiprocessor Programming, and Java Concurrency in Practice, to name a few.
All of these are available now, and some have been available for years.
The problem isn't that tools aren't available, it's that the programmers aren't preparing themselves and haven't embraced the right tools.
Too bad BeOS died. One of the axioms the developers had was 'the machine is a multi processor machine', and everything was built to support that.
Seems like they were 15 years ahead of their time. But, on the other hand, too late to establish an other OS in a saturated market. Pity, really.
The quote presented in the summary is nowhere to be found in the linked article. To make matters worse, the summary claims that linux and windows aren't designed for multicore computers but the linked article only claims that some applications are not designed to be multi-threaded or running multiple processes. Well, who said that every application under the sun must be heavily multi-threaded or spawning multiple processes? Where's the need for a email client to spawn 8 or 16 threads? Will my address book be any better if it spans a bunch of processes?
The article is bad and timothy should feel bad. Why is he still responsible for any news being posted on slashdot?
The /. summary of TFA is almost exquisitely bad. It's not Window or Linux that's not ready for multicore (as both have supported multi-processor machines for on the order of a decade or more), but rather the userspace applications that aren't ready. The reason is simple: Parallel programming is rather hard, and historically most ISVs have haven't wanted to invest in it because they could rely on the processors getting faster every year or two... but no longer.
One area where I disagree with TFA is the claimed paucity of programming models and tools. Virtually every OS out there supports some kind of concurrent programming model, and often more than one depending on what language is used -- pthreads, Win32 threads, Java threads, OpenMP, MPI or Global Arrays on the high end, etc. Most debuggers (even gdb) also support debugging threaded programs, and if those don't have enough heft, there's always Totalview. The problem is that most ISVs have studiously avoided using any of these except when given no other choice.
--t
"My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
No, it's not about adaptation. The whole approach currently taken is completely, outright on-its-head wrong.
To begin with, I don't believe the article about the systems being badly prepared. I can't speak for Windows, but I know for sure that Linux is capable of far heavier SMP operation than 4 CPUs.
But more importantly, many programming tasks simply aren't meaningful to break up into such units of granularity is OS-level threads. Many programs would benefit from being able to run just some small operations (like iterations of a loop) in parallel, but just the synchronization work required to wake up even a thread from a pool to do such a thing would greatly exceed the benefit of it.
People just think about this the wrong way. Let me re-present the problem for you: CPU manufacturers have been finding it harder to scale the clock frequencies of CPUs higher, and therefore they start adding more functional units to CPUs to do more work per cycle instead. Since the normal OoO parallelization mechanisms don't scale well enough (probably for the same reasons people couldn't get data-flow architectures working at large scales back in the 80's), they add more cores instead.
The problem this gives rise to, as I stated above, is that the unit of parallelism gained by more CPUs is to large to divide the very small units of work that exist among. What is needed, I would argue, is a way to parallelize instructions in the instruction set itself. HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limitation on how far it scales (currently four instructions simultaneously).
I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set which is inherently capable of parallel operation, not on adding more cores and pushing the responsibility onto the programmers for multi-threading their programs. This is the kind of the the compiler could do just fine (even the compilers that exist currently -- GCC's SSA representation of programs, for example, is excellent for these kinds of things), by isolating parts of the code in which there are no dependencies in the data-flow, and which could therefore run in parallel, but they need the support in the instruction set to be able to specify such things.
I think this problem will take longer than a year or two to solve. Modern computers are really fast. They solve simple problems, almost instantly. A side-effect of this, is that if you underestimate the computational power required for the problem at hand, then you are likely to be off by large amounts.
If you implement an order n-squared algorithm, O(n^2), on a 6502 (Apple II), if n was larger than a few hundred, you were dead. Many programmers wouldn't even try implementing hard algorithms on the early Apple II computers. On the other hand, a modern processor might tolerate O(n^2) algorithms with n larger than 1000. Programmers can try solving much harder problems. However, the programmers ability to estimate and deal with computational complexity has not changed since the early days of computers. Programmers use generalities. They use ranges: like n will be between 5 and 100, or n will be between 1000 and 100,000. With modern problems, n=1000 might mean the problem can be solved on a netbook, and n=100,000 might require a small multi-core cluster.
There aren't many programming platforms out there that scale smoothly between applications deployed on a desktop, to applications deployed on a multi-core desktop, and then to clusters of multi-core desktops. Perhaps most worrying, is that the new programming languages that are coming out, are not particularly useful for intense data analysis. The big examples of this for me are: .NET and Functional Languages. .NET deployed at about the same time multi-core chips showed up, and has minimal support for it. Functional languages may eventually be the solution, but for any numerically intensive application, tight loops of C code are much faster.
The other issue with multi-core chips, is that as a programmer, I have two solutions to making my code go faster: .NET, or many functional languages, because they use run-time compilers/interpreters and don't generate assembly code.
1. Get out the assembly print outs and the profiler, and figure out why the processor is running slow. Doing this, helps every user of the application, and works well with almost any of the serious compiled languages (C, C++). Sometimes, I can get a 10:1 speed improvement.(*) It doesn't work so well with Java,
2. I recode for a cluster. Why stop at a multi-core computer? If I can get a 2:1 to 10:1 speed up by writing better code, then why stop at a dual or quad core? The application might require a 100:1 speed up, and that means more computers. If I have a really nasty problem, chances are that 100 cores are required, not just 2 or 8. Multi-core processors are nice, because they reduce cluster size and cost, but a cluster will likely be required.
The problem with both of the above approaches, is that from a tools perspective, they are the worst choice for multi-core optimizations. Approach 1 will force me into using C and C++, which doesn't even handle threads really well. In particular, C and C++ lacks an easy implementation of Software Transactional Memory, NUMA, and clusters. This means that approach 2 may require a complete software redesign, and possibly either a language change or a major change in the compilation environment. Either way, my days of fun loving Java and .NET code are coming to a sudden end.
I just don't think there is any easy way around it. The tools aren't yet available for easy implementation of fast code that scales between the single-core assumption and the multi-core assumption in a smooth manner.
Note: * - By default, many programmers don't take advantage of many features that may increase the speed of an algorithm. Built-in special purpose libraries, like MMX, can dramatically speed up certain loops. Sometimes loops contain a great deal of code that can be eliminated. Maybe a function call is present in a tight loop. Anti-virus software can dramatically affect system speed. Many little things can sometimes make big differences.
Perl has excellent support for building threaded applications. See http://perldoc.perl.org/threads.html . I code multi-threaded apps in perl all the time and they utilize my quad-code very efficiently - in fact, my biggest hassle with multithreading is keeping the CPU cooled! There's also a threads::shared module (http://perldoc.perl.org/threads/shared.html) for handling locks, etc. I'd be hard pressed to imagine better language support for threading. Hardware, operating systems, and a lot of languages support threading. Granted, it isn't always easy/possible/worth it, but as things currently stand, the only bottleneck is programmers who are too lazy to design their algorithms for parallel execution.
Since the normal OoO parallelization mechanisms don't scale well enough
It hit me that this probably wasn't obvious to everyone, so just to clarify: "OoO", here, stands not for Object-Oriented Something, but for Out-of-Order, as in how current, superscalar CPUs work. See also Dataflow architecture.
To dumb your message down, CPU manufacturers act like book publishers who want you to read one book in two different places at the same time just because you happen to have two eyes. But a story can't be read this way, and for the same reason most programs don't benefit from several CPU cores. Books are read page by page because each little bit of story depends on previous story; buildings are constructed one floor at a time because each new floor of a building sits on top of lower floors; a game renders one map at a time because it's pointless to render other maps until the player made his gameplay decisions and arrived there.
In this particular case CPU manufacturers do what they do simply because that's the only thing they know how to do. We, as users, for most tasks would rather prefer a single 1 THz CPU core, but we can't have that yet.
There are engineering and scientific tasks that can be easily subdivided - this comes to mind - and these are very CPU-intensive tasks. They will benefit from as many cores as you can scare up. But most computing in the world is done using single-threaded processes which start somewhere and go ahead step by step, without much gain from multiple cores.
This is the sort of thing I like about Apple's 'Grand Central'. The idea behind is that instead of assigning a task to a processor, it breaks up a task into discrete compute units that can be assigned wherever. When doing processing in a loop, for example, if each iteration is independent, you could make each iteration a separate 'unit', like a packet of computation.
The end result is that the system can then more efficiently dole out these 'packets' without the programmer having to know about the target machine or vice-versa. For some computation, you could use all manner of different hardware - two dual-core CPUs and your programmable GPU, for example - because again, you don't need to know what it's running on. The system routes computation packets to wherever they can go, and then receives the results.
Instead of looking at a program as a series of discrete threads, each representing a concurrent task, it breaks up a program's computation into discrete chunks, and manages them accordingly. Some might have a higher priority and thus get processed first (think QoS in networking), without having to prioritize or deprioritize an entire process. If a specific packet needs to wait on I/O, then it can be put on hold until the I/O is done, and the CPU can be put back to work on another packet in the meantime.
What you get in the end is a far more granular, more practical way of thinking about computation that would scale far better as the number of processing units and tasks increases.
The problem with very long instruction word (VLIW) architectures like the EPIC and the Itanium, is that the main speed limitations in today's computers are bandwidth and latency. Memory bandwidth and latency can be the dominant performance driver in a modern processor. At a system level, network, I/O (particularly for the video), and a hard drive bandwidth and latency can dramatically affect system performance.
With a VLIW processor, you are taking many small instruction words, and gathering them together into a smaller number of much larger instruction words. This never pays off. Essentially, it is impossible to always use all of the larger instruction words. Even with a normal super-scalar processor, it is almost impossible to get every functional unit on the chip to do something simultaneously. The same problem applies with VLIW processors. Most of the time, a program is only exercising a specific area of the chip. With VLIW, this means that many bits in the instruction word will go unused much of the time.
In and of itself, wasting bits in an instruction word isn't a big deal. Modern processors can move large amounts of memory simultaneously, and it is handy to be able to link different sections of the instruction word to independent functional blocks inside the processor. The problem is the longer instruction words use memory bandwidth every time they are read. Worse, the longer instruction words take up more space in the processor's cache memory. This either requires a larger cache, increasing the processor cost, or it increases latency, as it translates into fewer cache hits. It is no accident the Itanium is both expensive and has an unusually large on-chip cache.
The other major downfall of the VLIW architecture is that it cannot emulate a short instruction word processor quickly. This is a problem both for interpreters and for 80x86 emulation. Interpreters are a very popular application paradigm. Many applications contain them. Certain languages, like .NET and Java, use pseudo-interpreters/compilers. 80x86 emulation is a big deal, as the majority of the worlds software is written for an 80x86 platform, which features a complex variable length instruction word. The long VLIW instructions are unable to decode either the short 80x86 instructions, or the Java JIT instruction set, quickly. Realistically, a VLIW instruction processor will be no quicker, on a per instruction basis, than an 80x86 processor, despite the fact the VLIW architecture is designed to execute 4 instructions simultaneously.
The memory bandwidth problem, and the fact that VLIW processors don't lend themselves to interpreters, really slows down the usefulness of the platform.
You're thinking too simply. A single-core system at 5GHz would be less-responsive for most users than a dual-core 2GHz. Here's why:
While you're playing a game more programs are running in the background - anti-virus, defrag, email, google desktop, etc. Also, any proper, modern game splits it's tasks, e.g. game AI, physics, etc.
So dual-core is definitely a huge step up from single. So, no, users don't want single-core, they want a faster more responsive pc, which NOW is dual-core. In a few years it will be quad core. Most now hardly benefit from quad core.
To dumb your message down, CPU manufacturers act like book publishers [...]
What is this "books" crap? Pft, I remember when car analogies were good enough for everyone. Now you have to get all fancy. Let me try and explain it more clearly:
CPUs are like cars. Intel and Friends haven't been able to keep increasing the velocity they can safely and reliably run, so instead of relying on increased speed to get more people from point A to point B, they are instead starting to look at parallelization as a means to achieve better performance.
Now you are chopped up into 10 pieces and FedEx'd to your destination with 100 other people. Pieces may go by road, rail, air, or ship and thus overall capacity--"bandwidth" you might say--of the lanes of travel has been increased.
The only problem is that the people who make use of this new technique ("programmers", that is) have a hard time chopping you up in such a way that you can be put back together again. Usually it's a bit of a mess and more trouble that it's worth, thus we just keep driving our old-fashioned cars at normal speeds while adding lanes to the roads.
"What do you despise? By this are you truly known." --Princess Irulan, Manual of Muad'Dib
/)
This is simply not true. Assuming both cores are fully loaded, which is the best possible case for dual core, then they will still be performing context switches at the same rate as a single chip if you are running more than one process per core. Even if you had the perfect theoretical case for two cores, where you have two independent processes and never context switch, you could run them much faster on the single-core machine. A single-core 5GHz CPU would have to waste 20% of its time on context switching to be slower than a dual-core 2GHz CPU, while a real CPU will spend less than 1% (and even on the dual-core CPU, most of the time your kernel will be preempting the process every 10ms, checking if anything else needs to run, and then scheduling it again, so you don't save much).
The only way the dual core processor would be faster in your example would be if it had more cache than the 5GHz CPU and the working set for your programs fitted into the cache on the dual-core 2GHz chip but not on the 5GHz one, but that's completely independent of the number of cores.
I am TheRaven on Soylent News
It's posts like these that make me think that I'm the only one with 7 programs on the task bar, 12 in the system tray, assorted server processes, and 32 tabs open in Firefox (come on, 1 thread per tab!!). It doesn't much matter to me if each of these parts are not multithreaded, as long as the OS is smart enough to put active threads on different cores.
Three Cores for the Mozilla-kings under the GUI,
Seven for the Gnome-lords in their halls of X,
Nine for KDE Men doomed to be flamed,
One for the Free Scheduler on his free kernel
In the Land of Linux where the SMP lie.
One Core to rule them all, One Core to find them,
One Core to bring them all and in the scheduler bind them
In the Land of Linux where the SMP lie.
Which is, of course, what will eventually happen if the number of cores keep increasing: we'll need one dedicated exclusively to manage what goes where and when. Which is pretty cool when you think about it ;)
No problem is insoluble in all conceivable circumstances.