The Economics of Chips With Many Cores
meanonymous writes "HPCWire reports that a unique marketing model for 'manycore' processors is being proposed by University of Illinois at Urbana-Champaign researchers. The current economic model has customers purchasing systems containing processors that meet the average or worst-case computation needs of their applications. The researchers contend that the increasing number of cores complicates the matching of performance needs and applications and makes the cost of buying idle computing power increasingly prohibitive. They speculate that the customer will typically require fewer cores than are physically on the chip, but may want to use more of them in certain instances. They suggest that chips be developed in a manner that allows users to pay only for the computing power they need rather than the peak computing power that is physically present. By incorporating small pieces of logic into the processor, the vendor can enable and disable individual cores, and they offer five models that allow dynamic adjustment of the chip's available processing power."
Your metaphor on multi-issue CPUs is interesting, but not necessarily valid.
Instruction scheduling is the biggest fundamental problem facing CPUs today. Even the best pipelined design issues only one instruction per clock, per pipeline (excluding things like macro-op fusion, which combines multiple logical instructions into a single internal instruction). So we add more pipelines. But more pipelines can only get us so far: it becomes increasingly difficult to figure out (schedule) which instructions can be executed on which pipeline at what time.
There are several potential solutions. One is to use a VLIW architecture where the compiler schedules instructions and packs them into bundles which can be executed in parallel. The problem with VLIW is that many scheduling decisions can only occur at runtime. VLIW is also highly dependent on having excellent compilers. All of these problems (among others) plagued Intel's advanced VLIW (they called it "EPIC") architecture, Itanium.
Another solution is virtual cores, or HyperThreading. HTT uses instructions from another thread (assuming that one is available) to fill pipeline slots that would otherwise be unused. The problem with HTT is that you still need a substantial amount of decoding logic for the other thread, not to mention a more advanced register system (although modern CPUs already have a very advanced register system, particularly on register-starved architectures like x86) and other associated logic. In addition, if you want to get benefits from pipeline stalls (e.g., as on the P4), you need even more logic. This means that HTT isn't particularly beneficial unless you have code that results in a large number of data dependencies or branch mispredicts, or if pipeline stalls are particularly expensive.
Multicore CPUs have come about for one simple reason: we can't figure out what to do with all of the transistors we have. CPUs have become increasingly complex, yet the fabrication technology keeps marching forward, outpacing the design resources that are available. This has manifested itself in two main ways.
First, designers started adding larger and larger caches to CPUs (caches are easy to design but take up lots of transistors). But after a point, adding more cache doesn't help. The more cache you have, the slower it operates. So designers added a multi-level cache hierarchy. But this too only goes so far - as you add more cache levels, the performance delta between memory and cache decreases, because there's only a finite level of reference locality in code (data structures like linked lists don't help this). You may be able to get a single function in cache, but it's unlikely that you're going to get the whole data set used by a complex program. The net result is that beyond a certain point, adding more cache doesn't do much.
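To put a rough shape on the "more cache stops helping" argument, here is a toy simulation. It is only a sketch under made-up assumptions: a single LRU cache, a synthetic access trace where 90% of accesses go to a small hot set of 64 addresses and the rest are scattered over a large address space. The names `lru_hit_rate` and `make_trace` and all the numbers are hypothetical, not measurements of any real CPU.

```python
import random
from collections import OrderedDict


def lru_hit_rate(trace, capacity):
    """Replay an address trace through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(trace)


def make_trace(n_accesses, hot_set=64, address_space=100_000,
               hot_fraction=0.9, seed=0):
    """Synthetic trace (assumption): mostly hot-set accesses, some scattered ones."""
    rng = random.Random(seed)
    trace = []
    for _ in range(n_accesses):
        if rng.random() < hot_fraction:
            trace.append(rng.randrange(hot_set))        # good locality
        else:
            trace.append(rng.randrange(address_space))  # poor locality
    return trace


trace = make_trace(20_000)
rates = {size: lru_hit_rate(trace, size) for size in (16, 64, 256, 1024)}
for size, rate in rates.items():
    print(f"{size:5d} entries: {rate:.1%} hits")
```

Once the hot set fits, growing the cache further only picks up the occasional scattered access, so each additional step in size buys much less than the previous one.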
What do you do when you can't add more cache? You could add more functional units, but then you're constrained by your front-end logic again, which is a far more difficult problem to solve. You could add more front-end logic, which is what HyperThreading does. But that only helps if your functional units are sitting idle a substantial percentage of the time (as they did on the P4).
So you look at adding both functional units and more front-end logic. You'll decode many instruction streams and try to schedule them on many pipelines. This is what modern GPUs do, and for them, it works quite well. But most general-purpose code is loaded with data dependencies and branches, which makes it very difficult to schedule more than a very few (say, 4) instructions at a time, regardless of how many pipelines you have. So, now, effectively, you have one thread that is predominantly using 4 pipelines, and one that is predominantly using the other 4.
Wait, though. If one thread is mostly using one set of pipelines, and one is mostly using the other, we can split the pipelines into two groups. Each will take one thread. This way, our register and cache systems are simpler (because
I know that on Linux, I cannot immediately tell the difference between an SMP-enabled kernel on a single-core Hyperthreading system, and an SMP-enabled kernel on a dual-core system with no hyperthreading.
In either case, I'm fairly sure I see at least two items in /proc/cpuinfo, I need an SMP kernel, etc. So if someone (Intel) suddenly decided to make a dual-core hyperthreaded design in which the "teams" actually shared a common pool, would I notice, short of Intel making an announcement?
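For what it's worth, on x86 Linux the entries in /proc/cpuinfo usually carry "physical id" and "core id" fields, which is one way to tell logical processors from physical cores. The sketch below is a rough illustration, not a robust tool: `count_cpus` is a hypothetical helper, the two sample strings are made-up excerpts, and those fields can be missing on other architectures or in some virtual machines (in which case the core count comes back as None).

```python
from pathlib import Path


def count_cpus(cpuinfo_text):
    """Return (logical processors, distinct physical cores or None)."""
    logical = 0
    cores = set()
    # Each logical processor is described by a block separated by a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        if "physical id" in fields and "core id" in fields:
            cores.add((fields["physical id"], fields["core id"]))
    return logical, (len(cores) or None)


# Made-up excerpts (assumption): one hyperthreaded core vs. two plain cores.
ONE_CORE_HT = ("processor\t: 0\nphysical id\t: 0\ncore id\t: 0\n\n"
               "processor\t: 1\nphysical id\t: 0\ncore id\t: 0\n")
TWO_CORES = ("processor\t: 0\nphysical id\t: 0\ncore id\t: 0\n\n"
             "processor\t: 1\nphysical id\t: 0\ncore id\t: 1\n")

if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print(count_cpus(path.read_text()))
```

Both samples show two "processor" entries, but the first maps them to one (physical id, core id) pair and the second to two.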
As for your assertion, a quick scan of Wikipedia suggests that you're a bit naively wrong here. (But then, I'm the one pretending to know what I'm talking about from a quick scan of Wikipedia; I suppose I'm being naive.) Wikipedia makes a distinction between instruction-level parallelism and thread-level parallelism, with advantages and disadvantages for each.
One of the advantages of thread-level parallelism is that it's software deciding what can be parallelized and how. This is all the threading, locking, message-passing, and general insanity that you have to deal with when writing code to take advantage of more than one CPU. As I understand it, a pipelining processor essentially has to do this work for you, by watching instructions as they come in, and somehow making sure that if instruction A depends on instruction B, they are not executed together. One way of doing this is to delay the entire chain until instruction B finishes. Another is to reorder the instructions.
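The two options (stall versus reorder) can be sketched with a toy issue model. This is a deliberate simplification, not how any real CPU works: every instruction takes one cycle, only read-after-write dependencies are tracked (as if registers were renamed), and `Instr`, `raw_deps`, and `schedule` are hypothetical names made up for the illustration.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def raw_deps(program):
    """For each instruction, the indices of earlier instructions producing its sources."""
    last_writer = {}
    deps = []
    for i, ins in enumerate(program):
        deps.append({last_writer[s] for s in ins.srcs if s in last_writer})
        last_writer[ins.dest] = i
    return deps


def schedule(program, width, reorder):
    """Issue up to `width` instructions per cycle; return the names issued each cycle."""
    deps = raw_deps(program)
    done = set()                        # instructions issued in earlier cycles
    pending = list(range(len(program)))
    cycles = []
    while pending:
        issued = []
        for i in pending:
            if len(issued) == width:
                break
            if deps[i] <= done:         # all producers finished in earlier cycles
                issued.append(i)
            elif not reorder:
                break                   # in-order: stall behind the unready instruction
        done.update(issued)
        pending = [i for i in pending if i not in done]
        cycles.append([program[i].name for i in issued])
    return cycles


program = [
    Instr("a", "r1", ()),
    Instr("b", "r2", ("r1",)),   # needs a
    Instr("c", "r3", ()),
    Instr("d", "r4", ("r3",)),   # needs c
]
print(schedule(program, width=2, reorder=False))  # [['a'], ['b', 'c'], ['d']]
print(schedule(program, width=2, reorder=True))   # [['a', 'c'], ['b', 'd']]
```

With a fully dependent chain (each instruction reading the previous result), both policies take one cycle per instruction no matter how large `width` is, which is the limit the parent post is describing.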
But even if you consider this a solved problem, it requires a bit of hardware to solve. I'm guessing at some point, it's easier to just throw more cores at the problem than to try to make each core a more efficient pipeline, just as it's easier to throw more cores at the problem than it is to try to make each core run faster.
There's also that user-level interface I talked about above. With multicore and no hyperthreading, the OS knows which core is which, and can distribute tasks appropriately -- idle tasks can take up half of one core, the gzip process (or whatever) can take up ALL of another core. With multicore and hyperthreading, the OS might not know -- it might simply see four cores. And with multicore, hyperthreading, and shared pipelines, it gets worse -- as I understand it, there's no longer any way, at that point, that an OS can specify which CPU a particular thread should be sent to. Threading itself may become irrelevant.
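As a concrete picture of "the OS can specify which CPU a thread runs on": Linux exposes this as CPU affinity, and Python wraps it as `os.sched_getaffinity` / `os.sched_setaffinity`. A minimal sketch, assuming Linux (the calls do not exist on every platform, hence the `hasattr` check); `pin_to_one_cpu` is a hypothetical helper name. Note that the CPU numbers here are logical processors, so with hyperthreading two of them may share one physical core.

```python
import os


def pin_to_one_cpu():
    """Briefly pin the current process to one logical CPU, then restore.

    Returns the affinity set while pinned, or None if the platform
    does not expose the scheduler-affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)       # 0 means "the calling process"
    target = min(allowed)                   # pick any CPU we are allowed to use
    os.sched_setaffinity(0, {target})
    try:
        return os.sched_getaffinity(0)
    finally:
        os.sched_setaffinity(0, allowed)    # put the original mask back


if __name__ == "__main__":
    print(pin_to_one_cpu())
```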
Well, anyway... What confuses me is that we still haven't adopted languages and practices that naturally scale to multiple cores. I'm not talking about complex threading models that make it easy to deadlock -- I'm talking about message-passing systems like Erlang, or wholly-functional systems like Haskell.
Hint: Erlang programs can easily be ported from single-core to multi-core to a multi-machine cluster. Haskell programs require extra work at the source code level to be made single-threaded, and can (like Make) use an arbitrary number of threads, specifiable at the command line. They're not perfect, by far; Haskell's garbage collector is single-threaded, I think. But that's an implementation detail; most programs in C and friends, even Perl/Python/Ruby, will not be written with multiple cores in mind, and, in fact, have single-threaded implementations (or stupid things like the GIL).
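To show the message-passing style in miniature (not Erlang itself, and not a claim about how Erlang or Haskell implement it), here is a sketch using Python's standard `queue` and `threading` modules. Workers share no state and only talk through queues, and the worker count is just a parameter, in the spirit of `make -j`. Ironically, CPython threads are subject to the GIL mentioned above, so this illustrates the programming model rather than an actual speedup; `worker` and `sum_of_squares` are hypothetical names.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive numbers until a None sentinel arrives; reply with their squares."""
    while True:
        msg = inbox.get()
        if msg is None:
            break
        outbox.put(msg * msg)


def sum_of_squares(numbers, n_workers):
    """Fan work out to n_workers via messages and sum the replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    count = 0
    for n in numbers:
        inbox.put(n)
        count += 1
    for _ in threads:
        inbox.put(None)                 # one stop message per worker
    total = sum(outbox.get() for _ in range(count))
    for t in threads:
        t.join()
    return total


print(sum_of_squares(range(10), n_workers=1))  # 285
print(sum_of_squares(range(10), n_workers=4))  # 285
```

The calling code is identical for one worker or four; only the number passed in changes.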
Don't thank God, thank a doctor!
Having worked at Nvidia, I can tell you there is a reason those extra TPCs were disabled, and it's not a crippleware model: it's yield. We cannot produce chips that are perfect all the time. So we settle for chips that are perfect a small percentage of the time, mostly perfect an OK percentage of the time, and half working a good percentage of the time. We then make 3 or 4 different series (GS/GT/GTX/GTS/Ultra) with different TPCs in each series, disable the TPCs in each chip that don't work or fail to pass QA, and then ship them. If you unlock them, you are frying your working card, because some of the faults could be things like "Oops, there was a short in the TPC because the transistors cooked too close to each other" or "Oops, the clock passes too close to the +12V in this module -- if it hits 50 Celsius, it could turn into a short". This model keeps products from being prohibitively expensive for a fabless company, because we are billed on "silicon wafers used" and not on "number of fault-free chips produced".
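The yield argument can be made concrete with a little binomial arithmetic. This is only a back-of-the-envelope sketch: it assumes each unit on a die is good independently with the same probability (real defects cluster, so this is a simplification), and the 8 units and 90% per-unit figure are invented for illustration, not Nvidia numbers. `working_unit_distribution` and `sellable_fraction` are hypothetical names.

```python
from math import comb


def working_unit_distribution(n_units, p_unit_ok):
    """P(exactly k of n_units work), for k = 0..n_units, assuming independence."""
    return [comb(n_units, k) * p_unit_ok**k * (1 - p_unit_ok)**(n_units - k)
            for k in range(n_units + 1)]


def sellable_fraction(n_units, p_unit_ok, min_units):
    """Fraction of dies with at least min_units working units."""
    return sum(working_unit_distribution(n_units, p_unit_ok)[min_units:])


# Invented example: 8 units per die, each good 90% of the time.
perfect_only = sellable_fraction(8, 0.9, min_units=8)
with_binning = sellable_fraction(8, 0.9, min_units=6)
print(f"sell only perfect dies:     {perfect_only:.1%}")
print(f"also sell 6- and 7-unit dies: {with_binning:.1%}")
```

Under these assumptions only about 43% of dies are perfect, but about 96% have at least six good units, so selling lower tiers with the bad units disabled recovers most of each wafer that has already been paid for.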