SW Weenies: Ready for CMT?
tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware. This year, more threads next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."
I see a deep schism growing in the processor industry. There are two main camps: the parallel processors and the screaming single processors.
The parallel processors are used for intense processing: research, servers, clusters, databases; anything that can be divided into many little jobs and run in parallel.
The other camp is the average user who just wants fast response time and to play Doom 3 at 100+ fps.
FreeBSD: The Power to Serve!
Whatever the clock rate, multiply it by eight and it's pretty obvious that this puppy is going to be able to pump through a whole lot of instructions in aggregate.
Ho hum.
On a good day, with a following wind, Niagara might be able to do 8 integer instructions per clock cycle, provided it has 8 independent threads not blocking on I/O to execute.
It only has one floating-point execution unit attached to one of those 8 cores, so if you have a thread that needs to do some FP, it has to make its way over to that core and then has to be scheduled to be executed, and then it can only do one floating-point instruction.
Superb.
The thing is, all of the other CPU vendors will have superscalar, out-of-order 2- and 4-core 64-bit processors running at two to three times the clock frequency.
You do the mathematics.
Stick Men
Some people have predicted this move for quite some time. I remember hearing about it back in the late 80's or early 90's, and I'm sure it goes way back before then. The analogy was to steam engines and why they lost out to diesels. You can only make a steam engine so big, and you cannot connect them together to get more power. With diesels you can hook many of them together for more power. Chips are finally getting to the same point -- it is more cost efficient to chain them together than to create a monstrous one. I'm surprised it has taken this long to get to this point.
So does this mean that Intel's gamble with the Itanium was a good one? Or does this mean that we are going to try to teach students a totally new development style for more threads and parallelism?
given the fact that I haven't programmed a single-threaded program in years.
All of these recent articles about multi-cores and multiple pipelines of execution seem to miss the real value of this technology: the provisioning of multiple virtual machines in real time on the same system. While most software will never use the multi-thread, multi-CPU capabilities of even the quad-core AMD, products like VMware now allow you to dynamically provision systems on demand to deal with load. Another great use is server consolidation: instead of 10 1U racks to handle web farming, try a 16-way box that can provide a single point of reliability, management and execution for those services. This is about horizontal scaling in a vertical fashion.
I almost thought this was going to be about Star Wars nerds being forced to watch something on Country Music Television.
Look out! It's Garth Vader!
First off, performance + java != good idea. Not trying to camp fanbois here but if you really need "down to the metal" performance you're writing in C with assembler hotspots.
So the observation that there is too much locking in Java's standard API is informative but not on-topic. The fact that the standard solution is to use a completely new class [e.g. StringBuilder] is why I laughed at my college profs when they were trying to sell their Java courses by saying "and Java is well supported with over 9000 classes!".
In the C and C++ world things get extended but also fixed at the same time. We can still use the strncat function which has been around for a while EVEN IN threaded environments...
Also, he totally fails to point out that extra threads [e.g. register sets] only pay off when the pipeline is empty. So it's a catch-22. You either have a very efficient pipeline that you can cram full of a single thread's instructions or you have a shoddy one where your only hope is to mix in other threads.
Think about it. If you only have one ALU and 32 threads that means each individual thread works at 1/32 the normal speed. Even if they're a lower/higher priority!
That then gets into two camps. Are you threading because the performance of the pipeline sucks [e.g. dependencies in the P4] or because you want to interleave instructions [e.g. twice the clock rate but half the performance]? If it's the latter, then even if you turn off 31 of 32 threads you still end up with one weak ALU.
Consider the AMD64 for instance. It usually gets an IPC that is pretty high [usually in the 1.5-2.5 range] which means that it's retiring instructions from a single thread at pretty much the entire capacity of the chip. Adding extra threads doesn't help.
Consider then the P4. It usually gets an IPC of 0.5 to 1 [for ALU code, which is observable by the fact it's about as fast as a half-clockrate Pentium-M]. This means its two ALUs are not always busy and an additional thread could bump the IPC up to the 1-1.5 range.
I know [for instance] that with HT turned on my 3.2Ghz Prescott compiles LibTomCrypt in close to the same time as my 2.2Ghz AMD64 [the P4 takes 5 seconds longer, without HT it takes about 15 seconds longer].
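As a rough back-of-the-envelope check of the IPC argument above, here is a small sketch. It assumes throughput is simply clock rate times IPC, and it plugs in the clock speeds and IPC ranges quoted in the posts (estimates, not measurements). The function name `instr_per_sec` is made up for illustration.

```python
def instr_per_sec(clock_ghz, ipc):
    """Billions of instructions retired per second, assuming clock * IPC."""
    return clock_ghz * ipc

# (low, high) throughput ranges built from the figures quoted in the posts.
p4_no_ht = (instr_per_sec(3.2, 0.5), instr_per_sec(3.2, 1.0))
p4_ht = (instr_per_sec(3.2, 1.0), instr_per_sec(3.2, 1.5))
amd64 = (instr_per_sec(2.2, 1.5), instr_per_sec(2.2, 2.5))

for name, (low, high) in [
    ("P4 3.2GHz, no HT", p4_no_ht),
    ("P4 3.2GHz, HT", p4_ht),
    ("AMD64 2.2GHz", amd64),
]:
    print(f"{name}: {low:.1f} to {high:.1f} G instr/s")
```

Under these assumptions the HT range overlaps the AMD64 range while the non-HT range mostly sits below it, which is at least consistent with the compile times mentioned above.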
So the only saving grace is an efficient ALU so that you can run single tasks at least somewhat efficiently. Then tacking on the extra threads doesn't help as an efficient ALU won't have many bubbles where other threads could live.
So you end up with essentially a hardware register file but still 1/2 the performance. Remember that the goal of multi-processing is closer to 'n' times faster with n processors.
The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...
Whoopy...
Multi-threading is NOT the future. Multi-cell is. Where you have dedicated special purpose [re: space optimized] side-cores that do things like "I can do MULACC/load/store REALLY REALLY QUICK!!!".
In other words, "yet another press release on
Tom
Someday, I'll have a real sig.
In games the AI of non-player-characters (-objects) can profit a lot from threading.
But for common apps
umm, better physics and AI for games is what I can think of off the top of my head =)
Every time someone exposes concurrency at some layer as a way of improving performance, rather than because you're implementing a process that's inherently concurrent, it's a huge clusterfuck. Doesn't matter whether it's asynchronous I/O, out-of-order execution, multithreaded code, or whatever. Even when you're dealing with a concurrent environment like a graphical user interface the most successful approaches involve breaking the problem down into chunks small enough you can ignore concurrency.
One of UNIX's most important features is the pipe-and-filter model, and one of the really great things about it is that it lets you build scripts that can automatically take advantage of coarse-grained concurrency. Even on a single-CPU system, a pipeline lets you stream computation and I/O where otherwise you'd be running in lockstep alternating I/O and code.
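The pipe-and-filter idea can be sketched in a few lines. This is only an illustration, not how a shell implements pipes: each filter runs in its own thread and the stages are joined by small bounded queues, so a later stage can start on early items while an earlier stage is still producing. The names `stage`, `run_pipeline` and `DONE` are invented for the example, and in CPython the threads mostly show the structure rather than a CPU speedup.

```python
import queue
import threading

DONE = object()  # sentinel marking end of stream


def stage(func, inbox, outbox):
    """One filter: read items from inbox, apply func, write to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, *funcs):
    """Chain filters with bounded queues, roughly like `a | b | c`."""
    queues = [queue.Queue(maxsize=4) for _ in range(len(funcs) + 1)]

    def feed():
        # Producer: push the input into the first queue, then signal the end.
        for item in items:
            queues[0].put(item)
        queues[0].put(DONE)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()

    # Consumer: drain the last queue while the stages are still running.
    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


print(run_pipeline(["ls", "grep", "sort"], str.upper, lambda s: s + "!"))
```

The caller never touches a lock; the concurrency lives entirely in the plumbing between stages, which is the point being made about pipelines.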
That's where the big breakthroughs are needed: mechanisms to let you hide concurrency in a lower layer. Pipelines are great for coarse-grained parallelism, for example, but the kind of fine grain you need for Niagara demands a better design, or the parallelism needs to be shoved down to a deeper level. Intel's IA64 is kind of a lower level approach to the same thing where the compiler and CPU are supposed to find parallelism that the programmer doesn't explicitly specify, but it suffers from the typical Intel kitchen-sink approach to instruction set design.
Threads are actually one of the simplest forms of parallelism to deal with and we have had decades of experience with them. That's why Sun loves them: it fits in well with their big-iron philosophy and hardware and makes it easy for their customers to migrate to the next generation.
But the future of high-end computing, both in business and in science, will not look like that. Networks of cheap computing nodes scale better and more cost-effectively. Many manufacturers have already gone over to that for their high-end designs. That's where the real software challenges are, but they are being addressed.
Processors with lots of thread parallelism will probably be useful in some niche applications, but they will not become a staple of high-end computing.
and other such languages will become more popular as this new multithreaded world takes hold because they embed the multithreaded concepts into the language without explicit programmer interaction. C, C++, Java style threading and mutex constructs are error-prone and awkward to use.
It sounds like you want a cell.
and take some advanced architecture courses.
The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...
I'm sorry but that turns out not to be the case.
When you have a system that is running lots of different threads simultaneously the amount of time that it takes to do a context switch from one thread to another becomes an issue. In the real world, threads often do things like I/O which cause them to block or they wait on a lock. If you can do a fast context switch you get back the time that you would have wasted saving registers off to RAM and pulling back another set. Faster thread switching means that your multi-thread single core now runs its total load (all of the threads) faster than a single core single thread design. Also, things like microkernels become a lot more feasible (microkernels are notorious for being slow because context switches are slow).
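The context-switch argument can be put into a toy model. This is a sketch under loud assumptions: every number below is hypothetical and chosen only to show the shape of the argument, and the model ignores cache effects and any overlap between threads. It simply charges a fixed cost per switch on top of the useful work.

```python
def total_time_us(work_us, switches, switch_cost_us):
    """Time on one core: useful work plus time lost to context switches."""
    return work_us + switches * switch_cost_us


# Hypothetical workload, not measured on any real machine.
WORK_US = 1_000_000      # useful work summed over all threads
SWITCHES = 50_000        # times some thread blocks on I/O or a lock
SOFTWARE_SWITCH_US = 5   # save registers to RAM, load another set
HARDWARE_SWITCH_US = 0   # the other register set is already on chip

software = total_time_us(WORK_US, SWITCHES, SOFTWARE_SWITCH_US)
hardware = total_time_us(WORK_US, SWITCHES, HARDWARE_SWITCH_US)

print(f"software switching: {software} us")
print(f"hardware threads:   {hardware} us")
print(f"ratio: {software / hardware:.2f}")
```

The more often threads block, the larger the gap becomes, which is why the claim is about the total load of many threads rather than the speed of any single one.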
When you have looked beyond your desktop machine maybe you'll have earned the right to sneer at your professors. I don't think you're there yet.
Sparc playing catch-up? It's x86 that's playing catch-up to the proprietary RISC vendors. UltraSparc IV processors have had multiple cores, like the new AMD and Pentiums, for the past year or two. POWER4 from IBM started shipping with two cores per chip when it came out several years ago. HP's PA-RISC has been dual-cored for a while. I think POWER4 has SMT, and I know POWER5 does. Even before HP and Compaq merged, the next Alpha chip, the EV8, was going to have some impressive SMT, also.
The only way that x86 is ahead is clockspeed, due to aggressive production technology.
How can a true Slashdot geek not be looking forward to this? It's something new and different. I'll never own one and possibly never work with one, but I'm curious to see exactly how such a design performs, because it's a lot different from a single 3.6 GHz Pentium 4. Don't you want to at least see how it does before dismissing it? Unless you have stock in Sun or a bizarre emotional investment in processors, what's the harm in Sun spending their money on this product?
From the parent post:
"Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code."
The portion of importance is:
"insufficiently descriptive"
In C, C++, and Java, you must program with concurrency in mind to obtain any benefit from multiple threads of execution. In a functional programming language, the restrictions placed on the behavior of functions often imply concurrency without the programmer necessarily intending that as the result. If you write a C program without concurrency in mind and want to adapt your solution later to take advantage of multiple threads, you may need to code a completely different solution and also locate a compiler that knows how to take advantage of concurrency. In a functional language, you may only need to get an updated version of your compiler/interpreter. This is why C, C++, and Java are in the "insufficiently descriptive" category and functional programming languages are not.
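A small sketch of the point about side-effect-free functions, written in Python rather than a functional language, so treat it as an analogy only. Because `square` touches no shared state, the sequential `map` can be swapped for a pool's `map` without changing the result or adding any locks. The names `square`, `sequential_map` and `parallel_map` are made up for the example, and a thread pool in CPython demonstrates the safety of the swap rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Pure: the result depends only on the argument, no shared state."""
    return x * x


def sequential_map(values):
    """Apply square to each value, one after another."""
    return list(map(square, values))


def parallel_map(values, workers=4):
    """Same computation, but the calls may run concurrently.

    Executor.map returns results in input order, so the output matches
    the sequential version exactly.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


data = list(range(10))
print(sequential_map(data) == parallel_map(data))
```

In a language where purity is enforced, a compiler or runtime could in principle make that swap itself; here the programmer has to promise that the function has no side effects.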
The reason that the simulation results coming from the original UWashington research on the subject - http://www.cs.washington.edu/research/smt/ - looked far better was their use of unreasonably large caches in their simulations, and that they completely ignored the OS overhead of enabling SMT - which is non-negligible - and is a thing that has been pointed out often on the Linux Kernel mailing list as well.
I didn't read most of the Princeton paper... but you're arguing that caches need to be big to get any gains, and that Intel's HT chips show SMT doesn't offer anything. The Intel chips have ridiculously small L1 caches - only 8KB. A quick sampling of Washington papers shows they simulate machines with 64-128KB L1 caches, which are entirely reasonable - all AMD processors since the Athlon have had 64KB L1 caches. Both companies are increasing L2 sizes, and 1-2MB is not unreasonable either.
I don't know anything about OS overhead, but section 2.3 of the Princeton paper argues SMP kernels (which SMT requires) are slower, and thus you pay for extra overhead when using SMT vs a non-multithreaded single processor. However, they themselves don't make the same claim for multiprocessors (because you have to pay the OS overhead anyway), and with the introduction of dual core processors at the consumer level, everybody will soon be using the SMP kernels anyway. This point is [rapidly becoming, if not already] moot.
Their analysis in section 3.3 implied that the memory subsystem becomes the bottleneck in multiprocessor systems with SMT enabled, but before you take that and argue SMT offers nothing, I again point out problems with the Intel implementation: their memory bus is shared among all CPUs, so the per-CPU bandwidth drops with an increase in CPUs, and per-thread bandwidth is halved again. AMD's Opterons don't suffer from this same problem due to their NUMA configuration, so a 2-CPU 2-thread SMT with an Opteron-like memory system would get the same per-thread memory bandwidth as a 2-CPU non-SMT Xeon system, while supporting twice as many threads.
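The bandwidth arithmetic in that last paragraph can be written out explicitly. This is a sketch with assumptions: the 6.4 figure is a placeholder and not a real spec, and it assumes each CPU's local memory link in the NUMA case is as fast as the whole shared bus in the other case. The function name `per_thread_bandwidth` is invented for the example.

```python
def per_thread_bandwidth(link_gbs, cpus, threads_per_cpu, shared_bus):
    """Memory bandwidth available to each hardware thread.

    shared_bus=True:  all CPUs split a single link (shared front-side bus).
    shared_bus=False: every CPU has its own link (NUMA-style).
    """
    per_cpu = link_gbs / cpus if shared_bus else link_gbs
    return per_cpu / threads_per_cpu


LINK_GBS = 6.4  # hypothetical bandwidth of one memory link

shared_no_smt = per_thread_bandwidth(LINK_GBS, cpus=2, threads_per_cpu=1, shared_bus=True)
shared_smt = per_thread_bandwidth(LINK_GBS, cpus=2, threads_per_cpu=2, shared_bus=True)
numa_smt = per_thread_bandwidth(LINK_GBS, cpus=2, threads_per_cpu=2, shared_bus=False)

print(f"shared bus, 2 CPUs, no SMT:   {shared_no_smt} per thread")
print(f"shared bus, 2 CPUs, 2 thr/CPU: {shared_smt} per thread")
print(f"NUMA, 2 CPUs, 2 thr/CPU:       {numa_smt} per thread")
```

Under those assumptions the NUMA system with SMT gives each thread the same share as the shared-bus system without SMT, while running four threads instead of two.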
My server