SW Weenies: Ready for CMT?
tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware. This year, more threads next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."
Now my hardware will force me to support CMT on my computer? This is as bad as DRM.
I see a deep schism growing in the processor industry. There are two main camps: the parallel processors and the screaming single processors.
The parallel ones are used for intense processing: research, servers, clusters, databases; anything that can be divided into many little jobs and run in parallel.
The other camp is the average user who just wants fast response time and to play Doom 3 at 100+ fps.
FreeBSD: The Power to Serve!
Whatever the clock rate, multiply it by eight and it's pretty obvious that this puppy is going to be able to pump through a whole lot of instructions in aggregate.
Ho hum.
On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second, provided it has 8 independent threads not blocking on I/O to execute.
It only has one floating-point execution unit attached to one of those 8 cores, so if you have a thread that needs to do some FP, it has to make its way over to that core and then has to be scheduled to be executed, and then it can only do one floating-point instruction.
Superb.
The thing is, all of the other CPU vendors with have super-scalar, out-of-order 2- and 4- core 64- bit processors running at over twice to three times the clock frequency.
You do the mathematics.
Stick Men
Some people have predicted this move for quite some time. I remember hearing about it back in the late '80s and early '90s, and I'm sure it goes way back before then. The analogy was to steam engines and why they lost out to diesels. You can only make a steam engine so big, and you cannot connect them together to get more power. With diesels you can hook many of them together for more power. Chips are finally getting to the same point: it is more cost-efficient to chain them together than to create a monstrous one. I'm surprised it has taken this long to get to this point.
"...and we don't know the answers, but I pulled together some of the questions."
What is this now, Questions for Nerds. Stuff we don't know?
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth.
What truth?
There is no dupe
from TFA:
"Problem: Legacy Apps You'd be surprised how many cycles the world's Sun boxes spend running decades-old FORTRAN, COBOL, C, and C++ code in monster legacy apps that work just fine and aren't getting thrown away any time soon. There aren't enough people and time in the world to re-write these suckers, plus it took person-centuries in the first place to make them correct.
Obviously it's not just Sun, I bet every kind of computer you can think of carries its share of this kind of good old code. I guarantee that whoever wrote that code wasn't thinking about threads or concurrency or lock-free algorithms or any of that stuff. So if we're going to get some real CMT juice out of these things, it's going to have to be done automatically down in the infrastructure. I'd think the legacy-language compiler teams have lots of opportunities for innovation in an area where you might not have expected it."
I almost thought this was going to be about Star Wars nerds being forced to watch something on Country Music Television.
On a good day, with a following wind, Niagara might be able to do 8 integer instructions per second,
I meant per clock cycle, of course, not per second.
The thing is, all of the other CPU vendors with have
I meant "will have" not "with have".
Stick Men
Well I am a Star Wars weenie, and I am definitely NOT ready for Country Music Television.
Given the fact that I haven't programmed a single threaded program in years.
As a scientific programmer, all I know is that this will eventually be a huge benefit to all my MPI and OpenMP codes.
I really only know the "scientific" programming languages, but almost all math-specific routines are already written for parallel machines. I'm a bit curious: what else really needs multiple threads? Isn't the benefit of dual-core procs the ability to not have a slow-down when you run two or three apps at a time? Don't games like DOOM III and Half-Life II depend mostly on the GPU (and I'm guessing they can handle multi-core GPUs, since the programming should be fairly similar to SLI)? What is the benefit in games? Just faster level loading times?
I don't want to sound like I'm whining or anything here... I'm not saying that multiple cores suck. On the contrary they're fantastic for what I do, but I just was hoping you guys could help me understand how common apps and non-mathematical operations can use them.
CMT is manufactured pop-country music at its worst. Yuck!
the good ground has been paved over by suicidal maniacs
"The hardware guys are getting ready to toss this big hairy package over the wall:"
Vivid imagery...
32 threads in hardware on one chip is the same as 32 slow CPUs.
Current programming languages are insufficiently descriptive to permit compilers to generate usefully multi-threaded code.
Accordingly, multi-threading is currently handled by the programmer; which by and large doesn't happen, because programmers are not used to it.
A lot of applications these days are weakly multi-threaded - Windows apps for example often have one thread for the GUI, another for their main processing work.
This is *weak* multi-threading, because the main work done occurs within a single thread. Strong multi-threading is when the main work is somehow partitioned so that it is processed by several threads. This is difficult, because a lot of tasks are inherently serial: stage A must complete before stage B, which must complete before stage C.
The main technique I'm aware of for making good use of multi-threading support is that of worker-thread farms. A main thread receives requests for work and farms them out to worker threads. This approach is useful only for a certain subset of problem types, however, and within the processing of *each* worker thread, the work done itself remains essentially serial.
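A worker-thread farm of the kind described above might look roughly like the following. This is a minimal sketch, not anyone's production design: `handle_request` and `serve` are hypothetical names, and the "work" is a toy summation standing in for a real request.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(request):
    """Hypothetical unit of work; inside one worker it still runs serially."""
    total = 0
    for i in range(request + 1):
        total += i
    return total


def serve(requests, workers=4):
    """The main thread farms requests out to a fixed pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each request to whichever worker is free and
        # returns results in the order the requests were submitted.
        return list(pool.map(handle_request, requests))


results = serve([10, 100, 1000])
print(results)
```

Note that each call to `handle_request` is still a plain serial loop, which is the limitation the paragraph above points out: the farm spreads independent requests around, but does nothing to speed up any single one.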
In other words, clock speeds have hit the wall, transistor counts are still rising, the only way to improve performance is to have more CPUs/threads, but programming models don't yet know how to actually *use* multiple CPU/threads.
El problemo!
--
Toby
You can have your parallel processors and still play DOOM III at insane fps. At worst, it will just take a bit for folks to start writing programs to take advantage of the additional processors/cores.
BTW, your "average" user hasn't even played DOOM I, let alone DOOM III. Surfing the web and using e-mail doesn't usually put a lot of strain on a PC.
Now of course, the room was full of Sun infrastructure weenies, so if there's something terribly obvious in records management or airline reservations or payroll processing that doesn't parallelize, we might not know about it.
Well, since I work in airline reservations systems, I'll add my $0.02 worth...
Most OLTP systems will benefit from CMT and multi-core processors. We had a test server from AMD about a month before the dual-core Opteron was announced; we did some initial testing and then put it in the production cluster and fired it up. No code changes, no recompile, no drama.
IMHO, the single-user applications, such as games and word processors, will be harder to parallelize.
Alan.
First off, performance + java != good idea. Not trying to camp fanbois here but if you really need "down to the metal" performance you're writing in C with assembler hotspots.
So the observation that there is too much locking in Java's standard API is informative but not on-topic. The fact that the standard solution is to use a completely new class [e.g. StringBuilder] is why I laughed at my college profs when they were trying to sell their Java courses by saying "and Java is well supported with over 9000 classes!".
In the C and C++ world things get extended but also fixed at the same time. We can still use the strncat function which has been around for a while EVEN IN threaded environments...
Also, he totally fails to point out that extra threads [e.g. register sets] only pay off when the pipeline is empty. So it's a catch-22. You either have a very efficient pipeline that you can cram full of a single thread's instructions or you have a shoddy one where your only hope is to mix in other threads.
Think about it. If you only have one ALU and 32 threads that means each individual thread works at 1/32 the normal speed. Even if they're a lower/higher priority!
That then gets into two camps. Are you threading because the performance of the pipeline sucks [e.g. dependencies in the P4] or because you want to interleave instructions [e.g. twice the clock rate but half the performance]? If it's the latter, then even if you turn off 31 of 32 threads you still end up with one weak ALU.
Consider the AMD64 for instance. It usually gets an IPC that is pretty high [usually in the 1.5-2.5 range] which means that it's retiring instructions from a single thread at pretty much the entire capacity of the chip. Adding extra threads doesn't help.
Consider then the P4. It usually gets an IPC of 0.5 to 1 [for ALU code, which is observable by the fact it's about as fast as a half-clockrate Pentium-M]. This means its two ALUs are not always busy and an additional thread could bump the IPC up to the 1-1.5 range.
I know [for instance] that with HT turned on my 3.2GHz Prescott compiles LibTomCrypt in close to the same time as my 2.2GHz AMD64 [the P4 takes 5 seconds longer, without HT it takes about 15 seconds longer].
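The IPC and clock figures quoted above can be multiplied out as a rough sanity check. This is only a back-of-envelope sketch: treating IPC times clock rate as a throughput proxy is an assumption that ignores memory stalls and workload differences, and `throughput_range` is a hypothetical helper.

```python
def throughput_range(clock_ghz, ipc_low, ipc_high):
    """Rough (low, high) billions of instructions retired per second."""
    return (round(clock_ghz * ipc_low, 2), round(clock_ghz * ipc_high, 2))


amd64 = throughput_range(2.2, 1.5, 2.5)     # 2.2 GHz AMD64, IPC 1.5-2.5
p4_no_ht = throughput_range(3.2, 0.5, 1.0)  # 3.2 GHz P4, HT off
p4_ht = throughput_range(3.2, 1.0, 1.5)     # 3.2 GHz P4, HT on

for name, (low, high) in [("AMD64", amd64), ("P4", p4_no_ht), ("P4+HT", p4_ht)]:
    print(f"{name}: {low} to {high} billion instructions/s")
```

Under that crude model the HT-enabled P4 range overlaps the low end of the AMD64 range, while the P4 without HT sits below it, which is at least consistent with the ordering of the compile times reported above.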
So the only saving grace is an efficient ALU so that you can run single tasks at least somewhat efficiently. Then tacking on the extra threads doesn't help as an efficient ALU won't have many bubbles where other threads could live.
So you end up with essentially a hardware register file but still 1/2 the performance. Remember that the goal of multi-processing is closer to 'n' times faster with n processors.
The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...
Whoopy...
Multi-threading is NOT the future. Multi-cell is. Where you have dedicated special purpose [re: space optimized] side-cores that do things like "I can do MULACC/load/store REALLY REALLY QUICK!!!".
In other words, "yet another press release on /.".
Tom
Someday, I'll have a real sig.
I would be far more interested in taking advantage of all the CPU cycles that run all over at Businesses.
Condor.
That's really a shame about the FP performance. My hobby project is ray tracing, and my code is just waiting to be run on parallel hardware. The preferred system would have multiple cores sharing cache, but separate caches would be fine too. Memory is not the bottleneck, so higher GHz and more cores/threads will be very welcome so long as they each have good performance. The code scales well with multiple CPUs, as pixels can be rendered in parallel with zero effort; the code was designed for that. As it sits, I'm hoping my Shuttle (SN95G5v2) will support an AMD64x2 shortly. We're still not up for RT Quake, but interactive (read: very jerky 1-2 fps) high-poly scenes are possible today.
Every time someone exposes concurrency at some layer as a way of improving performance, rather than because you're implementing a process that's inherently concurrent, it's a huge clusterfuck. Doesn't matter whether it's asynchronous I/O, out-of-order execution, multithreaded code, or whatever. Even when you're dealing with a concurrent environment like a graphical user interface the most successful approaches involve breaking the problem down into chunks small enough you can ignore concurrency.
One of UNIX's most important features is the pipe-and-filter model, and one of the really great things about it is that it lets you build scripts that can automatically take advantage of coarse-grained concurrency. Even on a single-CPU system, a pipeline lets you stream computation and I/O where otherwise you'd be running in lockstep alternating I/O and code.
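The pipe-and-filter idea can be imitated inside one program with threads and bounded queues. The sketch below is an analogy to a shell pipeline, not how UNIX pipes are implemented; `stage`, `run_pipeline`, and the `_DONE` sentinel are hypothetical names chosen for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Run one filter: read items, transform them, pass them downstream."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, *funcs):
    """Connect filters with bounded queues, one thread per filter."""
    queues = [queue.Queue(maxsize=8) for _ in range(len(funcs) + 1)]

    def feed():
        # The source runs in its own thread so a full queue cannot
        # block the code that drains the end of the pipeline.
        for item in items:
            queues[0].put(item)
        queues[0].put(_DONE)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


output = run_pipeline(["alpha", "beta", "gamma"], str.upper, lambda s: s[::-1])
print(output)
```

Each filter is written as ordinary serial code; the concurrency lives entirely in the plumbing between stages, which is the property the comment above is praising.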
That's where the big breakthroughs are needed: mechanisms to let you hide concurrency in a lower layer. Pipelines are great for coarse-grained parallelism, for example, but the kind of fine grain you need for Niagara demands a better design, or the parallelism needs to be shoved down to a deeper level. Intel's IA64 is kind of a lower level approach to the same thing where the compiler and CPU are supposed to find parallelism that the programmer doesn't explicitly specify, but it suffers from the typical Intel kitchen-sink approach to instruction set design.
I'll misquote Fred Weigel and suggest that the next problem is branching: Samba code seems to generate 5 instructions between branches, so suspending the process and running something else until the branch target is in I-cache seems like A Good Thing (;-)).
Methinks Samba would really enjoy a CMT processor.
--dave
davecb@spamcop.net
Threads are actually one of the simplest forms of parallelism to deal with and we have had decades of experience with them. That's why Sun loves them: it fits in well with their big-iron philosophy and hardware and makes it easy for their customers to migrate to the next generation.
But the future of high-end computing, both in business and in science, will not look like that. Networks of cheap computing nodes scale better and more cost-effectively. Many manufacturers have already gone over to that for their high-end designs. That's where the real software challenges are, but they are being addressed.
Processors with lots of thread parallelism will probably be useful in some niche applications, but they will not become a staple of high-end computing.
Seriously! And why foist this garbage on the Star Wars (SW) weenies? Has John Williams gone country?
Easy. In present days there are some assembly instructions that can be executed simultaneously. With a chip like this however, all bets would be off. Instead of just a meager few instructions that could be executed simultaneously you would be able to execute any number of instructions simultaneously.
So if you have a function that say does 10 additions and 10 moves you would first figure out if any of them needed to be done before or after each other. Then see which ones don't matter. Then write the function to do as many at once as possible.
It really doesn't matter for anyone other than the compiler writers. Those guys will write the compiler to do this kind of assembly level optimization for you. The trick is writing a high level language, or modifying an existing one, so the compiler can tell which things must be executed in order and which can be executed side by side.
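The "figure out which ones must come first, then do the rest side by side" step described above amounts to grouping operations by dependency level. A minimal sketch, assuming the dependencies are already known; `schedule` and the four toy instructions are hypothetical, and a real compiler works on a much richer representation.

```python
def schedule(deps):
    """Group instructions into steps; everything in a step can run at once.

    deps maps each instruction to the set of instructions whose
    results it needs before it can execute.
    """
    remaining = {name: set(needs) for name, needs in deps.items()}
    done = set()
    steps = []
    while remaining:
        # An instruction is ready once everything it needs has completed.
        ready = sorted(n for n, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("circular dependency")
        steps.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return steps


# Hypothetical function body: the first two additions are independent.
deps = {
    "a = x + y": set(),
    "b = z + w": set(),
    "c = a + b": {"a = x + y", "b = z + w"},
    "d = c + 1": {"c = a + b"},
}
steps = schedule(deps)
for number, step in enumerate(steps, start=1):
    print(number, step)
```

Here only the first step has any parallelism in it; the last two operations form a chain, which is the kind of inherently serial work other comments in this thread are worried about.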
The GeekNights podcast is going strong. Listen!
If single-threaded performance improvements slow down, and the available computing power is spread out among multiple cores, anyone persisting in writing single-threaded code will fall behind in performance.
Remember the old days when people used fancy tricks to implement naturally concurrent solutions as single-threaded programs? The future is going to be just the opposite. Any day now we'll see a rush toward languages with special support for quick, clear, safe parallelism, just like we've seen scripting languages catch on for web programming.
IBM started SHIPPING Power5 with SMT capability on August 31 of last year, and IBM has SMT running on 1.9 GHz processors today. Sun is getting farther and farther behind.
-The Mad Duke
That is hard to say. EPIC is a very long instruction word architecture (VLIW) which supports up to 3 concurrent non-interfering instructions which requires static (compile time) scheduling, since the instructions must be in contiguous memory. Getting efficient scheduling is hard, since the complexity is pushed back on the compiler, which may need to do some serious code reordering. Additionally, EPIC was designed to support speculative execution, which has efficiency issues if the wrong prediction is made. Additionally, EPIC had a new instruction set/core so Intel may not have gotten as much reuse of existing designs that multithreaded (using register bank switching) or multi core designs might have been able to exploit. Modern fabrication and design is so complex, that widely used designs get development resources and new interesting directions often don't get fabricated.
From the article:
The standard APIs that came with the first few versions of Java were thread safe; some might say fanatically, obsessively, thread-safe. Stories abound of I/O calls that plunge down through six layers of stack, with each layer posting a mutex on the way; and venerable standbys like StringBuffer and Vector are mutexed-to-the-max. That means if your app is running on next year's hot chip with a couple of dozen threads, if you've got a routine that's doing a lot of string-appending or vector-loading, only one thread is gonna be in there at a time.
Classes such as StringBuffer and Vector are locked (synchronized) on a per-object basis. As long as you aren't trying to access the same object from different threads you won't block. And if you are trying to access the same object from different threads you will be happy that they were thread-safe!
The performance problems of having these classes be obsessive about thread safety do not result from the locking forcing single-threadedness. The performance problems stem from the cost of locking objects.
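The per-object locking point can be illustrated with a small Python analogue. This is not Java's StringBuffer; `LockedBuffer` is a hypothetical class that, like a synchronized Java object, carries one lock per instance. The sketch shows lock ownership only and is not a performance measurement.

```python
import threading


class LockedBuffer:
    """Hypothetical stand-in for a class synchronized per object."""

    def __init__(self):
        self._lock = threading.Lock()  # one lock per instance, not a global one
        self._parts = []

    def append(self, text):
        with self._lock:
            self._parts.append(text)

    def value(self):
        with self._lock:
            return "".join(self._parts)


def fill(buffer, text, count):
    for _ in range(count):
        buffer.append(text)


buf_a, buf_b = LockedBuffer(), LockedBuffer()
threads = [
    threading.Thread(target=fill, args=(buf_a, "a", 1000)),
    threading.Thread(target=fill, args=(buf_b, "b", 1000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Holding buf_a's lock (reached into here only for demonstration) does not
# block work on buf_b, because each object has its own lock.
with buf_a._lock:
    buf_b.append("!")

print(len(buf_a.value()), len(buf_b.value()))
```

Threads working on different buffers never wait on each other; what they do pay, on every single `append`, is the cost of taking and releasing a lock, which is the overhead the comment above identifies.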
Hardware threading has been mainstream for more than two years in the form of HyperThreading.
Simultaneous Multi-Threading is a CPU's ability to concurrently execute mixed instructions from multiple threads. Intel's HT is simply 2-way SMT.
Chip Multi-Threading is a CPU's ability to hold execution states for multiple threads, executing instructions from only one of them at a time unless the chip is also SMT.
In Sun's case, the mid-term plan is to eventually offer 8-way SMT with 32-way CMT: the CPU can hold states for up to 32 threads and have in-flight instructions from as many as eight of them.
and take some advanced architecture courses.
The BEST a single core multi-thread design can hope for is the performance of a single core single thread design...
I'm sorry but that turns out not to be the case.
When you have a system that is running lots of different threads simultaneously the amount of time that it takes to do a context switch from one thread to another becomes an issue. In the real world, threads often do things like I/O which cause them to block or they wait on a lock. If you can do a fast context switch you get back the time that you would have wasted saving registers off to RAM and pulling back another set. Faster thread switching means that your multi-thread single core now runs its total load (all of the threads) faster than a single core single thread design. Also, things like microkernels become a lot more feasible (microkernels are notorious for being slow because context switches are slow).
When you have looked beyond your desktop machine maybe you'll have earned the right to sneer at your professors. I don't think you're there yet.
Those of you who are up on the current state of the art here, please help me out. I was under the impression that multiple threads and automatic storage management were still not on good terms with each other, and that this was a big unsolved problem.
To a Lisp hacker, XML is S-expressions in drag.
the technology and architecture were beautiful. the execution and business planning were poor. because it was such a huge and underfunded effort to get the whole thing (os, compiler, processor, network) brought up from scratch, they lagged current technology at both of the two introductions. stability was a problem.
still, a terrible shame. a testament to the failings of the short-term investment model.
the compiler did automatic parallelization, but only really well for HPC-style loop nests. if you weren't running parallel code, you really suffered, because the individual thread execution rates were so poor, and they ran uncached (one of the nice things about the model is that they used concurrency to hide memory latency, but if you didn't have it to exploit...)
I'm a researcher working on high performance computing and have used various configurations of Simultaneous Multithreading (aka Hyperthreading aka CMT) (Intel Xeon, IBM POWER5). The result is always the same - at the end, memory latencies and OS overheads kill most of the gains of instruction level parallelism coming from SMT. Look at it this way - the typical latencies of operations on most modern processors are of the order of 1 nanosecond, whereas DRAM latencies are of the order of 200ns. As long as you can't do anything about this latency, there's no point in cutting down on processing times. There's a very nice paper in this year's ACM SIGMETRICS that gives real experimental data to illustrate this fact - http://www.cs.princeton.edu/~yruan/XeonSMT/smt.pdf
The paper shows that the speedups obtained using SMT in practice are meagre. The reason that the simulation results coming from the original UWashington research on the subject - http://www.cs.washington.edu/research/smt/ - looked far better was their use of unreasonably large caches in their simulations, and that they completely ignored the OS overhead of enabling SMT, which is non-negligible and has been pointed out often on the Linux Kernel mailing list as well.
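The 1 ns versus 200 ns comparison above can be turned into a quick estimate of how much time goes to waiting on memory. The two latencies are the ones quoted in the comment; the miss fractions are hypothetical inputs chosen only for illustration, and the model assumes a stalled operation cannot be overlapped with anything else.

```python
def average_ns(miss_fraction, op_ns=1.0, dram_ns=200.0):
    """Average time per operation when a fraction of them wait on DRAM."""
    return (1.0 - miss_fraction) * op_ns + miss_fraction * dram_ns


def stalled_share(miss_fraction, op_ns=1.0, dram_ns=200.0):
    """Fraction of total time spent waiting on memory."""
    return miss_fraction * dram_ns / average_ns(miss_fraction, op_ns, dram_ns)


# Hypothetical miss fractions: 1% and 5% of operations go out to DRAM.
for miss in (0.01, 0.05):
    print(f"miss={miss:.0%}: {average_ns(miss):.2f} ns/op, "
          f"{stalled_share(miss):.0%} of time stalled")
```

Even with only 1 in 100 operations going to DRAM, roughly two thirds of the time is spent waiting under this model, so making the other 99 operations faster changes the total very little.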
It's very simple, actually, I do it quite frequently. Let's say you need to populate a drop-down based on user input in another drop-down.
At the start of the page you collect user input and fire off the data access code for the original drop and the parameterized drop, each in a separate thread.
This executes while you're performing other formatting actions, like include headers, menu formatting, and outputting strings to your response (like client scripts, etc).
All the while, the other threads are formatting the first & second dropdown with the returned data, while your main thread is doing more menial UI tasks like formatting the tables and such that hold your page.
This is simply a basic example, but anyone who uses data access code even for a single databound table or drop should always be running it in a separate thread and letting the main thread handle the non-data related rendering. There is a TON of work on web pages that you can be doing that is not data related, nor is it required to be performed before or after the data is available.
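The pattern described above, starting the data access early and doing unrelated rendering while it runs, can be sketched as follows. Everything here is hypothetical: `fetch_options` fakes a slow query with a sleep, and the markup and names are placeholders rather than any particular web framework's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_options(table, parent=None):
    """Hypothetical data-access call; the sleep stands in for a slow query."""
    time.sleep(0.05)
    prefix = f"{parent}/" if parent else ""
    return [f"{prefix}{table}-1", f"{prefix}{table}-2"]


def render_select(name, options):
    items = "".join(f"<option>{o}</option>" for o in options)
    return f'<select name="{name}">{items}</select>'


def render_page(selected):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Fire off both queries first so they run while other parts render.
        first = pool.submit(fetch_options, "country")
        second = pool.submit(fetch_options, "city", selected)

        # Non-data work proceeds on the main thread in the meantime.
        parts = ["<header>", "<menu>"]

        # result() blocks only if the query has not finished yet.
        parts.append(render_select("country", first.result()))
        parts.append(render_select("city", second.result()))
    parts.append("<footer>")
    return "".join(parts)


page = render_page("fr")
print(page)
```

The two queries overlap with each other and with the header and menu work, so the page waits for roughly the slower of the two queries rather than the sum of both.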
Sorry, sorry, sorry...
I couldn't help it.
- None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
"Has John Williams gone country?"
No, that's his brother, Hank.
Beauty is in the eye of the beerholder.