Multithreading - What's it Mean to Developers?
sysadmn writes "Yet another reason not to count Sun out: Chip Multithreading. CMT, as Sun calls it, is the use of hardware to assist in the execution of multiple simultaneous tasks - even on a single processor. This excellent tutorial on Sun's Developer site explains the technology, and why throughput has become more important than absolute speed in the enterprise.
From the intro: Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. ... A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously within a processor on cores. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases a processor's efficiency as wait latencies are minimized. "
this makes me wonder what the effect would be on something like stackless python?
the whole state pickling concept is pretty cool, and kind of throws threads all over..
anything i write usually maxes out the processor at 100% for days at a time (i deal with huge data conversions)
so yeah i'd also like to know: what does it mean to me?
This is Sun's Niagara Design. The more I learn about it, the more I think that it's nothing that exciting.
From the lack of non-Sun-supplied buzz regarding this technology, it would appear that many people aren't finding it very exciting.
Not sure I buy that this "increases a processor's efficiency as wait latencies are minimized". It seems to me that decreasing latency reduces efficiency because you spend a greater percentage of your cycles changing state (overhead) instead of doing useful work. This is why realtime OSes aren't the norm: they reduce latencies to critical maximums, but at the cost of overall throughput.
As many others have already pointed out, Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about.
I was skeptical at first, and read some of those articles showing that some applications could actually run slower. But then I tried it for myself, and I have to admit I've been impressed. My main box is a dual-Xeon, each with Hyperthreading turned on. It appears to Linux as if I have four independent CPUs. A few numerical tasks saturate the processors if I have just two of them running in parallel, but several tasks do fine with four or more copies. My favorite is "make -j 4" - starting four gcc processes in parallel works surprisingly well. How long does it take you to compile the Linux kernel?
That's fine for producer/consumer type problems, but there are other types of problems that don't lend themselves to that model.
I've been programming multithreaded code for a while, too, and giant locking (which is what you describe) is not very efficient much of the time for what I've done in the past. Linux and Solaris had this type of architecture for the kernel at one time and they've long since evolved away from that.
In short, how you use threads really depends on what you are trying to do. Hammering all multi-threaded programming into this one model may not be efficient or easy. That model does serve nicely for a number of tasks, but not all.
The moderators are on crack today. Intel's hyperthreading is more of a marketing gimmick (which you fell for). It provides, what, a few percent improvement in performance?
The fact is that Intel's Pentiums spend most of their time _not_doing_anything_at_all_. They just sit there waiting on data.
Sun's Niagara will be able to queue 32 threads simultaneously, with 8 of those threads computing (8 cores). My guess is that Sun's analysis showed that, on average, three threads are waiting on memory while one can go forward with the data it has. This means that Sun is betting on your beloved Pentium being only 25% efficient!
I think I've also read that Sun is planning on giving Niagara obscene amounts of bandwidth to RAM. In short, if you are running a web server, for example, it would be stupid to stick with something like Pentium when something like Niagara is available.
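For what it's worth, the 25% figure is just arithmetic on the numbers quoted above. A quick sketch in Python (the function names are made up for illustration, and the three-of-four-threads-waiting premise is the guess stated above, not a published Sun figure):

```python
def hardware_threads(cores, threads_per_core):
    """Total threads the chip can hold in flight at once."""
    return cores * threads_per_core


def running_fraction(cores, threads_per_core):
    """Fraction of queued threads executing if each core runs one at a time."""
    return cores / hardware_threads(cores, threads_per_core)


# Niagara as described: 8 cores, 4 threads queued per core.
print(hardware_threads(8, 4))   # 32
print(running_fraction(8, 4))   # 0.25
```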
CMT is nothing more than multi-core processors. Sun is using the marketing idea of CMT to hide the fact that the UltraSparc IV is nothing more than two UltraSparc III cores on one chip.
One way to look at this is Sun maximizing their existing engineering efforts. However, by marketing it as some revolutionary feature advance, they're implying that they've done something new and exciting, as opposed to something that IBM is already doing and AMD and Intel are working on.
Beyond that, Sun and Fujitsu have a co-manufacturing and R&D deal now, confirming something those in the enterprise space have been saying for a long time - Fujitsu was making better Sun servers than Sun.
Plus Sun killed plans for the UltraSparc V, leaving only the Niagara. They have the Opteron line pushing up from below, and rapidly evaporating sales at the high end. They're resorting to marketing gibberish to add new features to the product line, while simultaneously offloading R&D and manufacturing to a partner.
Remind me again why Sun is in the hardware business?
Thanks,
Matt
me@mzi.to
I've learned over the years that preemptive multithreading should be used only as a last resort, and even then, it's best to put exactly one synchronization point in the entire app. Self-contained tasks should be dispatched from that point and deliver their results back with little or no interaction with the other threads.
Exactly, and that's where design patterns come into play... many of these problems have been formally described in patterns you can follow to avoid this; with thread synchronization, you can use the Half-Sync/Half-Async pattern for example, and you can make a task an Active Object so it can deliver its own results...
Multi-threaded programming is hard, very hard; but you're not alone in thinking it's hard, and many researchers have formally described a bunch of rules you can follow... if you follow these rules, you often enough eliminate most of the more complicated problems.
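To make the Active Object idea concrete, here is a minimal sketch in Python, assuming a single worker thread that drains a request queue and hands each result back through a future. The class and method names (ActiveObject, submit, stop) are invented for this example and are not taken from the pattern literature or any particular library:

```python
import queue
import threading
from concurrent.futures import Future


class ActiveObject:
    """Runs submitted calls on its own thread; callers only see futures."""

    def __init__(self):
        self._requests = queue.Queue()   # the single synchronization point
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:             # sentinel: shut down
                break
            future, func, args = item
            try:
                future.set_result(func(*args))
            except Exception as exc:     # deliver failures to the caller too
                future.set_exception(exc)

    def submit(self, func, *args):
        """Queue a self-contained task and return a future for its result."""
        future = Future()
        self._requests.put((future, func, args))
        return future

    def stop(self):
        """Ask the worker to finish queued work and exit."""
        self._requests.put(None)
        self._worker.join()


if __name__ == "__main__":
    ao = ActiveObject()
    result = ao.submit(pow, 2, 10)
    print(result.result(timeout=5))      # 1024
    ao.stop()
```

Callers never touch the worker's state directly. Everything goes through the one queue, which is the same "one synchronization point" discipline described in the quoted comment.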
- Leon Mergen
http://www.solatis.com
Try make -j5 or -j6. Tends to have better results than the -j4 on my dual Xeon rig. And yes, I have benchmarked it.
Sun's upcoming "Niagara" chips are supposed to have eight cores, each core being able to execute four threads. So that allows up to 32 threads executing at once -- on one physical chip.
And we're not talking about "HyperThreading" where one of the CPUs is virtual. It's a real execution unit.
And Intel and AMD are talking about dual-cores?
This should help save space and energy (both in the power needed to run the box, and in running the cooling system).
"Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about"
You are wrong. Period. Sun's CMT is several independent CPU cores on the same die with a huge bandwidth interconnect on-die. Intel's Hyperthreading is a gimmicky technology that has a very small real-world impact on performance.
And your personal "benchmarks" cite no numbers. I be trolled!
While conceptually unrelated, I put threads into the same mental category as untyped pointers. They are extremely powerful, but a complete PITA to debug if anything goes wrong, even more so if you are maintaining someone else's void* or pthread_create filled application.
What I've always done is code extremely defensively:
1. make the various threads data-independent enough to be free-running and only co-ordinate at the start and finish of a thread's activity. If necessary, re-architect everything in sight to make this possible.
2. when interaction is required, get a nice big coarse-grained lock and do everything that needs to be done and get it over with. profile it; there's a good chance it'll be over with quickly enough that it won't erase gains from parallelism or at least you can see what's taking so long and move it outside the lock.
3. do TONS of load testing with lots of big files and random data. thread-related bugs can often hide for years in your code. Unlike divide by zero or null pointer references, a thread bug won't necessarily give any kind of hardware fault or exception. You have to go hunt for the bugs, they won't just pop up and say hi here i am.
4. If you have multiple people of various technical abilities working on the code, you should add a grep/sed script to your makefile to check for accidental introduction of mt-unsafe library calls (strtok, ctime, etc). Flag new monitors and locks for review. Warn about dumb things like using static or global variables.
5. Last trick is to use a layer to allow your program to be compiled for fork/wait, pthread_create/pthread_join, or just plain old co-routine execution (esp if there is a socket you can set to non-blocking). In addition to being able to test your code for correctness in various situations, you also have a baseline to see if the multithreading is an actual improvement.
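Item 4 can be sketched roughly as follows, in Python rather than grep/sed. The function list is a small illustrative sample, not a complete inventory of MT-unsafe calls, and find_mt_unsafe_calls is a hypothetical helper name:

```python
import re

# Illustrative sample only; extend to match your platform's documentation.
MT_UNSAFE = ("strtok", "ctime", "asctime", "gmtime", "localtime")

# Match the bare name followed by "(", so strtok_r( and my_ctime( are skipped.
_CALL = re.compile(r"\b(" + "|".join(MT_UNSAFE) + r")\s*\(")


def find_mt_unsafe_calls(source):
    """Return (line_number, function_name) for each suspicious call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _CALL.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    sample = (
        'char *t = strtok(buf, ",");\n'
        'char *u = strtok_r(buf, ",", &save);\n'
        "puts(ctime(&now));\n"
    )
    for lineno, name in find_mt_unsafe_calls(sample):
        print(f"line {lineno}: {name}() is not thread-safe")
```

It is a plain text match, so calls mentioned inside comments or string literals will be flagged too. For a build-time warning that is usually an acceptable amount of noise.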
With the obvious exceptions for embarrassingly parallel algorithms, I've found that humdrum client/server or middleware stuff:
(a) gets only marginal gains from multithreading
(b) you have to work for it--profiling and tuning are still required to get top-notch performance
(c) efficient scaling beyond a handful of threads is the exception not the rule. If you have more threads than CPUs, it's a simple fact that some of them are going to be waiting and then your scaling is done.
Just as managing memory is a "hard problem", but malloc() and free() make it safer, there are tools that let you use threads safely and easily too.
Consider using something like OpenMP. There is nothing dangerous or risky or hard to debug in examples like
If you're trying to write your own "thread create" "thread join" stuff by hand, you're wasting your time and your employer's resources in the same way as if you decided to re-write your own garbage collector.
Actually, when you think about it an improved threading model would actually strongly benefit well-programmed games. Why? Because there are a lot of semi-related processes occurring. Sound, graphics, physics, etc etc... they're all part of the game but work in very different ways.
Now if you're working with a multithreaded CPU, one processor can be handling your CPU-bound graphics work (much of this is handed off to the video card anyhow), another can be doing sound/surround mixing, etc.
In an FPS with complicated AI, you could theoretically hand that off to CPU #2 while #1 is handling different things. Your graphics engine might not have ugly-mofo-alien #235 onscreen to render, but meanwhile he's watching you and looking for a boulder that will offer him good cover to snipe you from instead of just sitting like a drone waiting for a computer-accurate headshot.
Now let's say that PCs are going multi-CPU. Maybe you don't need a single superpowerful processor, just a videocard and a few lower-powered processors. Processor #1 is handing off the environmental data, #2 is prepping it for rendering and shovelling your GPU full of vertices, #3 is playing pinpoint surround for that cricket chirping behind the rock on your far left, and #4 is doing AI for ugly alien mofo #287.
When I think about how games are advancing a lot can come down to interprocess communications and/or bandwidth limitations. The GPU still handles much of the video stuff so your CPU isn't really a bottleneck there in many cases, but as internet connections speed up then you're going to have MMORPGs, FPS's, and more chock full of "actors" that make up sight, sound, physics, and AI that could very well benefit from more CPU's rather than extra ticks on your overclocked single processor.
After all, eye-candy is only a part of realism. True realism is also very much about a multitude of things happening at once.
As many others know, you know exactly nothing about what you are talking about.
Dude, you don't know anything either. P4's hyperthreading is a two-thread implementation of simultaneous multithreading. Niagara is an 8-way multiprocessor on a chip, and each processor has four-way simultaneous multithreading, exactly like the P4, just with more threads.
Regarding the amount of concurrent threads, it's basically equivalent to a 16-way Xeon server with hyperthreading enabled, but with much faster inter-processor communication (since it's all inside the same chip), and of course much lower cost, heat dissipation, etc.
I think the most interesting part of the article was when it said "Processor speed has increased many times -- it doubles every two years, while memory is still very slow, doubling every six years."
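Taking the quoted doubling periods at face value (two years for processors, six for memory; these are the article's figures, not measurements), the gap compounds quickly. A back-of-the-envelope sketch in Python, with made-up function names:

```python
def growth(years, doubling_period):
    """Speed-up factor after `years` if speed doubles every `doubling_period` years."""
    return 2 ** (years / doubling_period)


def cpu_memory_gap(years, cpu_doubling=2, mem_doubling=6):
    """How much further the processor has pulled ahead of memory after `years`."""
    return growth(years, cpu_doubling) / growth(years, mem_doubling)


# After 12 years: processor 64x faster, memory 4x faster, so a 16x wider gap.
print(growth(12, 2), growth(12, 6), cpu_memory_gap(12))   # 64.0 4.0 16.0
```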
So maybe it would be more efficient for people to stop screwing around with new processor design ideas for a while and put a little effort in doubling the speed of memory access (and I don't mean by using level whatever caches). Selling motherboards with a faster memory bus would be easy, just give it a cool sounding name kind of like Sega's "Blast Processing". Let's call it "HyperRAM Technology!"
http://www.annexia.org/tmp/multithreading.ps
Rich.
Actually, Intel's research (before HT became reality) said that on average, the instruction decoder was issuing just under 2.5 instructions per tick out of a maximum of 3... so instruction decoder throughput in single-threaded mode is about 75% of maximum.
On AMD's side, the decoder has quadruple outputs and IIRC, AMD's average is 3 out of 4 so again 75% from maximum.
By adding SMT, Intel gave the P4 the potential to keep all instruction ports busy and AMD plans to do the same next year... a single-core A64 with SMT would be interesting but we will have to settle with dual-core dual-threaded A64s and P4s which should be interesting as well.
How do AMD and Intel manage to get 75% single-threaded when we know they will be stalled by RAM? Simple, out-of-order execution - most CPUs can look 32-128 instructions ahead to find something to do while stalled, this is necessary to maximize single-thread performance and would become unnecessary if apps and CPUs became massively multi-threaded, which appears to be what Sun is gunning for.
As far as concurrent SMT is concerned, I think four threads per CPU core will turn out to be the practical maximum for desktop chips. We will probably see this happen once the A64/P4/PM are upgraded to six execution ports, three or four years from now.
The only reason Sun can think of doing a SMTx8/32 chip is because their CPUs run at ~1GHz. At higher speeds, they would not have the necessary timing margins to fit the extra logic to efficiently shuffle execution states between "reserve" and "active" threads.
The only thing Intel's hyperthreading buys you, and what most simultaneous multithreading implementations buy you, is a solution to the cache miss problem. If your pipeline stalls, you simply execute the next thread in the list until you get the data you need.
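A toy model of that "switch to the next thread on a stall" idea, sketched in Python. This is only an illustration of the description above, not of how the P4 or Niagara actually schedule instructions, and the instruction encoding ("op" / "miss"), the one-instruction-per-cycle core, and the simulate function are all assumptions made for the example:

```python
def simulate(threads, miss_latency):
    """Toy switch-on-stall core that issues at most one instruction per cycle.

    threads: list of instruction streams. "op" is one cycle of work; "miss"
    takes one cycle to issue and then blocks that thread for miss_latency
    further cycles. Returns (total_cycles, busy_cycles), counted up to the
    cycle in which the last instruction issues.
    """
    pc = [0] * len(threads)         # next instruction index per thread
    ready_at = [0] * len(threads)   # first cycle each thread may issue again
    cycle = busy = current = 0
    while any(pc[i] < len(t) for i, t in enumerate(threads)):
        # Prefer the current thread, otherwise take the next runnable one.
        chosen = None
        for k in range(len(threads)):
            i = (current + k) % len(threads)
            if pc[i] < len(threads[i]) and ready_at[i] <= cycle:
                chosen = i
                break
        if chosen is not None:
            instr = threads[chosen][pc[chosen]]
            pc[chosen] += 1
            busy += 1
            if instr == "miss":
                ready_at[chosen] = cycle + 1 + miss_latency
                current = (chosen + 1) % len(threads)   # move on while it waits
            else:
                current = chosen
        cycle += 1                  # with nothing runnable, the cycle is idle
    return cycle, busy


if __name__ == "__main__":
    stream = ["op", "miss", "op"]
    print(simulate([stream], 3))       # (6, 3): half the cycles are idle
    print(simulate([stream] * 4, 3))   # (12, 12): stalls hidden by other threads
```

With one thread the core idles for the whole miss latency. With four threads of the same shape there is always another thread ready, so in this toy case every cycle does useful work.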
Now, what some sophisticated designs do, and what I'd expected the P4 to do, is turn the extra parallel execution units into independent ones, so you could issue 2 or 3 instructions simultaneously and forgo all the branch prediction, etc.
Turns out that the P4's 20-stage pipeline needed help. SMT/Hyperthreading was it.