Multithreading - What's it Mean to Developers?
sysadmn writes "Yet another reason not to count Sun out: Chip Multithreading. CMT, as Sun calls it, is the use of hardware to assist in the execution of multiple simultaneous tasks - even on a single processor. This excellent tutorial on Sun's Developer site explains the technology, and why throughput has become more important than absolute speed in the enterprise.
From the intro: Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. ... A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously within a processor on cores. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases a processor's efficiency as wait latencies are minimized. "
I am a developper, mainly in C, and I did a lot of programation on QNX4 with multi-threading (even if QNX4 implantation is not *really* threads), now I am doing it in Precise/MQX.
Multi-threading comes with synchronization, semaphore, mutex, etc, once you know how to deal with them, it's easy.
I dont mean to look a gift horse in the mouth..
..but wouldn't it be even better if it was hyper-multi-threading?
air and light and time and space
this makes me wonder what the effect would be on something like stackless python?
the whole state pickling concept is pretty cool, and kind of throws threads all over..
anime+manga together at last.. in real time.
This is Sun's Niagara Design. The more I learn about it, the more I think that it's nothing that exciting.
From the lack of non-Sun-supplied buzz regarding this technology, it would appear that many people aren't finding it very exciting.
I'm a big tall mofo.
It means we're going to have to lean to program in parallel. We're going to have to parallelize our data processing and we're going to have to learn synchronization and locking methods.
This is nothing new. The decreasing returns and impending limits of single threaded processing has been upcoming for a long time now.
Start Running Better Polls
Well, if your data conversions are independent, multithreading might be of benefit to you if you have a hyperthreading processor.
And are you sure you are maxing the processor? Surely you have to wait for disk or network, at least some of the time. If more than 10% or so (number pulled from ass but based on empirical observations) of you time is spent waiting for latent devices, you can benefit from multithreading even on a plain vanilla single CPU system with no hyperthreading.
Can I still use INKEY in my basic programs? Will multi-threading make it more efficient? Can I actually run a second program on my DOS PC without having to force it as a TSR?
Not sure I buy that this "increases a processor's efficiency as wait latencies are minimized". It seems to me that decreasing latency reduces efficiency because you spend a greater percentage of your cycles changing state (overhead) instead of doing useful work. This is why realtime OS'es aren't the norm: they reduce latencies to critical maximums, but at the cost of overall throughput.
[ home ]
It means "Difficult to reproduce bugs".
It worries me how many people just say "it means faster programs and doesn't take much more work". That mindset leads to lazy programmers who A - Can't optimize to save their jobs; and B - Don't actually understand what multithreading really does.
If you consider it easy, you've either just thrown great big global locks on most of your code, in which case your code doesn't actually parallelize well; or you've written what I refer to in my first sentence - Bugs that take an immense effort just to reproduce, nevermind track down and fix.
1.3 Simultaneous Multi-Threading
Simultaneous multi-threading [15],[16],[17] uses hardware threads layered on top of a core to execute instructions from multiple threads. The hardware threads consist of all the different registers to keep track of a thread execution state. These hardware threads are also called logical processors. The logical processors can process instructions from multiple software thread streams simultaneously on a core, as compared to a CMP processor with hardware threads where instructions from only one thread are processed on a core.
SMT processors have a L1 cache per logical processor while the L2 and L3 cache is usually shared. The L2 cache is usually on the processor with the L3 off the processor. SMT processors usually have logic for ILP as well as TLP. The core is is not only usually multi-issue for a single thread, but can simultaneously process multiple streams of instructions from multiple software threads.
1.4 Chip Multi-Threading
Chip multi-threading encompasses the techniques of CMP, CMP with hardware threads, and SMT to improve the instructions processed per cycle. To increase the number of instructions processed per cycle, CMT uses TLP [8] (as in Figure 6) as well as ILP (see Figure 5). ILP exploits parallelism within a single thread using compiler and processor technology to simultaneously execute independent instructions from a single thread. There is a limit to the ILP [1],[12],[18] that can be found and executed within a single thread. TLP can be used to improve on ILP by executing parallel tasks from multiple threads simultaneously [18],[19].
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
The real issue is how large each thread can be (in the matter of memory) before it has to access data that is external to the thread. It may mean a lot for gamers running close to reality games and also for those that are doing massive calculations.
The important thing is that developers has to be aware of the possibilities and limitations around this technology. Otherwise it would be like throwing a V8 into a T-Ford. It is possible, but you would never be able to utilize the full power.
Another thing is that todays programming languages are limited. C (and C++) are advanced macro assemblers (not really bad, but it requires a lot of the programmer). Java has thread support, but it's still the programmer (in most cases) that has to decide. Java is not very efficient either, which of course is depending on which platform it's running on in combination with general optimizations. C# is Microsoft's bastard of Java and C++ with the same drawbacks as Java.
There are other languages, but most of them are either too obscure (like Erlang or Prolog) or too unknown.
The point is that a compiler shall be able to break out separate threads and/or processes whenever possible to improve performance. It is of course necessary for the programmer to hint the compiler where it may do this and where it shouldn't, but in any way try to keep the programmer luckily unknowing about the details. The details may depend on the actual system where the application is running. i.e. if the system is busy with serving a bunch of users then the splitting of the application into a bunch of threads is ot really what you want, but if you are running alone (or almost alone) then the application should be permitted to allocate more resources. The key is that the allocation has to be dynamic.
Anybody knowing of any better languages?
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
This is what it means for me: http://www.cs.bell-labs.com/who/rsc/thread/
Also see Brian W. Kernighan's "A Descent into Limbo" and Dennis M. Ritchie's "The Limbo Programming Language".
And of course Hoare's classic: Communicating Sequential Processes.
Now you can enjoy the power and beauty of the CSP model in Linux and other Unixes thanks to plan9port including libthread and Inferno; yes, it's all Open Source.
"When in doubt, use brute force." Ken Thompson
There are some significant differences between hyperthreading and Suns approach.
Tiny amount of background:
Hardest part when trying to run things in parallel is figuring out what you can run in parallel. Example: two operations (pseudocode): c=a+b and d+c+e. These two cannot be run in parallel, since you need to result of a+b before you can start c+e.
With modern operating systems there are many programs running at one time, and they may contain seperate threads. One assumption of threading is that threads can run asynchronously to one another - you will not get a situtation like that above (okay, okay, I'm simplying!).
With Hyperthreading, Intel gets the CPU to pretend to the OS that there are actually two of them. They duplicate the fetch and decode units, but only use one execute unit - which probably has several FPUs and Integer units. They rely on an FPU or an Integer unit being available to be able to get a performance benefit.
So Intel (up til now) have duplicated the fetch and decode, but still had the same execute unit.
Suns approach is to replicate the whole pipeline - fetch, decode, execute. Intel can't really scale hyperthreading beyond two "processors", whereas Sun are aiming to try and execute 8, 16 or even more at one time.
Because of Intels architecture they can't really scale hyperthreading in this way - for lots of reasons. I'm sure other people can add them.
This really won't be of huge benefit to your Doom3 FPS, but for business apps (think J2EE) or message queues or science applications it will allow compute servers to scale better at heavy loads (i.e. when lots of threads are doing something that isn't IO bound, at the same time).
[ Monday is a terrible way to spend one seventh of your life. ]
Since I mostly work on J2EE stuff, I let the container take care of the threading for me. The one exception is J2EE Connector Architecture (JCA) bits that use the work manager. Even there, however, most of my work is simply putting a thin JCA layer in place between the outside world and the J2EE stack.
For me, these new chips simply mean increased performance for deployed apps, without any modification to the app code.
Beauty!
668: Neighbour of the Beast
The moderators are on crack, today. Intel's hyperthreading is more of a marketing gimmik (which you fell for). It provides, what, a few percent improvement in performance?
The fact is that Intel's Pentiums spend most of their time _not_doing_anything_at_all_. They just sit their waiting on data.
Sun's Niagara will be able to queue 32-threads simultaneously, which 8 of those threads computing (8 cores). My guess is that Sun's analysis showed that, on average, three threads are waiting on memory while one can go forward with the data it has. This means that Sun is betting on your beloved Pentium being only 25% efficient!
I think I've also read that Sun is planning on giving Niagara obsene amounts of bandwidth to RAM. In short, if you are running a web server, for example, it would be stupid to stick with something like Pentium when something like Niagara is available.
CMT is nothing more than multi-core processors. Sun is using the marketing idea of CMT to hide the fact that the UltraSparc IV is nothing more than two UltraSparc III cores on one chip.
One way to look at this is Sun maximizing their existing engineering efforts. However, by marketing it as some revolutionary feature advance, they're implying that they've done something new and exciting, as opposed to something that IBM is already doing and AMD and Intel are working on.
Beyond that, Sun and Fujitsu have a co-manufacturing and R&D deal now, confirming something those in the enterprise space have been saying for a long time - Fujitsu was making better Sun servers than Sun.
Plus Sun killed plans for the UltraSparc V, leaving only the Niagra. They have the Opteron line pushing up from below, and rapidly evaporating sales at the high end. They're resorting to marketing gibberish to add new features to the product line, while simultaneously offloading R&D and manufacturing to a partner.
Remind me again why Sun is in the hardware business?
Thanks,
Matt
me@mzi.to
As many others have already pointed out, Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about.
As many others know, you know exactly nothing about what you are talking about. HT has basically two sets of registers so that during a cache miss which would cuase a bubble the chip switches to the other set so it doesn't sit idle. Suns chip on the other hand actually have multiple corses physically doing work at the same time. In fact were it not for Intel's hideously flawed NetBurst architecture the hideous hack that is HyperThreading would not provide any preformance increase at all (in fact it doesn't as much provide an increase as much as negate a decrease...). For evidence consider how many Pentium Ms have HT on them... Now I may not be fully correct but I didn't volunteer a comment; I only posted to prevent the misinformation of others. You'll find more on ArsTechnica. I'd link to the article but I can't find anything on their redesigned site.
Your CPU is not doing anything else, at least do something.
Sun's upcoming "Niagra" chips are supposed to have eight cores, each core being able to execute four threads. So that allows upto 32 threads executing at once -- on one physical chip.
And we're not talking about "HyperThreading" where one of the CPUs is virtual. It's a real execution unit.
And Intel and AMD are talking about dual-cores?
This should help save space and energy (both in the power needed to run the box, and in running the cooling system).
there are still some applications where raw CPU speed matters.
We have been at the thoughtput is good enough point for several years. In truth, this is old news really. I've got IRIX servers doing lots of things plenty fast, clipping along at a brisk 400Mhz. There is not much you can't do with that, particularly when running a nice NUMA box.
I assume the same holds true for SUN gear. (I think their NUMA performance is a bit lower than the SGI, but I also don't think it matters for a lot of enterprise stuff.)
One application I have running, NUMA style, is MCAD. It's cool in that I have one copy of the software serving about 25 users, running on a nice NUMA server that never breaks. Admin is almost zero, except for the little things that happen from time to time --mostly user related.
However, I'm going to have to migrate this to a win32 platform. (And yes, it's gonna suck.) Why? The peak CPU power available to me is not enough for very large datasets and I cannot easily make the data portable for roaming users. (If there were more MCAD on Linux, I could do this, alas...)
Love it or hate it, the hot running, inefficient Intel / AMD cpu delivers more peak compute than any high I/O UNIX platform does. And it's cheap.
Sun is stating the obvious with the whole I/O thing, IMHO. In doing so, they avoid a core problem; namely, peak compute is not an option under commercial UNIX that needs to be. (And where it is, there are no applications, or the cost is just too high...)
This is where Linux is really important. It runs on the fast CPU's, but also is plenty UNIXey to allow smart admins to capture the benefits multi-user computing can provide.
Linux rocks, so does Solaris, IRIX, etc... The difference is that I can get IRIX & solaris applications.
WISH THAT WOULD CHANGE FASTER THAN IT CURRENTLY IS.
Blogging because I can...
"Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about"
You are wrong. Period. Sun's CMT is several independent CPU cores on the same die with a huge bandwidth interconnect on-die. Intel's Hyperthreading is a gimmicky technology that has a very small real-world impact on performance.
And your personal "benchmarks" cite no numbers. I be trolled!
-- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
While conceptually unrelated, I put threads into the same mental category as untyped pointers. They are extremely powerful, but a complete PITA to debug if anything goes wrong, even moreso if you are maintaining someone else's void* or pthread_create filled application.
What I've always done is code extremely defensively:
1. make the various threads data-independent enough to be free-running and only co-ordinate at the start and finish of a thread's activity. If necessary, re-architect everything in sight to make this possible.
2. when interaction is required, get a nice big coarse-grained lock and do everything that needs to be done and get it over with. profile it; there's a good chance it'll be over with quickly enough that it won't erase gains from parallelism or at least you can see what's taking so long and move it outside the lock.
3. do TONS of load testing with lots of big files and random data. thread-related bugs can often hide for years in your code. Unlike divide by zero or null pointer references, a thread bug won't necessarily give any kind of hardware fault or exception. You have to go hunt for the bugs, they won't just pop up and say hi here i am.
4. If you have multiple people of various technical abilities working on the code, you should add a grep/sed script to your makefile to check for accidental introduction of mt-unsafe library calls (strtok, ctime, etc). Flag new monitors and locks for review. Warn about dumb things like using static or global variables.
5. Last trick is to use a layer to allow your program to be compiled for fork/wait, pthread_create/pthread_join, or just plain old co-routine execution (esp if there is a socket you can set to non-blocking). In addition to being able to test your code for correctness in various situations, you also have a baseline to see if the multithreading is an actual improvement.
With the obvious exceptions for embarassingly parallel algorithms, I've found that humdrum client/server or middleware stuff:
(a) gets only marginal gains from multithreading
(b) you have to work for it--profiling and tuning are still required to get top-notch performance
(c) effectient scaling beyond a handful of threads is the exception not the rule. If you have more threads than CPU's, it's a simple fact that some of them are going to be waiting and then your scaling is done.
More like none of Sun's competitors have anything which comes remotely close.
Notice how nearly a year after Sun announced this, intel finally admitted that clock frequency (i.e. gigahertz) isn't everything and that they'd be bringing out dual core processors?
Niagara has 8 cores each capable of 0-clock cycle latency switching between 4 different thread contexts.
Who else has working hardware and an OS to go that can do this?
Stick Men
Actually, when you think about it an improved threading model would actually strongly benefit well-programmed games. Why? Because there are a lot of semi-related processes occuring. Sound, graphics, physics, etc etc... they're all part of the game but work in very different ways.
Now if you're working with a multithreaded CPU, one processor can be handling your CPU-bound graphics work (much of this is handed off to the video card anyhow), another can be doing sound/surround mixing, etc.
In an FPS with complicated AI, you could theoretically hand that off to CPU #2 while #1 is handling different things. Your graphics engine might not have ugly-mofo-alien #235 onscreen to render, but meanwhile he's watching you and looking for a boulder that will offer him good cover to snipe you from instead of just sitting like a drone waiting for a computer-acurate headshot.
Now let's say that PC's going multi-CPU. Maybe you don't need a single superpowerful processor, just a videocard and a few lower-powerful processors. Processor #1 is handing off the environmental data, #2 is prepping it for rendering and shovelling your GPU full of vertices, #3 is playing pinpoint surround for that cricket chirping behind the rock on your far left, and #4 is doing AI for ugly alien mofo #287.
When I think about how games are advancing a lot can come down to interprocess communications and/or bandwidth limitations. The GPU still handles much of the video stuff so your CPU isn't really a bottleneck there in many cases, but as internet connections speed up then you're going to have MMORPGs, FPS's, and more chock full of "actors" that make up sight, sound, physics, and AI that could very well benefit from more CPU's rather than extra ticks on your overclocked single processor.
After all, eye-candy is only a part of realism. True realism is also very much about a multitude of things happening at once.
I think the most interesting part of the article was when it said "Processor speed has increased many times -- it doubles every two years, while memory is still very slow, doubling every six years."
So maybe it would be more efficent for people to stop screwing around with new processor design ideas for a while and put a little effort in doubling the speed of memory access (and I don't mean by using level whatever caches). Selling motherboards with a faster memory bus would be easy, just give it a cool sounding name kind of like Sega's "Blast Processing". Let's call it "HyperRAM Technology!"
Losing faith in humanity one person at a time.
Hyperthread DOES NOT HAVE ADDITIONAL FETCH and DECODE, it just permits 2 different threads to occupy the the reorder buffer thus reducing penalties as a result of a context switch, so instead of a context switch the CPU fools the OS into thinking it can issue two threads of instruction simultaneously. So fetch is designed to switch between instruction memory locations based on a turn system, so it really starts work on one thread and then in the next cycle begins work on that thread. It keeps 2 separate rename tables one for each instruction, and keeps track of which thread a given instruction is. So essentially execute is the same even the reorder buffer is almost the same but it tracks which thread an op is running on. The tricky part is getting the front end to toggle correctly between the 2 regfiles and the 2 rename tables. Also fetching from different threads of control is also tricky, I think some sort of queue is used.
Fyi, hyperthreading is used on intel because the number of instructions in-flight. The processor during a context switch interupts, saves to the stack, clears out the REGFILE, rename table, and the ROB, losing all the work accomplished that is not written back to the Regfile. So on an AMD processor this is not a huge deal, but on the P4 this is a problem because the frequent context switches that occur on modern systems cause the intel design to lose the advantage of having many instructions in flight. AMD could realize performance gains just not as much and at the cost of clockspeed.
As for CMT, no it is essentially hyperthreading but could be a better, more costly, more effective design than intels simple design. Duplication of a pipeline is a multicore chip which Sun is doing with Niagra.
The only thing intel's hyperthreading buys you, and what most symmetric multithreading implementations buy you, is a solution to the cache miss problem. If your pipeline stalls, you simply execute the next thread in the list until you get the data you need.
Now, in some sophisticated designs, which is what I'd expected the P4 to do, was to turn the extra parallel execution units into independant ones, so you could issue 2 or 3 instructions simultaneously, and forgoe all the branch prediction, etc.
Turns out that the P4 20 stage pipeline needed help. SMT/Hyperthreading was it.
Not at all, because you can add d+e. For example: