Multithreading - What's it Mean to Developers?
sysadmn writes "Yet another reason not to count Sun out: Chip Multithreading. CMT, as Sun calls it, is the use of hardware to assist in the execution of multiple simultaneous tasks - even on a single processor. This excellent tutorial on Sun's Developer site explains the technology, and why throughput has become more important than absolute speed in the enterprise.
From the intro: Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. ... A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously within a processor on cores. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases a processor's efficiency as wait latencies are minimized. "
I am a developer, mainly in C, and I did a lot of programming on QNX4 with multi-threading (even if the QNX4 implementation is not *really* threads); now I am doing it in Precise/MQX.
Multi-threading comes with synchronization, semaphores, mutexes, etc.; once you know how to deal with them, it's easy.
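For readers who have not used these primitives, here is a minimal sketch of the mutex idea. It is in Python rather than the C the poster works in, and `run_counter` and its parameters are made up for illustration; the only point is that the shared read-modify-write sits inside a lock.

```python
import threading

def run_counter(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a mutex."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write below is the critical section.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

total = run_counter()
print(total)  # 40000
```

Without the `with lock:` line the final count is not guaranteed, which is the kind of bug the synchronization primitives exist to prevent.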
Hyperthreading makes a single processor appear as multiple processors to the OS. The OS still has to do all of the loading and storing yadayada associated with threading. From what I gather, CMT handles the threading overhead in hardware for faster context switches. Sort of reminiscent of register windows on the SPARC chip.
It's a nice theory and all, but I'm not too sure about it. If done correctly on a single-threaded system, if one thread is in a wait state waiting on disk activity, then the CPU should jump threads and handle other tasks in the meantime. There is more than enough RAM and CPU cache on modern computers to make this quite effective. Also, isn't this what DMA channels are for? Wasn't the purpose of a DMA channel to move data from one location to another while the CPU is performing other tasks? ... This is actually getting back to programming at the hardware level of a 386; it's nothing new.
1.3 Simultaneous Multi-Threading
Simultaneous multi-threading [15],[16],[17] uses hardware threads layered on top of a core to execute instructions from multiple threads. The hardware threads consist of all the different registers to keep track of a thread execution state. These hardware threads are also called logical processors. The logical processors can process instructions from multiple software thread streams simultaneously on a core, as compared to a CMP processor with hardware threads where instructions from only one thread are processed on a core.
SMT processors have an L1 cache per logical processor, while the L2 and L3 caches are usually shared. The L2 cache is usually on the processor with the L3 off the processor. SMT processors usually have logic for ILP as well as TLP. The core is not only usually multi-issue for a single thread, but can simultaneously process multiple streams of instructions from multiple software threads.
1.4 Chip Multi-Threading
Chip multi-threading encompasses the techniques of CMP, CMP with hardware threads, and SMT to improve the instructions processed per cycle. To increase the number of instructions processed per cycle, CMT uses TLP [8] (as in Figure 6) as well as ILP (see Figure 5). ILP exploits parallelism within a single thread using compiler and processor technology to simultaneously execute independent instructions from a single thread. There is a limit to the ILP [1],[12],[18] that can be found and executed within a single thread. TLP can be used to improve on ILP by executing parallel tasks from multiple threads simultaneously [18],[19].
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
An article on hyperthreading (which is SMT) and CMT (the original CMT, not Sun's new acronym) is at:
http://www.realworldtech.com/page.cfm?ArticleID=RWT122600000000
It's dated a while ago, I think before hyperthreading came out (and Alpha was still being developed). The other two parts of the series are also interesting, and explain some of the possibilities with hardware processor threading. I think the first part has more explanation, but I couldn't find it quickly.
The forums on the site are also good, better in a technical sense than ars-technica or aceshardware and especially slashdot.
The real issue is how large each thread can be (in the matter of memory) before it has to access data that is external to the thread. It may mean a lot for gamers running close to reality games and also for those that are doing massive calculations.
The important thing is that developers have to be aware of the possibilities and limitations of this technology. Otherwise it would be like throwing a V8 into a T-Ford. It is possible, but you would never be able to utilize the full power.
Another thing is that today's programming languages are limited. C (and C++) are advanced macro assemblers (not really bad, but they require a lot from the programmer). Java has thread support, but it's still the programmer (in most cases) who has to decide. Java is not very efficient either, which of course depends on the platform it's running on in combination with general optimizations. C# is Microsoft's bastard of Java and C++ with the same drawbacks as Java.
There are other languages, but most of them are either too obscure (like Erlang or Prolog) or too unknown.
The point is that a compiler should be able to break out separate threads and/or processes whenever possible to improve performance. It is of course necessary for the programmer to hint to the compiler where it may do this and where it shouldn't, but in any case it should try to keep the programmer happily unaware of the details. The details may depend on the actual system where the application is running, i.e. if the system is busy serving a bunch of users, then splitting the application into a bunch of threads is not really what you want, but if you are running alone (or almost alone), then the application should be permitted to allocate more resources. The key is that the allocation has to be dynamic.
Does anybody know of any better languages?
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
This is what it means for me: http://www.cs.bell-labs.com/who/rsc/thread/
Also see Brian W. Kernighan's "A Descent into Limbo" and Dennis M. Ritchie's "The Limbo Programming Language".
And of course Hoare's classic: Communicating Sequential Processes.
Now you can enjoy the power and beauty of the CSP model in Linux and other Unixes thanks to plan9port including libthread and Inferno; yes, it's all Open Source.
"When in doubt, use brute force." Ken Thompson
There are some significant differences between hyperthreading and Sun's approach.
Tiny amount of background:
The hardest part when trying to run things in parallel is figuring out what you can run in parallel. Example: two operations (pseudocode): c=a+b and d+c+e. These two cannot be run in parallel, since you need the result of a+b before you can start c+e.
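The dependency can be sketched at thread level. This is only a rough analogy, since real ILP happens between instructions inside the core and not between Python threads; the sample values and the `add` helper are made up. Anything that consumes `c` has to wait for `a+b`, while a sum that does not involve `c` (here `d+e`) can proceed alongside it.

```python
from concurrent.futures import ThreadPoolExecutor

def add(x, y):
    return x + y

a, b, d, e = 1, 2, 10, 20  # arbitrary sample values

with ThreadPoolExecutor(max_workers=2) as pool:
    # Independent: a+b and d+e share no inputs or outputs,
    # so both can be in flight at the same time.
    fut_c = pool.submit(add, a, b)
    fut_de = pool.submit(add, d, e)
    c = fut_c.result()
    # Dependent: anything that uses c has to wait until a+b has finished.
    total = add(fut_de.result(), c)

print(c, total)  # 3 33
```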
With modern operating systems there are many programs running at one time, and they may contain separate threads. One assumption of threading is that threads can run asynchronously to one another - you will not get a situation like that above (okay, okay, I'm simplifying!).
With Hyperthreading, Intel gets the CPU to pretend to the OS that there are actually two of them. They duplicate the fetch and decode units, but only use one execute unit - which probably has several FPUs and Integer units. They rely on an FPU or an Integer unit being available to be able to get a performance benefit.
So Intel (up until now) have duplicated the fetch and decode, but still had the same execute unit.
Sun's approach is to replicate the whole pipeline - fetch, decode, execute. Intel can't really scale hyperthreading beyond two "processors", whereas Sun are aiming to try and execute 8, 16 or even more at one time.
Because of Intel's architecture they can't really scale hyperthreading in this way - for lots of reasons. I'm sure other people can add them.
This really won't be of huge benefit to your Doom3 FPS, but for business apps (think J2EE) or message queues or science applications it will allow compute servers to scale better at heavy loads (i.e. when lots of threads are doing something that isn't IO bound, at the same time).
[ Monday is a terrible way to spend one seventh of your life. ]
Actually, the "best" way to implement the design is to split the thread state from the processing elements, then use locking on the elements. If two threads use independent processor elements, they should be simultaneously executable.
By having many instances of the more common processing elements, you would have many of the benefits of "multi-core" (in that you'd have parallel execution in the general case) but the design would be much simpler because you're working at the element level, not the core level.
Yes, none of this is really any different from hyperthreading, multi-core, or any other parallel schemes. All parallel schemes work in essentially the same way, because they all need to preserve states and lock resources.
Personally, I think REAL Parallel Processing CPUs that can handle multiple threads efficiently are already well-enough understood, they just have to become reasonably mainstream.
For myself, I am much more interested in AMD's HyperTransport bus technology, which looks like it could supplant most of the other bus designs out there.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
A big webserver or database server is a highly parallel, memory-latency-bound system; each request is an individual thread, and in most database and web servers, locks are fine-grained enough to allow many requests to proceed in parallel, subject to them being able to retrieve the data from RAM or disk in a timely fashion.
I appear to have a blog. Odd.
You're using the wrong word in there. Where you use the word "thread", you should be using the word "process" in UNIX parlance. What you are describing is "multi-tasking" in a roughly generic sense. It wasn't invented with the i386; try sometime in the 1960s (I'd have to crack out an OS book to be sure of the date).
Threads are different from processes.
Fundamentally, the standard definition of a thread is: "Separate control of CPU, with the same VM space". Essentially, it's two processes that have precisely the same memory mapped. (I'm sure there are lots of details I just glossed over, but essentially, that's it). One thing this leads to is blowing the L1 cache if you have several threads interacting on the same pieces of memory.
Threads have lots of performance problems, but they also greatly simplify programming as you pay a lot less attention to the "shared memory" aspect of it. You add some locking, and then essentially multiple threads of execution can work on the same bits of memory.
However, you can do roughly what you described at the process level. Apache used to have a significant patch for "State Threads", that used various OS primitives to tell if an OS call would be blocking or not. If it wasn't going to be, it would make the call. If it was, it'd move on and see if there was more interesting work it could do rather than blocking.
Threads are a huge performance win, because any time you have multiple independent tasks they can be performed in parallel, assuming you have enough CPU units. The problem is that most computing problems have some areas that can be worked on separately, and other areas where the threads have to sync up and work serially through some portions. At some point, the overhead of spawning threads and syncing becomes a losing proposition. However, for a lot of things, it's the obvious way to speed up performance (in GUI applications, it's nice to have one thread that works at keeping the screen updated while another is fetching data to display on the screen, thus avoiding applications that feel non-responsive because the window won't refresh for long periods of time while data is being fetched).
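The GUI case can be sketched with two threads and a queue. This is a toy under stated assumptions, not how any particular toolkit does it: `fetch_data` and `run_ui` are hypothetical names, the "slow I/O" is a short sleep, and the "screen refresh" is just a counter. One thread fetches; the other keeps looping and picks up rows as they arrive instead of blocking until everything is loaded.

```python
import queue
import threading
import time

def fetch_data(out_q, n_items):
    """Stand-in for slow I/O: produce items with a small delay."""
    for i in range(n_items):
        time.sleep(0.01)      # pretend this is a disk or network wait
        out_q.put(f"row {i}")
    out_q.put(None)           # sentinel: no more data

def run_ui(n_items=5):
    q = queue.Queue()
    fetcher = threading.Thread(target=fetch_data, args=(q, n_items))
    fetcher.start()
    shown, refreshes = [], 0
    done = False
    while not done:
        refreshes += 1        # the "screen" keeps refreshing regardless
        try:
            item = q.get(timeout=0.002)
        except queue.Empty:
            continue          # nothing new yet; redraw and poll again
        if item is None:
            done = True
        else:
            shown.append(item)
    fetcher.join()
    return shown, refreshes

shown, refreshes = run_ui()
print(shown, refreshes)
```

The refresh count is normally well above the number of rows, which is the responsiveness being described: the display loop never sits idle waiting for the whole fetch to finish.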
This sounds roughly like they are adding hardware support for threading, just like the TLB hardware got added to make VM run at a sane speed. Fundamentally, it's "we have this cool stuff we do in software that sucks speedwise because the hardware is bad at X".
Kirby
there are still some applications where raw CPU speed matters.
We have been at the "throughput is good enough" point for several years. In truth, this is old news really. I've got IRIX servers doing lots of things plenty fast, clipping along at a brisk 400 MHz. There is not much you can't do with that, particularly when running a nice NUMA box.
I assume the same holds true for Sun gear. (I think their NUMA performance is a bit lower than the SGI, but I also don't think it matters for a lot of enterprise stuff.)
One application I have running, NUMA style, is MCAD. It's cool in that I have one copy of the software serving about 25 users, running on a nice NUMA server that never breaks. Admin is almost zero, except for the little things that happen from time to time --mostly user related.
However, I'm going to have to migrate this to a win32 platform. (And yes, it's gonna suck.) Why? The peak CPU power available to me is not enough for very large datasets and I cannot easily make the data portable for roaming users. (If there were more MCAD on Linux, I could do this, alas...)
Love it or hate it, the hot-running, inefficient Intel / AMD CPU delivers more peak compute than any high-I/O UNIX platform does. And it's cheap.
Sun is stating the obvious with the whole I/O thing, IMHO. In doing so, they avoid a core problem; namely, peak compute is not an option under commercial UNIX, and it needs to be. (And where it is, there are no applications, or the cost is just too high...)
This is where Linux is really important. It runs on the fast CPUs, but is also plenty UNIXey to allow smart admins to capture the benefits multi-user computing can provide.
Linux rocks, so does Solaris, IRIX, etc... The difference is that I can get IRIX & Solaris applications.
WISH THAT WOULD CHANGE FASTER THAN IT CURRENTLY IS.
Blogging because I can...
Skimming the article, it doesn't even seem this processor bothers with out-of-order execution or register renaming; if it stalls, it just starts issuing from a different thread.
Those who fail to understand communication protocols, are doomed to repeat them over port 80.
And actually, this makes me so grumpy that I forgot the whole other piece.
Despite the fact that Sun markets the UltraSparc IV as a single processor, software licensors like BEA and Oracle require that you license their software PER CORE. This means that a "4 processor" UltraSparc IV requires 8 processor licenses for Oracle or Weblogic.
Sun never tells you this, and consequently a lot of people suddenly get tagged with additional licenses if they get audited. BEYOND that, Sun tells people that they can "double their performance" by replacing all of their UltraSparc IIIs with UltraSparc IVs, not explaining that they are doubling their performance because they're doubling the number of processors, AND that doing that upgrade can put them on the hook for literally hundreds of thousands of dollars in software cost.
We've seen a number of companies get bitten by that, and it is downright disingenuous of Sun.
Thanks,
Matt
me@mzi.to
Go price it out, apples to apples, and Sun really can compete with Dell. It will take people a while to really understand this (a double-take is probably in order), but it is true.
More like none of Sun's competitors have anything which comes remotely close.
Notice how nearly a year after Sun announced this, Intel finally admitted that clock frequency (i.e. gigahertz) isn't everything and that they'd be bringing out dual core processors?
Niagara has 8 cores each capable of 0-clock cycle latency switching between 4 different thread contexts.
Who else has working hardware and an OS to go that can do this?
Stick Men
You miss one of the major points in the article, and that is that CMT is not really about the Ultra IV being a fully CMT processor. This is about the Niagara chip. The Niagara chip is truly a CMT processor.
The reason this is so is that it functions as both a chip multi-processor and as a multi-threaded core (although I think I'd consider their multi-threaded cores to be fine-grained multi-threading rather than SMT, but that's a different story altogether). While IBM's POWER5 offers these same advantages (dual core, 2-way SMT cores), this is 4 threads per processor and not overly impressive.
The Niagara chip, in comparison to IBM (and upcoming Intel dual-core/SMT designs), is based on the assumption that at higher clock speeds the CPU is rarely fully utilized (while the P4 can retire up to 3 instructions per cycle, many apps, particularly data-intensive apps, have an IPC of less than 1). The chip contains 8 cores with 4 threads being executed on each core. This means 32 threads can run concurrently. Sure, no single thread will run as fast as it would on a NetBurst, Athlon 64, or POWER chip, but the combined throughput is enormous. Assuming each runs at ~1/4 the speed of its counterpart, that still gives us the equivalent of 8 full-speed threads on a single chip. This is enormous, and will have a major impact on database design (I'm currently doing research on SMT's effect on database algorithms) and the payoffs can be great (as can standard prefetching).
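The back-of-the-envelope arithmetic above, spelled out. The 1/4 relative speed is the poster's assumption, not a measured figure, and the variable names are only for illustration.

```python
from fractions import Fraction

cores = 8
threads_per_core = 4
relative_speed = Fraction(1, 4)  # assumed per-thread speed vs. a fast single-threaded chip

hardware_threads = cores * threads_per_core
equivalent_full_speed_threads = hardware_threads * relative_speed

print(hardware_threads)               # 32
print(equivalent_full_speed_threads)  # 8
```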
I wouldn't recommend writing off CMT as a marketing buzzword etc. The era of throughput computing is upon us; let's just hope Oracle and the other per-processor vendors change their licensing to something that correlates with TPC performance or some other metric that still has meaning, otherwise companies are better off with a couple of massively parallel single-core chips that cost a whole lot more and burn a whole lot more power for the performance they produce.
Phil
Though in theory the Niagara design is another CMT implementation, it's the implementation that is the crux here. CMT theory has been worked on in academia for 6-8 years, I think.
Here is a very informative article on the Niagara design.
For the lazy, some main points from the article:
- The Pentium 4 is a single core dual threaded CMT implementation. The Niagara has 8 cores and each core is capable of executing 4 threads.
- Depending on the model of the application that is executing, a programmer can choose to either utilize it as a single process with multiple threads each mapped on to a hardware thread or as multiple processes mapped to hardware threads. Apart from this, individual cores can also be assigned to an individual process, adding one more level of flexibility.
- Sharing data between threads on the same core is an L1 read and is extremely fast. Sharing data among threads on separate cores is an L2 read (since L2 is shared among cores).
- The new chip provides a lot of flexibility in terms of how the programmer wants to allocate hardware threads across software processes or threads. But it looks like programming on it will be difficult unless the operating system provides very good support for it.
Sun's CNP is modeled after Tera's MTA architecture (now named Cray again), which trades memory latency for throughput. Basically, in MTA (massively threaded architecture) each of 128 processor threads issues a few memory fetch instructions and waits for the memory to arrive (dozens to hundreds of cycles). This happens for every thread, so the effect is that memory fetches and execution time are separated... IOW time=max(execution,fetch) vs time=execution+fetch on normal processors. This also makes having a pipeline irrelevant, so no effort is wasted on branch prediction.
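The two timing formulas can be compared with a toy model. The cycle counts below are hypothetical and the function names are invented for the sketch; it only illustrates why overlapping fetches with execution helps most when the two costs are of similar size.

```python
def serial_time(execution, fetch):
    """Conventional case from the comment: the core waits for memory, so costs add up."""
    return execution + fetch

def overlapped_time(execution, fetch):
    """MTA-style case: fetches for other threads hide behind execution, so the larger cost wins."""
    return max(execution, fetch)

execution, fetch = 100, 300  # hypothetical cycle counts
print(serial_time(execution, fetch))      # 400
print(overlapped_time(execution, fetch))  # 300
```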
That's great for scientific apps since they are massively parallel... Sun has taken the same idea and scaled it down to 4 overlapping threads so normal applications can benefit. While it can be used to run 4 separate process threads at a time, at least the MTA's threading is fine-grained, so what really happens is that the compiler changes a for (;;i++) loop into four (;;i+=4) loops and runs them in parallel.
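The loop transformation can be sketched by hand. This shows only the decomposition, not what any compiler actually emits, and `strided_sum` is a made-up name; in CPython the global interpreter lock also means these threads will not give a real speedup. Each worker runs the i+=4 version of the loop starting at a different offset.

```python
import threading

def strided_sum(data, n_threads=4):
    """Sum `data` with n_threads loops; loop k visits indices k, k+n, k+2n, ..."""
    partial = [0] * n_threads

    def worker(k):
        total = 0
        # The (;;i+=4) loop from the comment, starting at offset k.
        for i in range(k, len(data), n_threads):
            total += data[i]
        # Each thread writes only its own slot, so no lock is needed here.
        partial[k] = total

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)

data = list(range(100))
print(strided_sum(data))  # 4950
```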
This technology done right means a massive performance boost (as in like 25-50%) while also simplifying the processor. Contrast that with Hyperthreading, which complicates the processor and only gets ~5-8% benefit on average... it's mostly designed to minimize context switch times.
In fact I do know a better language, Ada95/2005.
It's simply meant for threading and unconventional compiler optimizations (through the enforcement of constraints), while still being imperative and having a familiar syntax. And it's meant to be compiled unlike Java.
Here's a site about Ada and here's another one.
A good (alas not perfect) Ada95 compiler is included in GCC 3.4.
So aye, we are ready for the CMT systems.
Bruce
Bruce Perens.
Hyperthreading DOES NOT HAVE ADDITIONAL FETCH and DECODE; it just permits 2 different threads to occupy the reorder buffer, thus reducing penalties as a result of a context switch, so instead of a context switch the CPU fools the OS into thinking it can issue two threads of instructions simultaneously. So fetch is designed to switch between instruction memory locations based on a turn system, so it really starts work on one thread and then in the next cycle begins work on the other thread. It keeps 2 separate rename tables, one for each thread, and keeps track of which thread a given instruction belongs to. So essentially execute is the same; even the reorder buffer is almost the same, but it tracks which thread an op is running on. The tricky part is getting the front end to toggle correctly between the 2 regfiles and the 2 rename tables. Fetching from different threads of control is also tricky; I think some sort of queue is used.
FYI, hyperthreading is used on Intel because of the number of instructions in flight. During a context switch the processor interrupts, saves to the stack, and clears out the REGFILE, rename table, and the ROB, losing all the work accomplished that is not written back to the regfile. So on an AMD processor this is not a huge deal, but on the P4 this is a problem because the frequent context switches that occur on modern systems cause the Intel design to lose the advantage of having many instructions in flight. AMD could realize performance gains, just not as much and at the cost of clock speed.
As for CMT, no, it is essentially hyperthreading, but could be a better, more costly, more effective design than Intel's simple design. Duplication of a pipeline is a multicore chip, which Sun is doing with Niagara.
I'm sorry but that's not correct. What you refer to is known as "Fine Grained Multithreading", or the product name "Superthreading".
Intel's product "hyperthreading" is also known as "simultaneous multithreading" and is able to run multiple instruction streams simultaneously in order to maximise use of the functional units when they are not saturated by a single instruction stream. This is in addition to avoiding complete stalls on cache misses.
I wasn't even assuming they have that much. The minimum you need to make this trick work is two independent contexts. That means two copies of all kernel-visible control and data registers. You would probably not need to save internal microstate unless you need it to restart a long-running instruction.
Anything else on top of that is optimization.
Bruce
Bruce Perens.
I just tested it with GCC 2.95.3, 3.2.1, 3.3, and 3.4.2, and it works fine. Of course, GCC is just ignoring the #pragma. I didn't know about OpenMP before this, but it does look like a good way to "optimize later" and have your code still compile with gcc. And you don't have to write and maintain two different versions separated by #ifdef, #else, #endif.
My other first post is car post.
Sure, not everything is like that, but some things are. So quit raining on everyone's parade. ;)
Not at all, because you can add d+e. For example: