Multithreading - What's it Mean to Developers?
sysadmn writes "Yet another reason not to count Sun out: Chip Multithreading. CMT, as Sun calls it, is the use of hardware to assist in the execution of multiple simultaneous tasks - even on a single processor. This excellent tutorial on Sun's Developer site explains the technology, and why throughput has become more important than absolute speed in the enterprise.
From the intro: Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. ... A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously within a processor on cores. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases a processor's efficiency as wait latencies are minimized. "
How long has hyperthreading been available on Intel CPUs?
I am a developer, mainly in C, and I did a lot of programming on QNX4 with multi-threading (even if the QNX4 implementation is not *really* threads); now I am doing it in Precise/MQX.
Multi-threading comes with synchronization, semaphores, mutexes, etc. Once you know how to deal with them, it's easy.
I don't mean to look a gift horse in the mouth..
..but wouldn't it be even better if it was hyper-multi-threading?
air and light and time and space
this makes me wonder what the effect would be on something like stackless python?
the whole state pickling concept is pretty cool, and kind of throws threads all over..
anime+manga together at last.. in real time.
This is Sun's Niagara Design. The more I learn about it, the more I think that it's nothing that exciting.
From the lack of non-Sun-supplied buzz regarding this technology, it would appear that many people aren't finding it very exciting.
I'm a big tall mofo.
It means we're going to have to learn to program in parallel. We're going to have to parallelize our data processing and we're going to have to learn synchronization and locking methods.
This is nothing new. The diminishing returns and impending limits of single-threaded processing have been coming for a long time now.
Start Running Better Polls
Well, if your data conversions are independent, multithreading might be of benefit to you if you have a hyperthreading processor.
And are you sure you are maxing the processor? Surely you have to wait for disk or network, at least some of the time. If more than 10% or so (number pulled from ass but based on empirical observations) of your time is spent waiting for latent devices, you can benefit from multithreading even on a plain vanilla single CPU system with no hyperthreading.
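A rough sketch of the point above, in Python: when tasks spend most of their time waiting on a device, threads can overlap the waits even on a single CPU. The `fake_io` function and its 50 ms delay are made up for illustration; `time.sleep` merely stands in for a disk or network wait.

```python
import threading
import time

def fake_io(item):
    """Hypothetical stand-in for a disk or network wait (assumed 50 ms)."""
    time.sleep(0.05)
    return item * 2

def run_serial(items):
    """Process items one after another; the waits add up."""
    start = time.perf_counter()
    results = [fake_io(i) for i in items]
    return results, time.perf_counter() - start

def run_threaded(items):
    """Process each item in its own thread; the waits overlap."""
    results = [None] * len(items)

    def worker(slot, item):
        results[slot] = fake_io(item)

    start = time.perf_counter()
    threads = [threading.Thread(target=worker, args=(n, i))
               for n, i in enumerate(items)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start

if __name__ == "__main__":
    _, serial_time = run_serial([1, 2, 3, 4])
    _, threaded_time = run_threaded([1, 2, 3, 4])
    print(f"serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The gain here comes entirely from overlapping the waits, not from extra compute, which is why it does not need hyperthreading or a second CPU.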
Hyperthreading makes a single processor appear as multiple processors to the OS. The OS still has to do all of the loading and storing yadayada associated with threading. From what I gather, CMT handles the threading overhead in hardware for faster context switches. Sort of reminiscent of register windows on the SPARC chip.
Can I still use INKEY in my basic programs? Will multi-threading make it more efficient? Can I actually run a second program on my DOS PC without having to force it as a TSR?
Well, if your data conversions are independent, multithreading might be of benefit to you if you have a hyperthreading processor.
Unless the two execution states overflow your L1 cache, in which case a HT CPU could run slower.
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
It's a nice theory and all, but I'm not too sure about it. If done correctly on a single-threaded system, when one thread is in a wait state waiting on disk activity, the CPU should jump threads and handle other tasks in the meantime. There is more than enough RAM and CPU cache on modern computers to make this quite effective. Also, isn't this what DMA channels are for? Wasn't the purpose of a DMA channel to move data from one location to another while the CPU is performing other tasks? .... This is actually getting back to programming at the hardware level of a 386; it's nothing new.
Throughput computing maximizes the throughput per processor and per system. So a processor with multiple cores will be able to increase the throughput by the number of cores per processor. This increase in performance comes at a lower cost, fewer systems, reduced power consumption, and lower maintenance and administration, with increase in reliability due to fewer systems. (from TFA, emphasis mine)
So it seems they invented a way to linearly scale performance. WOW! But maybe I misunderstood and the thing is over my head.
CC.
TaijiQuan (Huang, 5 loosenings)
Not sure I buy that this "increases a processor's efficiency as wait latencies are minimized". It seems to me that decreasing latency reduces efficiency because you spend a greater percentage of your cycles changing state (overhead) instead of doing useful work. This is why realtime OS'es aren't the norm: they reduce latencies to critical maximums, but at the cost of overall throughput.
It means "Difficult to reproduce bugs".
It worries me how many people just say "it means faster programs and doesn't take much more work". That mindset leads to lazy programmers who A - Can't optimize to save their jobs; and B - Don't actually understand what multithreading really does.
If you consider it easy, you've either just thrown great big global locks on most of your code, in which case your code doesn't actually parallelize well; or you've written what I refer to in my first sentence - Bugs that take an immense effort just to reproduce, nevermind track down and fix.
1.3 Simultaneous Multi-Threading
Simultaneous multi-threading [15],[16],[17] uses hardware threads layered on top of a core to execute instructions from multiple threads. The hardware threads consist of all the different registers to keep track of a thread execution state. These hardware threads are also called logical processors. The logical processors can process instructions from multiple software thread streams simultaneously on a core, as compared to a CMP processor with hardware threads where instructions from only one thread are processed on a core.
SMT processors have an L1 cache per logical processor while the L2 and L3 cache is usually shared. The L2 cache is usually on the processor with the L3 off the processor. SMT processors usually have logic for ILP as well as TLP. The core is not only usually multi-issue for a single thread, but can simultaneously process multiple streams of instructions from multiple software threads.
1.4 Chip Multi-Threading
Chip multi-threading encompasses the techniques of CMP, CMP with hardware threads, and SMT to improve the instructions processed per cycle. To increase the number of instructions processed per cycle, CMT uses TLP [8] (as in Figure 6) as well as ILP (see Figure 5). ILP exploits parallelism within a single thread using compiler and processor technology to simultaneously execute independent instructions from a single thread. There is a limit to the ILP [1],[12],[18] that can be found and executed within a single thread. TLP can be used to improve on ILP by executing parallel tasks from multiple threads simultaneously [18],[19].
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
As many others have already pointed out, Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about.
I was skeptical at first, and read some of those articles showing that some applications could actually run slower. But then I tried it for myself, and I have to admit I've been impressed. My main box is a dual-Xeon, each with Hyperthreading turned on. It appears to Linux as if I have four independent CPUs. A few numerical tasks saturate the processors if I have just two of them running in parallel, but several tasks do fine with four or more copies. My favorite is "make -j 4" - starting four gcc processes in parallel works surprisingly well. How long does it take you to compile the Linux kernel?
The real issue is how large each thread can be (in terms of memory) before it has to access data that is external to the thread. It may mean a lot for gamers running close-to-reality games and also for those that are doing massive calculations.
The important thing is that developers have to be aware of the possibilities and limitations around this technology. Otherwise it would be like throwing a V8 into a T-Ford. It is possible, but you would never be able to utilize the full power.
Another thing is that today's programming languages are limited. C (and C++) are advanced macro assemblers (not really bad, but they require a lot of the programmer). Java has thread support, but it's still the programmer (in most cases) that has to decide. Java is not very efficient either, which of course depends on which platform it's running on in combination with general optimizations. C# is Microsoft's bastard of Java and C++ with the same drawbacks as Java.
There are other languages, but most of them are either too obscure (like Erlang or Prolog) or too unknown.
The point is that a compiler should be able to break out separate threads and/or processes whenever possible to improve performance. It is of course necessary for the programmer to hint to the compiler where it may do this and where it shouldn't, but in any case try to keep the programmer blissfully unaware of the details. The details may depend on the actual system where the application is running. I.e. if the system is busy serving a bunch of users then splitting the application into a bunch of threads is not really what you want, but if you are running alone (or almost alone) then the application should be permitted to allocate more resources. The key is that the allocation has to be dynamic.
Does anybody know of any better languages?
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
This is what it means for me: http://www.cs.bell-labs.com/who/rsc/thread/
Also see Brian W. Kernighan's "A Descent into Limbo" and Dennis M. Ritchie's "The Limbo Programming Language".
And of course Hoare's classic: Communicating Sequential Processes.
Now you can enjoy the power and beauty of the CSP model in Linux and other Unixes thanks to plan9port including libthread and Inferno; yes, it's all Open Source.
"When in doubt, use brute force." Ken Thompson
Actually, the "best" way to implement the design is to split the thread state from the processing elements, then use locking on the elements. If two threads use independent processor elements, they should be simultaneously executable.
By having many instances of the more common processing elements, you would have many of the benefits of "multi-core" (in that you'd have parallel execution in the general case) but the design would be much simpler because you're working at the element level, not the core level.
Yes, none of this is really any different from hyperthreading, multi-core, or any other parallel schemes. All parallel schemes work in essentially the same way, because they all need to preserve states and lock resources.
Personally, I think REAL Parallel Processing CPUs that can handle multiple threads efficiently are already well-enough understood, they just have to become reasonably mainstream.
For myself, I am much more interested in AMD's HyperTransport bus technology, which looks like it could supplant most of the other bus designs out there.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Since I mostly work on J2EE stuff, I let the container take care of the threading for me. The one exception is J2EE Connector Architecture (JCA) bits that use the work manager. Even there, however, most of my work is simply putting a thin JCA layer in place between the outside world and the J2EE stack.
For me, these new chips simply mean increased performance for deployed apps, without any modification to the app code.
Beauty!
668: Neighbour of the Beast
CMT is nothing more than multi-core processors. Sun is using the marketing idea of CMT to hide the fact that the UltraSparc IV is nothing more than two UltraSparc III cores on one chip.
One way to look at this is Sun maximizing their existing engineering efforts. However, by marketing it as some revolutionary feature advance, they're implying that they've done something new and exciting, as opposed to something that IBM is already doing and AMD and Intel are working on.
Beyond that, Sun and Fujitsu have a co-manufacturing and R&D deal now, confirming something those in the enterprise space have been saying for a long time - Fujitsu was making better Sun servers than Sun.
Plus Sun killed plans for the UltraSparc V, leaving only the Niagara. They have the Opteron line pushing up from below, and rapidly evaporating sales at the high end. They're resorting to marketing gibberish to add new features to the product line, while simultaneously offloading R&D and manufacturing to a partner.
Remind me again why Sun is in the hardware business?
Thanks,
Matt
me@mzi.to
As many others have already pointed out, Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about.
As many others know, you know exactly nothing about what you are talking about. HT has basically two sets of registers so that during a cache miss, which would cause a bubble, the chip switches to the other set so it doesn't sit idle. Sun's chips, on the other hand, actually have multiple cores physically doing work at the same time. In fact, were it not for Intel's hideously flawed NetBurst architecture, the hideous hack that is HyperThreading would not provide any performance increase at all (in fact it doesn't so much provide an increase as negate a decrease...). For evidence consider how many Pentium Ms have HT on them... Now I may not be fully correct but I didn't volunteer a comment; I only posted to prevent the misinformation of others. You'll find more on ArsTechnica. I'd link to the article but I can't find anything on their redesigned site.
Your CPU is not doing anything else, at least do something.
Try make -j5 or -j6. Tends to have better results than the -j4 on my dual Xeon rig. And yes, I have benchmarked it.
My blog. Good stuff (when I remember to update it). Read it.
Cool, we did a bunch of research back in the mid 90s using MPI and published some papers about threaded communications and the like inside of MPI implementations. Also, it was common practice back on the i860 paragons with two or three processors per node to devote one of the CPUs totally to communications while the other cranked away.
Also, be careful that you take the working set into consideration. Suppose you had one processor with 1M L2 cache but your problem needed 1.5M data to work on. It runs at around main memory speeds. However, take two such processors and (if you can) divide the data in half, each processor can now fit all of its data inside L2, which runs at L2 speeds. You can see superlinear speedups that way too.
However, what you are saying is pretty much right on... communication overhead is almost all integer work so if you have an FPU compute thread going on and have the communications offloaded to a thread, those two things should play quite nicely on hyperthreaded Intel parts. This is even cheaper than the other past solutions of burning an entire CPU for communications while the other does computation.
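The working-set arithmetic behind that superlinear speedup can be checked in a few lines. This is only a sketch of the example figures from the comment (1M of L2, 1.5M of data); the helper names are made up, and it assumes the data splits evenly with no sharing between processors.

```python
MB = 1024 * 1024

def per_cpu_working_set(total_bytes, cpus):
    """Bytes each CPU touches, assuming an even split and no shared data."""
    return total_bytes / cpus

def fits_in_cache(data_bytes, cache_bytes):
    """True if the whole working set can stay resident in the cache."""
    return data_bytes <= cache_bytes

L2_SIZE = 1 * MB      # figure taken from the comment above
PROBLEM = 1.5 * MB    # figure taken from the comment above

if __name__ == "__main__":
    for cpus in (1, 2):
        share = per_cpu_working_set(PROBLEM, cpus)
        print(cpus, "CPU(s):", share / MB, "MB each, fits in L2:",
              fits_in_cache(share, L2_SIZE))
```

With one processor the 1.5M working set spills out of L2; with two, each 0.75M half fits, so each processor runs at cache speed rather than main memory speed.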
You're using the wrong word in there. Where you use the word "thread", you should be using the word "process" in UNIX parlance. What you are describing is "multi-tasking" in roughly a generic sense. It wasn't invented with the i386; try sometime in the 1960s (I'd have to crack out an OS book to be sure of the date).
Threads are different from processes.
Fundamentally, the standard definition of a thread is: "Separate control of CPU, with the same VM space". Essentially, it's two processes that have precisely the same memory mapped. (I'm sure there are lots of details I just glossed over, but essentially, that's it). One thing this leads to is blowing the L1 cache if you have several threads interacting on the same pieces of memory.
Threads have lots of performance problems, but they also greatly simplify programming as you pay a lot less attention to the "shared memory" aspect of it. You add some locking, and then essentially multiple threads of execution can work on the same bits of memory.
However, you can do roughly what you described at the process level. Apache used to have a significant patch for "State Threads", that used various OS primitives to tell if an OS call would be blocking or not. If it wasn't going to be, it would make the call. If it was, it'd move on and see if there was more interesting work it could do rather than blocking.
Threads are a huge performance win, because any time you have multiple independent tasks they can be performed in parallel, assuming you have enough CPU units. The problem is that most computing problems have some areas that can be worked on separately, and other areas where the threads have to sync up and work serially through some portions. At some point, the overhead of spawning threads and syncing makes it a losing proposition. However, for a lot of things it's the obvious way to speed up performance (in GUI applications, it's nice to have one thread that keeps the screen updated while another fetches data to display, thus avoiding applications that feel non-responsive because the window won't refresh for long periods of time while data is being fetched).
This sounds roughly like they are adding hardware support for threading, just like the TLB hardware got added to make VM run at a sane speed. Fundamentally, we have this cool stuff we do in software that sucks speedwise because the hardware is bad at X.
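The GUI case can be sketched without any real GUI toolkit. In this Python sketch a worker thread does the slow fetch while the main loop keeps "refreshing"; `slow_fetch`, the 50 ms delay, and the refresh counter are all stand-ins invented for illustration.

```python
import queue
import threading
import time

def slow_fetch():
    """Hypothetical slow data source (assumed to take about 50 ms)."""
    time.sleep(0.05)
    return "data"

def run_ui_loop():
    """Keep 'repainting' while a worker thread fetches the data."""
    inbox = queue.Queue()
    worker = threading.Thread(target=lambda: inbox.put(slow_fetch()))
    worker.start()

    refreshes = 0
    while True:
        try:
            # Wait briefly for the result, then go back to repainting.
            result = inbox.get(timeout=0.01)
            break
        except queue.Empty:
            refreshes += 1  # stand-in for redrawing the window
    worker.join()
    return result, refreshes

if __name__ == "__main__":
    result, refreshes = run_ui_loop()
    print("got", result, "after", refreshes, "refreshes")
```

The queue is the only point where the two threads meet, so the "screen" thread never blocks for longer than its own short timeout.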
Kirby
Sun's upcoming "Niagara" chips are supposed to have eight cores, each core being able to execute four threads. So that allows up to 32 threads executing at once -- on one physical chip.
And we're not talking about "HyperThreading" where one of the CPUs is virtual. It's a real execution unit.
And Intel and AMD are talking about dual-cores?
This should help save space and energy (both in the power needed to run the box, and in running the cooling system).
there are still some applications where raw CPU speed matters.
We have been at the throughput-is-good-enough point for several years. In truth, this is old news really. I've got IRIX servers doing lots of things plenty fast, clipping along at a brisk 400 MHz. There is not much you can't do with that, particularly when running a nice NUMA box.
I assume the same holds true for SUN gear. (I think their NUMA performance is a bit lower than the SGI, but I also don't think it matters for a lot of enterprise stuff.)
One application I have running, NUMA style, is MCAD. It's cool in that I have one copy of the software serving about 25 users, running on a nice NUMA server that never breaks. Admin is almost zero, except for the little things that happen from time to time --mostly user related.
However, I'm going to have to migrate this to a win32 platform. (And yes, it's gonna suck.) Why? The peak CPU power available to me is not enough for very large datasets and I cannot easily make the data portable for roaming users. (If there were more MCAD on Linux, I could do this, alas...)
Love it or hate it, the hot running, inefficient Intel / AMD cpu delivers more peak compute than any high I/O UNIX platform does. And it's cheap.
Sun is stating the obvious with the whole I/O thing, IMHO. In doing so, they avoid a core problem; namely, peak compute is not an option under commercial UNIX, and it needs to be. (And where it is, there are no applications, or the cost is just too high...)
This is where Linux is really important. It runs on the fast CPUs, but also is plenty UNIXey to allow smart admins to capture the benefits multi-user computing can provide.
Linux rocks, so does Solaris, IRIX, etc... The difference is that I can get IRIX & Solaris applications.
WISH THAT WOULD CHANGE FASTER THAN IT CURRENTLY IS.
Blogging because I can...
"Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about"
You are wrong. Period. Sun's CMT is several independent CPU cores on the same die with a huge bandwidth interconnect on-die. Intel's Hyperthreading is a gimmicky technology that has a very small real-world impact on performance.
And your personal "benchmarks" cite no numbers. I be trolled!
-- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
While conceptually unrelated, I put threads into the same mental category as untyped pointers. They are extremely powerful, but a complete PITA to debug if anything goes wrong, even moreso if you are maintaining someone else's void* or pthread_create filled application.
What I've always done is code extremely defensively:
1. make the various threads data-independent enough to be free-running and only co-ordinate at the start and finish of a thread's activity. If necessary, re-architect everything in sight to make this possible.
2. when interaction is required, get a nice big coarse-grained lock and do everything that needs to be done and get it over with. profile it; there's a good chance it'll be over with quickly enough that it won't erase gains from parallelism or at least you can see what's taking so long and move it outside the lock.
3. do TONS of load testing with lots of big files and random data. thread-related bugs can often hide for years in your code. Unlike divide by zero or null pointer references, a thread bug won't necessarily give any kind of hardware fault or exception. You have to go hunt for the bugs, they won't just pop up and say hi here i am.
4. If you have multiple people of various technical abilities working on the code, you should add a grep/sed script to your makefile to check for accidental introduction of mt-unsafe library calls (strtok, ctime, etc). Flag new monitors and locks for review. Warn about dumb things like using static or global variables.
5. Last trick is to use a layer to allow your program to be compiled for fork/wait, pthread_create/pthread_join, or just plain old co-routine execution (esp if there is a socket you can set to non-blocking). In addition to being able to test your code for correctness in various situations, you also have a baseline to see if the multithreading is an actual improvement.
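The layering trick in item 5 might look something like the following in Python. This is a minimal sketch, not the poster's code: the `run` dispatcher, the backend names, and the `convert` work function are all hypothetical, and only a serial and a thread-pool backend are shown (a process-based one could be added the same way).

```python
from concurrent.futures import ThreadPoolExecutor

def run_serial(func, items):
    """Baseline backend: plain loop, no concurrency at all."""
    return [func(i) for i in items]

def run_threads(func, items, workers=4):
    """Thread backend: same interface, work spread over a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))  # map preserves input order

# The rest of the program only ever calls run(); the backend is a switch.
BACKENDS = {"serial": run_serial, "threads": run_threads}

def run(func, items, backend="serial"):
    return BACKENDS[backend](func, items)

def convert(n):
    """Hypothetical unit of work."""
    return n * n

if __name__ == "__main__":
    print(run(convert, range(5), backend="serial"))
    print(run(convert, range(5), backend="threads"))
```

Because both backends return results in the same order, the serial run doubles as a correctness check and as the timing baseline for deciding whether the threaded version is an actual improvement.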
With the obvious exceptions for embarrassingly parallel algorithms, I've found that humdrum client/server or middleware stuff:
(a) gets only marginal gains from multithreading
(b) you have to work for it--profiling and tuning are still required to get top-notch performance
(c) efficient scaling beyond a handful of threads is the exception not the rule. If you have more threads than CPUs, it's a simple fact that some of them are going to be waiting and then your scaling is done.
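The build-time check from item 4 in the first list could be approximated by a small script instead of grep/sed. The sketch below is in Python and is deliberately naive: the list of calls is an assumption (functions that have reentrant `_r` counterparts in POSIX), and it does not skip comments or string literals in the C source.

```python
import re

# Assumed list: C library calls that have reentrant *_r variants in POSIX.
MT_UNSAFE = ("strtok", "ctime", "asctime", "gmtime", "localtime")

# \b plus the trailing "(" keeps strtok_r( from matching as strtok(.
PATTERN = re.compile(r"\b(" + "|".join(MT_UNSAFE) + r")\s*\(")

def find_mt_unsafe(source):
    """Return (line number, function name) for each suspicious call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits

# Made-up C fragment used only to exercise the scanner.
SAMPLE = """\
char *tok = strtok(buf, ",");
char *ok = strtok_r(buf, ",", &save);
printf("%s", ctime(&now));
"""

if __name__ == "__main__":
    for lineno, name in find_mt_unsafe(SAMPLE):
        print(f"line {lineno}: {name}() is not thread-safe")
```

Hooked into the build, a non-empty result could fail the target or just flag the change for review, depending on how strict the team wants to be.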
When a person says something, the intended meaning is not ambiguous (unless you are a poet), although the words used to describe that meaning may be.
In this case it was intended to mean "What does it mean" and absolutely nothing else, your grammatical writhings notwithstanding.
More like none of Sun's competitors have anything which comes remotely close.
Notice how nearly a year after Sun announced this, Intel finally admitted that clock frequency (i.e. gigahertz) isn't everything and that they'd be bringing out dual core processors?
Niagara has 8 cores each capable of 0-clock cycle latency switching between 4 different thread contexts.
Who else has working hardware and an OS to go that can do this?
Stick Men
Since the Pentium 4 according to Intel, but it's not a good question as that's Intel's trademarked term for their two-thread implementation of simultaneous multithreading:
By contrast, Niagara is implementing Chip-level multiprocessing:
In other words, Niagara implements in hardware, at greater scale, what Pentium 4 offers as an emulation feature. In theory one could SMP on top of CMP chipsets for even greater throughput. If you find the Sun article too hard, the Wikipedia references I have cited will probably prove much easier to understand.
In fact I do know a better language, Ada95/2005.
It's simply meant for threading and unconventional compiler optimizations (through the enforcement of constraints), while still being imperative and having a familiar syntax. And it's meant to be compiled unlike Java.
Here's a site about Ada and here's another one.
A good (alas not perfect) Ada95 compiler is included in GCC 3.4.
So aye, we are ready for the CMT systems.
Actually, when you think about it an improved threading model would actually strongly benefit well-programmed games. Why? Because there are a lot of semi-related processes occurring. Sound, graphics, physics, etc etc... they're all part of the game but work in very different ways.
Now if you're working with a multithreaded CPU, one processor can be handling your CPU-bound graphics work (much of this is handed off to the video card anyhow), another can be doing sound/surround mixing, etc.
In an FPS with complicated AI, you could theoretically hand that off to CPU #2 while #1 is handling different things. Your graphics engine might not have ugly-mofo-alien #235 onscreen to render, but meanwhile he's watching you and looking for a boulder that will offer him good cover to snipe you from instead of just sitting like a drone waiting for a computer-accurate headshot.
Now let's say that PCs go multi-CPU. Maybe you don't need a single superpowerful processor, just a videocard and a few lower-powered processors. Processor #1 is handing off the environmental data, #2 is prepping it for rendering and shovelling your GPU full of vertices, #3 is playing pinpoint surround for that cricket chirping behind the rock on your far left, and #4 is doing AI for ugly alien mofo #287.
When I think about how games are advancing a lot can come down to interprocess communications and/or bandwidth limitations. The GPU still handles much of the video stuff so your CPU isn't really a bottleneck there in many cases, but as internet connections speed up then you're going to have MMORPGs, FPS's, and more chock full of "actors" that make up sight, sound, physics, and AI that could very well benefit from more CPU's rather than extra ticks on your overclocked single processor.
After all, eye-candy is only a part of realism. True realism is also very much about a multitude of things happening at once.
Make a comment and ask a question and get marked as troll.
Go figure.
Hexy - a strategy game for iPhone/iPod Touch
Ah, but when you have one physical 'chip' that actually consists of four processor cores, you *can* do four simultaneous tasks on one processor.
The advantage over good old fashioned SMP? Well, probably the interconnect is way faster, and if the cores all share some cache or something, sibling threads should see some benefit.
Vintage computer games and RPG books available. Email me if you're interested.
I think the most interesting part of the article was when it said "Processor speed has increased many times -- it doubles every two years, while memory is still very slow, doubling every six years."
So maybe it would be more efficient for people to stop screwing around with new processor design ideas for a while and put a little effort into doubling the speed of memory access (and I don't mean by using level whatever caches). Selling motherboards with a faster memory bus would be easy, just give it a cool sounding name kind of like Sega's "Blast Processing". Let's call it "HyperRAM Technology!"
Losing faith in humanity one person at a time.
http://www.annexia.org/tmp/multithreading.ps
Rich.
libguestfs - tools for accessing and modifying virtual machine disk images
all you need is the ability to run processes... which I do right here.... on this abacus...
-pyrrho
Bruce
Bruce Perens.
I wasn't even assuming they have that much. The minimum you need to make this trick work is two independent contexts. That means two copies of all kernel-visible control and data registers. You would probably not need to save internal microstate unless you need it to restart a long-running instruction.
Anything else on top of that is optimization.
Bruce
Bruce Perens.
I just tested it with GCC 2.95.3, 3.2.1, 3.3, and 3.4.2, and it works fine. Of course, GCC is just ignoring the #pragma. I didn't know about OpenMP before this, but it does look like a good way to "optimize later" and have your code still compile with gcc. And you don't have to write and maintain two different versions separated by #ifdef, #else, #endif.
My other first post is car post.
This is not college. Slashdot does not start with "there are no stupid questions". There are, you asked one, AND it was already more covered than the genitals in a tiroller soft sex movie.
The only thing Intel's hyperthreading buys you, and what most simultaneous multithreading implementations buy you, is a solution to the cache miss problem. If your pipeline stalls, you simply execute the next thread in the list until you get the data you need.
Now, what some sophisticated designs do, and what I'd expected the P4 to do, is turn the extra parallel execution units into independent ones, so you could issue 2 or 3 instructions simultaneously and forgo all the branch prediction, etc.
Turns out that the P4 20 stage pipeline needed help. SMT/Hyperthreading was it.