Multithreading - What's it Mean to Developers?
sysadmn writes "Yet another reason not to count Sun out: Chip Multithreading. CMT, as Sun calls it, is the use of hardware to assist in the execution of multiple simultaneous tasks - even on a single processor. This excellent tutorial on Sun's Developer site explains the technology, and why throughput has become more important than absolute speed in the enterprise.
From the intro: Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. ... A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously within a processor on cores. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases a processor's efficiency as wait latencies are minimized. "
It means "Difficult to reproduce bugs".
It worries me how many people just say "it means faster programs and doesn't take much more work". That mindset leads to lazy programmers who A - Can't optimize to save their jobs; and B - Don't actually understand what multithreading really does.
If you consider it easy, you've either just thrown great big global locks on most of your code, in which case your code doesn't actually parallelize well; or you've written what I refer to in my first sentence - bugs that take an immense effort just to reproduce, never mind track down and fix.
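To make the trade-off concrete, here is a minimal sketch (the names run_counter and state are made up for illustration, not taken from the tutorial). With the lock, the total is always correct, but every increment is serialized, which is the "great big global lock" case. Without it, the read-modify-write on the counter is unsynchronized and updates can be lost, although whether that actually shows up on a given run depends on timing and on the interpreter.

```python
import threading

def run_counter(n_threads, n_increments, lock=None):
    """Increment a shared counter from several threads, optionally under a lock."""
    state = {"count": 0}

    def worker():
        for _ in range(n_increments):
            if lock is not None:
                # Correct, but only one thread makes progress at a time
                with lock:
                    state["count"] += 1
            else:
                # Unsynchronized read-modify-write: updates may be lost
                state["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]

# With the lock, the result is deterministic
locked_total = run_counter(4, 10000, threading.Lock())
```

The unlocked variant is deliberately left without an expected result: it may well come out right a thousand times in a row, which is exactly what makes this class of bug hard to reproduce.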
1.3 Simultaneous Multi-Threading
Simultaneous multi-threading [15],[16],[17] uses hardware threads layered on top of a core to execute instructions from multiple threads. The hardware threads consist of all the different registers to keep track of a thread execution state. These hardware threads are also called logical processors. The logical processors can process instructions from multiple software thread streams simultaneously on a core, as compared to a CMP processor with hardware threads where instructions from only one thread are processed on a core.
SMT processors have an L1 cache per logical processor while the L2 and L3 caches are usually shared. The L2 cache is usually on the processor with the L3 off the processor. SMT processors usually have logic for ILP as well as TLP. The core is not only usually multi-issue for a single thread, but can simultaneously process multiple streams of instructions from multiple software threads.
1.4 Chip Multi-Threading
Chip multi-threading encompasses the techniques of CMP, CMP with hardware threads, and SMT to improve the instructions processed per cycle. To increase the number of instructions processed per cycle, CMT uses TLP [8] (as in Figure 6) as well as ILP (see Figure 5). ILP exploits parallelism within a single thread using compiler and processor technology to simultaneously execute independent instructions from a single thread. There is a limit to the ILP [1],[12],[18] that can be found and executed within a single thread. TLP can be used to improve on ILP by executing parallel tasks from multiple threads simultaneously [18],[19].
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
The real issue is how large each thread can be (in terms of memory) before it has to access data that is external to the thread. It may mean a lot for gamers running close-to-reality games and also for those who are doing massive calculations.
The important thing is that developers have to be aware of the possibilities and limitations of this technology. Otherwise it would be like throwing a V8 into a T-Ford. It is possible, but you would never be able to utilize the full power.
Another thing is that today's programming languages are limited. C (and C++) are advanced macro assemblers (not really bad, but they demand a lot of the programmer). Java has thread support, but it's still the programmer (in most cases) who has to decide. Java is not very efficient either, though that of course depends on which platform it's running on in combination with general optimizations. C# is Microsoft's bastard of Java and C++ with the same drawbacks as Java.
There are other languages, but most of them are either too obscure (like Erlang or Prolog) or too unknown.
The point is that a compiler should be able to break out separate threads and/or processes whenever possible to improve performance. It is of course necessary for the programmer to hint to the compiler where it may do this and where it shouldn't, but in any case the aim is to keep the programmer blissfully unaware of the details. The details may depend on the actual system where the application is running. For example, if the system is busy serving a bunch of users, then splitting the application into a bunch of threads is not really what you want, but if you are running alone (or almost alone) then the application should be permitted to allocate more resources. The key is that the allocation has to be dynamic.
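Short of a compiler doing it for you, a rough sketch of that kind of dynamic allocation might look like this. The function names and the "idle CPUs" heuristic are assumptions made up for illustration, not an established policy; os.getloadavg() is only available on Unix-like systems, so the sketch falls back to assuming an idle machine elsewhere.

```python
import os

def pick_worker_count(cpu_count, load_average, minimum=1):
    """Hypothetical policy: use roughly as many threads as there are idle CPUs."""
    idle = cpu_count - load_average
    return max(minimum, min(cpu_count, int(idle)))

def current_worker_count():
    """Apply the policy to the machine this code is running on."""
    cpus = os.cpu_count() or 1
    try:
        load = os.getloadavg()[0]  # 1-minute load average (Unix only)
    except (AttributeError, OSError):
        load = 0.0  # assumption: treat the system as idle if load is unknown
    return pick_worker_count(cpus, load)
```

On a busy eight-way box the policy backs off to a single thread, and on an idle one it uses all eight. The decision is made at run time, not baked in at compile time.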
Does anybody know of any better languages?
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
I know how to deal with them. It may seem easy at first, but it's actually very hard. Your program can run for days before a thread synchronization bug surfaces and it finally deadlocks. And since it's timing dependent, you can't reproduce it.
In principle there are rules to follow to avoid deadlocks and race conditions, but since they need to be manually enforced, there's always potential for error. At least with memory access bugs the hardware often shows you a segfault; with synchronization problems you usually don't even get that.
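One of those rules is to always acquire locks in the same global order. A minimal sketch, assuming a made-up Account class with a rank field that defines the order (none of these names come from the tutorial):

```python
import threading

class Account:
    """Hypothetical example type: a balance guarded by its own lock."""
    _next_rank = 0

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()
        # Fixed rank assigned at creation; defines the global lock order
        self.rank = Account._next_rank
        Account._next_rank += 1

def transfer(src, dst, amount):
    if src is dst:
        return  # nothing to do, and avoids taking the same lock twice
    # Always lock the lower-ranked account first, whichever direction we move money
    first, second = sorted((src, dst), key=lambda acct: acct.rank)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

def shuffle(src, dst, times):
    for _ in range(times):
        transfer(src, dst, 1)

a, b = Account(100), Account(100)
# Two threads transferring in opposite directions: the classic deadlock setup
t1 = threading.Thread(target=shuffle, args=(a, b, 1000))
t2 = threading.Thread(target=shuffle, args=(b, a, 1000))
t1.start()
t2.start()
t1.join()
t2.join()
```

If transfer locked src first and dst second instead, the two threads could each hold one lock and wait forever for the other. Nothing in the language stops you from writing it that way, which is the "manually enforced" problem.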
I've learned over the years that preemptive multithreading should be used only as a last resort, and even then, it's best to put exactly one synchronization point in the entire app. Self-contained tasks should be dispatched from that point and deliver their results back with little or no interaction with the other threads.
The worst thing you can do is randomly sprinkle a bunch of semaphores, mutexes, etc. all over your app.
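A minimal sketch of the single-dispatch-point approach, using Python's standard thread pool. The word_count task and count_all function are invented for illustration; the point is only that the tasks share nothing and the one place where threads meet is the pool itself.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(text):
    """Self-contained task: reads only its argument, touches no shared state."""
    return len(text.split())

def count_all(documents, max_workers=4):
    # The pool is the only synchronization point: tasks are dispatched here
    # and their results are delivered back here, in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(word_count, documents))

totals = count_all(["one two", "three", "four five six"])
```

There are no explicit mutexes or semaphores anywhere in the application code; the locking lives inside the library's queue.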
There are some significant differences between hyperthreading and Sun's approach.
Tiny amount of background:
The hardest part when trying to run things in parallel is figuring out what you can run in parallel. Example: two operations (pseudocode): c=a+b and d=c+e. These two cannot be run in parallel, since you need the result of a+b before you can start c+e.
With modern operating systems there are many programs running at one time, and they may contain separate threads. One assumption of threading is that threads can run asynchronously to one another - you will not get a situation like that above (okay, okay, I'm simplifying!).
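The same dependency can be sketched in software terms. This is only an analogy at the thread level (the hardware does this per instruction), and the variable names beyond a, b, c, d and e are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def add(x, y):
    return x + y

a, b, e, f, g = 1, 2, 3, 4, 5

with ThreadPoolExecutor(max_workers=2) as pool:
    # Dependent chain: d needs c, so the second add cannot start
    # until the first one has finished.
    c = pool.submit(add, a, b).result()
    d = pool.submit(add, c, e).result()

    # Independent operations: neither needs the other's result,
    # so both can be in flight at the same time.
    future_p = pool.submit(add, a, b)
    future_q = pool.submit(add, f, g)
    p, q = future_p.result(), future_q.result()
```

However many execution units you have, the first pair takes two steps; only the second pair can use two units at once.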
With Hyperthreading, Intel gets the CPU to pretend to the OS that there are actually two of them. They duplicate the fetch and decode units, but only use one execute unit - which probably has several FPUs and Integer units. They rely on an FPU or an Integer unit being available to be able to get a performance benefit.
So Intel (up till now) have duplicated the fetch and decode, but still had the same execute unit.
Sun's approach is to replicate the whole pipeline - fetch, decode, execute. Intel can't really scale hyperthreading beyond two "processors", whereas Sun are aiming to try and execute 8, 16 or even more at one time.
Because of Intel's architecture they can't really scale hyperthreading in this way - for lots of reasons. I'm sure other people can add them.
This really won't be of huge benefit to your Doom3 FPS, but for business apps (think J2EE) or message queues or science applications it will allow compute servers to scale better at heavy loads (i.e. when lots of threads are doing something that isn't IO bound, at the same time).
[ Monday is a terrible way to spend one seventh of your life. ]
CMT is nothing more than multi-core processors. Sun is using the marketing idea of CMT to hide the fact that the UltraSparc IV is nothing more than two UltraSparc III cores on one chip.
One way to look at this is Sun maximizing their existing engineering efforts. However, by marketing it as some revolutionary feature advance, they're implying that they've done something new and exciting, as opposed to something that IBM is already doing and AMD and Intel are working on.
Beyond that, Sun and Fujitsu have a co-manufacturing and R&D deal now, confirming something those in the enterprise space have been saying for a long time - Fujitsu was making better Sun servers than Sun.
Plus Sun killed plans for the UltraSparc V, leaving only the Niagara. They have the Opteron line pushing up from below, and rapidly evaporating sales at the high end. They're resorting to marketing gibberish to add new features to the product line, while simultaneously offloading R&D and manufacturing to a partner.
Remind me again why Sun is in the hardware business?
Thanks,
Matt
me@mzi.to
As many others have already pointed out, Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about.
As many others know, you know exactly nothing about what you are talking about. HT basically has two sets of registers, so that during a cache miss, which would cause a bubble, the chip switches to the other set so it doesn't sit idle. Sun's chips, on the other hand, actually have multiple cores physically doing work at the same time. In fact, were it not for Intel's hideously flawed NetBurst architecture, the hideous hack that is HyperThreading would not provide any performance increase at all (in fact it doesn't so much provide an increase as negate a decrease...). For evidence, consider how many Pentium Ms have HT on them... Now I may not be fully correct, but I didn't volunteer a comment; I only posted to prevent the misinformation of others. You'll find more on ArsTechnica. I'd link to the article but I can't find anything on their redesigned site.
Your CPU is not doing anything else, at least do something.
As far as threading is concerned, one of the few languages I've dealt with that makes mutexes, semaphores, etc. easy to deal with is Java. Most other languages bury the stuff too deep into the proprietary APIs to make them useful. Consider multithreading in win32. We need better programming languages before we can ever start reaping the benefits of good multithreading hardware.
Furthermore, we need to get rid of lazy programming. I'm tired of watching people write slow, lazy, inefficient (in terms of both memory space AND speed) code, and justify its existence with "it'll run fast on the new über-hyper-monkey-quadruple-bucky processors." Too many times, the problem is that you've got slow code running in every thread. If the code wasn't so damned lazy, programmers would care more about nifty new hardware. We're not even coming close to using our current hardware to capacity. I've got a 1.2GHz processor with 1024Mb of RAM, and my box chugs opening an M$ Word doc?! WTF?!
<soapbox>
Most programming in the world is very similar to the universal statu$ symbol in the U.S.A. - a big gas-guzzling SUV. It's not like Jane the Soccer Mom really needs 300hp to haul her kids and groceries around town. Similarly, we have lots of lazy code out there that doesn't do much of anything but consume resources and pollute the environment. A nifty new processor feature won't be noticed in the computing world because it won't get used anyway, just like Jane the Soccer Mom wouldn't notice 100 more horsepower. </soapbox>
I pity the foo that isn't metasyntactic
More like none of Sun's competitors have anything which comes remotely close.
Notice how nearly a year after Sun announced this, Intel finally admitted that clock frequency (i.e. gigahertz) isn't everything and that they'd be bringing out dual core processors?
Niagara has 8 cores each capable of 0-clock cycle latency switching between 4 different thread contexts.
Who else has working hardware and an OS to go that can do this?
Stick Men