Slashdot Mirror


Multithreading - What's it Mean to Developers?

sysadmn writes "Yet another reason not to count Sun out: Chip Multithreading. CMT, as Sun calls it, is the use of hardware to assist in the execution of multiple simultaneous tasks - even on a single processor. This excellent tutorial on Sun's Developer site explains the technology, and why throughput has become more important than absolute speed in the enterprise. From the intro: Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. ... A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously within a processor on cores. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases a processor's efficiency as wait latencies are minimized. "

24 of 357 comments

  1. stackless.. by joeldg · · Score: 4, Interesting

    this makes me wonder what the effect would be on something like stackless python?
    the whole state pickling concept is pretty cool, and kind of throws threads all over..

  2. i dont use multithreading by Anonymous Coward · · Score: 0, Interesting

    anything i write usually maxes out the processor at 100% for days at a time (i deal with huge data conversions)

    so yeah i'd also like to know: what does it mean to me?

    1. Re:i dont use multithreading by fitten · · Score: 2, Interesting

      Cool, we did a bunch of research back in the mid 90s using MPI and published some papers about threaded communications and the like inside of MPI implementations. Also, it was common practice back on the i860 paragons with two or three processors per node to devote one of the CPUs totally to communications while the other cranked away.

      Also, be careful that you take the working set into consideration. Suppose you had one processor with 1M of L2 cache but your problem needed 1.5M of data to work on. It runs at around main memory speeds. However, take two such processors and (if you can) divide the data in half: each processor can now fit all of its data inside L2 and run at L2 speeds. You can see superlinear speedups that way too.
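
      The working-set arithmetic above can be sketched in a few lines of C. This is only an illustration: the helper name `slice_fits_in_l2` is made up, and it assumes the data divides evenly and that nothing else (code, stack, other processes) competes for the cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the post: returns 1 if an evenly divided
 * working set fits in each CPU's private L2 cache, 0 otherwise.
 * Sizes are in kilobytes. */
int slice_fits_in_l2(size_t working_set_kb, size_t l2_kb, unsigned ncpus)
{
    assert(ncpus > 0);
    /* Round up so a remainder is not silently dropped. */
    size_t per_cpu_kb = (working_set_kb + ncpus - 1) / ncpus;
    return per_cpu_kb <= l2_kb;
}
```

      With the numbers from the example, 1536 KB of data against a 1024 KB L2 does not fit on one CPU, but the 768 KB halves fit on two.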

      However, what you are saying is pretty much right on... communication overhead is almost all integer work so if you have an FPU compute thread going on and have the communications offloaded to a thread, those two things should play quite nicely on hyperthreaded Intel parts. This is even cheaper than the other past solutions of burning an entire CPU for communications while the other does computation.

  3. Nothing new. by bigtallmofo · · Score: 3, Interesting

    This is Sun's Niagara Design. The more I learn about it, the more I think that it's nothing that exciting.

    From the lack of non-Sun-supplied buzz regarding this technology, it would appear that many people aren't finding it very exciting.

    --
    I'm a big tall mofo.
    1. Re:Nothing new. by zenslug · · Score: 3, Interesting

      The tech is actually pretty good, although it really depends on your application. If you want to run something single-threaded, then the Niagara chip is not going to impress you at all. The speed of the chip is not where its power is. Understand that the name is rather appropriate (i.e. like a river/waterfall): it is not very fast comparatively, but it can handle large volumes very well. Think massively multithreaded uses.

    2. Re:Nothing new. by SunFan · · Score: 3, Interesting


      What's not exciting about a 32-way single board computer? You don't have to program for it any differently than a 32-way SMP mainframe. Solaris does the rest for you.

      --
      -- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
  4. Efficiency and latency are mutual tradeoffs by squarooticus · · Score: 3, Interesting

    Not sure I buy that this "increases a processor's efficiency as wait latencies are minimized". It seems to me that decreasing latency reduces efficiency because you spend a greater percentage of your cycles changing state (overhead) instead of doing useful work. This is why realtime OSes aren't the norm: they reduce latencies to critical maximums, but at the cost of overall throughput.

    --
    [ home ]
  5. Hyperthreading by Dominic_Mazzoni · · Score: 2, Interesting

    As many others have already pointed out, Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about.

    I was skeptical at first, and read some of those articles showing that some applications could actually run slower. But then I tried it for myself, and I have to admit I've been impressed. My main box is a dual-Xeon, each with Hyperthreading turned on. It appears to Linux as if I have four independent CPUs. A few numerical tasks saturate the processors if I have just two of them running in parallel, but several tasks do fine with four or more copies. My favorite is "make -j 4" - starting four gcc processes in parallel works surprisingly well. How long does it take you to compile the Linux kernel?

  6. Re:it means a lot by fitten · · Score: 2, Interesting

    That's fine for producer/consumer type problems, but there are other types of problems that don't lend themselves to that model.

    I've been programming multithreaded code for a while, too, and giant locking (which is what you describe) is not very efficient much of the time for what I've done in the past. Linux and Solaris had this type of architecture for the kernel at one time and they've long since evolved away from that.

    In short, how you use threads really depends on what you are trying to do. Hammering all multi-threaded programming into this one model may not be efficient or easy. That model does serve nicely for a number of tasks, but not all.

  7. Re:-1, Redundant: Hyperthreading. by Anonymous Coward · · Score: 3, Interesting


    The moderators are on crack, today. Intel's hyperthreading is more of a marketing gimmick (which you fell for). It provides, what, a few percent improvement in performance?

    The fact is that Intel's Pentiums spend most of their time _not_doing_anything_at_all_. They just sit there waiting on data.

    Sun's Niagara will be able to queue 32 threads simultaneously, with 8 of those threads computing (8 cores). My guess is that Sun's analysis showed that, on average, three threads are waiting on memory while one can go forward with the data it has. This means that Sun is betting on your beloved Pentium being only 25% efficient!

    I think I've also read that Sun is planning on giving Niagara obscene amounts of bandwidth to RAM. In short, if you are running a web server, for example, it would be stupid to stick with something like Pentium when something like Niagara is available.

  8. This is just Multi-core processing... by mzito · · Score: 5, Interesting

    CMT is nothing more than multi-core processors. Sun is using the marketing idea of CMT to hide the fact that the UltraSparc IV is nothing more than two UltraSparc III cores on one chip.

    One way to look at this is Sun maximizing their existing engineering efforts. However, by marketing it as some revolutionary feature advance, they're implying that they've done something new and exciting, as opposed to something that IBM is already doing and AMD and Intel are working on.

    Beyond that, Sun and Fujitsu have a co-manufacturing and R&D deal now, confirming something those in the enterprise space have been saying for a long time - Fujitsu was making better Sun servers than Sun.

    Plus Sun killed plans for the UltraSparc V, leaving only the Niagara. They have the Opteron line pushing up from below, and rapidly evaporating sales at the high end. They're resorting to marketing gibberish to add new features to the product line, while simultaneously offloading R&D and manufacturing to a partner.

    Remind me again why Sun is in the hardware business?

    Thanks,
    Matt

    --
    me@mzi.to
  9. Re:it means a lot by leonmergen · · Score: 4, Interesting

    I've learned over the years that preemptive multithreading should be used only as a last resort, and even then, it's best to put exactly one synchronization point in the entire app. Self-contained tasks should be dispatched from that point and deliver their results back with little or no interaction with the other threads.

    Exactly, and that's where design patterns come into play... many of these problems have been formally described in patterns you can follow to avoid this; with thread synchronization, you can use the Half-Sync/Half-Async pattern for example, and you can make a task an Active Object so it can deliver its own results...

    Multi-threaded programming is hard, very hard; but you're not the only one who thinks it's hard, and many researchers have formally described a bunch of rules you can follow... if you follow these rules, you often enough eliminate most of the more complicated problems.

    --
    - Leon Mergen
    http://www.solatis.com
  10. Re:Hyperthreading by PitaBred · · Score: 2, Interesting

    Try make -j5 or -j6. Tends to have better results than the -j4 on my dual Xeon rig. And yes, I have benchmarked it.

  11. Sun's new chips by Anonymous Coward · · Score: 3, Interesting

    Sun's upcoming "Niagara" chips are supposed to have eight cores, each core being able to execute four threads. So that allows up to 32 threads executing at once -- on one physical chip.

    And we're not talking about "HyperThreading" where one of the CPUs is virtual. It's a real execution unit.

    And Intel and AMD are talking about dual-cores?

    This should help save space and energy (both in the power needed to run the box, and in running the cooling system).

  12. Re:Hyperthreading by SunFan · · Score: 4, Interesting


    "Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about"

    You are wrong. Period. Sun's CMT is several independent CPU cores on the same die with a huge bandwidth interconnect on-die. Intel's Hyperthreading is a gimmicky technology that has a very small real-world impact on performance.

    And your personal "benchmarks" cite no numbers. I be trolled!

    --
    -- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
  13. what MT means to developers by fred+fleenblat · · Score: 3, Interesting

    While conceptually unrelated, I put threads into the same mental category as untyped pointers. They are extremely powerful, but a complete PITA to debug if anything goes wrong, even more so if you are maintaining someone else's void*- or pthread_create-filled application.

    What I've always done is code extremely defensively:
    1. make the various threads data-independent enough to be free-running and only co-ordinate at the start and finish of a thread's activity. If necessary, re-architect everything in sight to make this possible.
    2. when interaction is required, get a nice big coarse-grained lock and do everything that needs to be done and get it over with. profile it; there's a good chance it'll be over with quickly enough that it won't erase gains from parallelism or at least you can see what's taking so long and move it outside the lock.
    3. do TONS of load testing with lots of big files and random data. thread-related bugs can often hide for years in your code. Unlike divide by zero or null pointer references, a thread bug won't necessarily give any kind of hardware fault or exception. You have to go hunt for the bugs, they won't just pop up and say hi here i am.
    4. If you have multiple people of various technical abilities working on the code, you should add a grep/sed script to your makefile to check for accidental introduction of mt-unsafe library calls (strtok, ctime, etc). Flag new monitors and locks for review. Warn about dumb things like using static or global variables.
    5. Last trick is to use a layer to allow your program to be compiled for fork/wait, pthread_create/pthread_join, or just plain old co-routine execution (esp if there is a socket you can set to non-blocking). In addition to being able to test your code for correctness in various situations, you also have a baseline to see if the multithreading is an actual improvement.
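
    Points 1 and 2 above can be sketched roughly as follows. This is a minimal illustration with POSIX threads, not anyone's production code: the names `threaded_sum`, `sum_slice`, `struct slice` and `MAX_WORKERS` are invented for the example, and summing an array stands in for whatever the real per-thread work would be. Each worker runs free on its own slice and takes the one coarse lock exactly once, at the end.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16              /* arbitrary cap for this sketch */

/* All names here are hypothetical, chosen for illustration only. */
struct slice {
    const long *data;               /* start of this worker's private range */
    size_t len;                     /* number of elements in the range */
    long *total;                    /* shared accumulator */
    pthread_mutex_t *lock;          /* the single coarse-grained lock */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long partial = 0;

    /* Free-running part: touches only this worker's own data. */
    for (size_t i = 0; i < s->len; i++)
        partial += s->data[i];

    /* One short critical section per worker, at the very end. */
    pthread_mutex_lock(s->lock);
    *s->total += partial;
    pthread_mutex_unlock(s->lock);
    return NULL;
}

/* Returns 0 and stores the sum in *out, or -1 if setup failed. */
int threaded_sum(const long *data, size_t n, int workers, long *out)
{
    assert(workers > 0 && workers <= MAX_WORKERS);

    pthread_t tid[MAX_WORKERS];
    struct slice job[MAX_WORKERS];
    pthread_mutex_t lock;
    long total = 0;
    int started = 0;
    size_t chunk = n / (size_t)workers;

    if (pthread_mutex_init(&lock, NULL) != 0)
        return -1;

    for (int w = 0; w < workers; w++) {
        size_t begin = (size_t)w * chunk;
        /* The last worker also takes any leftover elements. */
        size_t end = (w == workers - 1) ? n : begin + chunk;
        job[w] = (struct slice){ data + begin, end - begin, &total, &lock };
        if (pthread_create(&tid[w], NULL, sum_slice, &job[w]) != 0)
            break;
        started++;
    }

    /* Coordinate only at the finish: wait for every started worker. */
    for (int w = 0; w < started; w++)
        pthread_join(tid[w], NULL);
    pthread_mutex_destroy(&lock);

    if (started != workers)
        return -1;
    *out = total;
    return 0;
}
```

    Calling `threaded_sum(a, 7, 3, &out)` on the values 1 through 7 splits the array into slices of 2, 2 and 3 elements. Because the lock is held only for one addition per worker, profiling it (as point 2 suggests) should show it costing very little.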

    With the obvious exceptions for embarrassingly parallel algorithms, I've found that humdrum client/server or middleware stuff:
    (a) gets only marginal gains from multithreading
    (b) you have to work for it--profiling and tuning are still required to get top-notch performance
    (c) efficient scaling beyond a handful of threads is the exception not the rule. If you have more threads than CPUs, it's a simple fact that some of them are going to be waiting and then your scaling is done.

  14. Re:What DOES it mean to me? by Anonymous Coward · · Score: 1, Interesting
    You are using the wrong tools.

    Just as managing memory is a "hard problem", but malloc() and free() make it safer, there are tools that let you use threads safely and easily too.

    Consider using something like OpenMP. There is nothing dangerous or risky or hard to debug in examples like

    /* sum goes only in the reduction clause; OpenMP rejects a variable listed as both private and reduction */
    #pragma omp parallel for reduction(+: sum)
    for(ii = 0; ii < n; ii++){
        sum = sum + some_complex_long_function(a[ii]);
    }
    If you're trying to write your own "thread create" "thread join" stuff by hand, you're wasting your time and your employer's resources in the same way as if you decided to re-write your own garbage collector.
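
    For anyone who wants to try it, here is a self-contained version of that kind of loop, offered as a sketch. The wrapper `parallel_total` and the body of `some_complex_long_function` (a plain square) are placeholders invented for the example. The accumulator is listed only in the reduction clause, since OpenMP does not allow one variable in both a private and a reduction clause on the same directive.

```c
#include <stddef.h>

/* Hypothetical stand-in for an expensive per-element computation. */
static double some_complex_long_function(double x)
{
    return x * x;
}

/* Sums f(a[i]) over the array. With OpenMP enabled, each thread keeps
 * its own copy of sum and the copies are added together at the end. */
double parallel_total(const double *a, int n)
{
    double sum = 0.0;
    int ii;

    #pragma omp parallel for reduction(+: sum)
    for (ii = 0; ii < n; ii++) {
        sum += some_complex_long_function(a[ii]);
    }
    return sum;
}
```

    With gcc, building with -fopenmp turns the pragma on; without that flag the pragma is ignored and the same code runs serially, which gives you a baseline to compare against.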
  15. Games in general would *LOVE* this if done right. by phorm · · Score: 3, Interesting

    Actually, when you think about it an improved threading model would actually strongly benefit well-programmed games. Why? Because there are a lot of semi-related processes occurring. Sound, graphics, physics, etc etc... they're all part of the game but work in very different ways.

    Now if you're working with a multithreaded CPU, one processor can be handling your CPU-bound graphics work (much of this is handed off to the video card anyhow), another can be doing sound/surround mixing, etc.

    In an FPS with complicated AI, you could theoretically hand that off to CPU #2 while #1 is handling different things. Your graphics engine might not have ugly-mofo-alien #235 onscreen to render, but meanwhile he's watching you and looking for a boulder that will offer him good cover to snipe you from instead of just sitting like a drone waiting for a computer-accurate headshot.

    Now let's say that PCs are going multi-CPU. Maybe you don't need a single superpowerful processor, just a videocard and a few lower-powered processors. Processor #1 is handing off the environmental data, #2 is prepping it for rendering and shovelling your GPU full of vertices, #3 is playing pinpoint surround for that cricket chirping behind the rock on your far left, and #4 is doing AI for ugly alien mofo #287.

    When I think about how games are advancing, a lot can come down to interprocess communications and/or bandwidth limitations. The GPU still handles much of the video stuff so your CPU isn't really a bottleneck there in many cases, but as internet connections speed up then you're going to have MMORPGs, FPS's, and more chock full of "actors" that make up sight, sound, physics, and AI that could very well benefit from more CPUs rather than extra ticks on your overclocked single processor.

    After all, eye-candy is only a part of realism. True realism is also very much about a multitude of things happening at once.

  16. Re:way to get it wrong by at_18 · · Score: 2, Interesting

    As many others know, you know exactly nothing about what you are talking about.

    Dude, you don't know anything either. P4's hyperthreading is a two-threads implementation of Simultaneous multithreading. Niagara is an 8-way multiprocessor on a chip, and each processor has four-way simultaneous multithreading, exactly like the P4, just with more threads.

    Regarding the number of concurrent threads, it's basically equivalent to a 16-way Xeon server with hyperthreading enabled, but with much faster inter-processor communication (since it's all on the same chip), and of course much lower cost, heat dissipation, etc.

  17. HyperRAM Technology! by MrNybbles · · Score: 3, Interesting

    I think the most interesting part of the article was when it said "Processor speed has increased many times -- it doubles every two years, while memory is still very slow, doubling every six years."

    So maybe it would be more efficient for people to stop screwing around with new processor design ideas for a while and put a little effort into doubling the speed of memory access (and I don't mean by using level whatever caches). Selling motherboards with a faster memory bus would be easy, just give it a cool sounding name kind of like Sega's "Blast Processing". Let's call it "HyperRAM Technology!"

    --
    Losing faith in humanity one person at a time.
  18. Paper on multithreading by Richard+W.M.+Jones · · Score: 2, Interesting
    It's not a particularly new idea. I wrote a pretty detailed paper at university about multithreading. You can read it here:

    http://www.annexia.org/tmp/multithreading.ps

    Rich.

  19. Re:it means a lot by fupeg · · Score: 3, Interesting
    As far as threading is concerned, one of the few languages I've dealt with that makes mutexes, semaphores, etc. easy to deal with is Java
    Umm, ok. Java has always made synchronization easy to use. It's never been particularly straightforward, because of Java's interpretive nature and all the wonderful JIT liberties allowed for JVMs. Just look at all the confusion around double-checked locking. JDK 1.5 is the first version of Java to formally expose semaphores. Now they are "easy" to use just like synchronization. Verdict is still out on how easy they are to understand.
    Furthermore, we need to get rid of lazy programming.
    Oh brother, here we go again. Let me guess, you could probably write a multi-threaded database server that supported fully ATOMIC operations and transactionality, would only need 4K of memory, and would be blazingly fast on a 486SX machine, right? Over-optimization pundits are the worst, even worse than design pattern pundits. This has been discussed many times before. Fast, buggy code has zero value.
  20. Re:-1, Redundant: Hyperthreading. by InvalidError · · Score: 2, Interesting

    Actually, Intel's research (before HT became reality) said that on average, the instruction decoder was issuing just under 2.5 instructions per tick out of a maximum of 3... so instruction decoder throughput in single-threaded mode is about 75% of maximum.

    On AMD's side, the decoder has quadruple outputs and IIRC, AMD's average is 3 out of 4 so again 75% of maximum.

    By adding SMT, Intel gave the P4 the potential to keep all instruction ports busy and AMD plans to do the same next year... a single-core A64 with SMT would be interesting but we will have to settle with dual-core dual-threaded A64s and P4s which should be interesting as well.

    How do AMD and Intel manage to get 75% single-threaded when we know they will be stalled by RAM? Simple, out-of-order execution - most CPUs can look 32-128 instructions ahead to find something to do while stalled, this is necessary to maximize single-thread performance and would become unnecessary if apps and CPUs became massively multi-threaded, which appears to be what Sun is gunning for.

    As far as concurrent SMT is concerned, I think four threads per CPU core will turn out to be the practical maximum for desktop chips. We will probably see this happen once the A64/P4/PM are upgraded to six execution ports, three or four years from now.

    The only reason Sun can think of doing an SMTx8/32 chip is because their CPUs run at ~1GHz. At higher speeds, they would not have the necessary timing margins to fit the extra logic to efficiently shuffle execution states between "reserve" and "active" threads.

  21. Re:really....? by ckaminski · · Score: 3, Interesting

    The only thing Intel's hyperthreading buys you, and what most simultaneous multithreading implementations buy you, is a solution to the cache miss problem. If your pipeline stalls, you simply execute the next thread in the list until you get the data you need.

    Now, what some sophisticated designs do, and what I'd expected the P4 to do, is turn the extra parallel execution units into independent ones, so you could issue 2 or 3 instructions simultaneously and forgo all the branch prediction, etc.

    Turns out that the P4's 20-stage pipeline needed help. SMT/Hyperthreading was it.