Slashdot Mirror


Multithreading - What's it Mean to Developers?

sysadmn writes "Yet another reason not to count Sun out: Chip Multithreading. CMT, as Sun calls it, is the use of hardware to assist in the execution of multiple simultaneous tasks - even on a single processor. This excellent tutorial on Sun's Developer site explains the technology, and why throughput has become more important than absolute speed in the enterprise. From the intro: Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. ... A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously within a processor on cores. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases a processor's efficiency as wait latencies are minimized. "

80 of 357 comments

  1. -1, Redundant: Hyperthreading. by Anonymous Coward · · Score: 2, Insightful

    How long has hyperthreading been available on Intel CPU's?

    1. Re:-1, Redundant: Hyperthreading. by johnhennessy · · Score: 5, Informative

      There are some significant differences between hyperthreading and Sun's approach.

      Tiny amount of background:

      Hardest part when trying to run things in parallel is figuring out what you can run in parallel. Example: two operations (pseudocode): c=a+b and d=c+e. These two cannot be run in parallel, since you need the result of a+b before you can start c+e.

      With modern operating systems there are many programs running at one time, and they may contain separate threads. One assumption of threading is that threads can run asynchronously to one another - you will not get a situation like that above (okay, okay, I'm simplifying!).

      With Hyperthreading, Intel gets the CPU to pretend to the OS that there are actually two of them. They duplicate the fetch and decode units, but only use one execute unit - which probably has several FPUs and Integer units. They rely on an FPU or an Integer unit being available to be able to get a performance benefit.

      So Intel (up til now) have duplicated the fetch and decode, but still had the same execute unit.

      Sun's approach is to replicate the whole pipeline - fetch, decode, execute. Intel can't really scale hyperthreading beyond two "processors", whereas Sun are aiming to try and execute 8, 16 or even more at one time.

      Because of Intel's architecture they can't really scale hyperthreading in this way - for lots of reasons. I'm sure other people can add them.

      This really won't be of huge benefit to your Doom3 FPS, but for business apps (think J2EE) or message queues or science applications it will allow compute servers to scale better at heavy loads (i.e. when lots of threads are doing something that isn't IO bound, at the same time).

      --
      [ Monday is a terrible way to spend one seventh of your life. ]
    2. Re:-1, Redundant: Hyperthreading. by Anonymous Coward · · Score: 3, Interesting


      The moderators are on crack, today. Intel's hyperthreading is more of a marketing gimmick (which you fell for). It provides, what, a few percent improvement in performance?

      The fact is that Intel's Pentiums spend most of their time _not_doing_anything_at_all_. They just sit there waiting on data.

      Sun's Niagara will be able to queue 32 threads simultaneously, with 8 of those threads computing (8 cores). My guess is that Sun's analysis showed that, on average, three threads are waiting on memory while one can go forward with the data it has. This means that Sun is betting on your beloved Pentium being only 25% efficient!

      I think I've also read that Sun is planning on giving Niagara obscene amounts of bandwidth to RAM. In short, if you are running a web server, for example, it would be stupid to stick with something like Pentium when something like Niagara is available.

    3. Re:-1, Redundant: Hyperthreading. by Anonymous Coward · · Score: 4, Informative

      Hyperthreading DOES NOT HAVE ADDITIONAL FETCH and DECODE; it just permits 2 different threads to occupy the reorder buffer, thus reducing the penalties of a context switch. Instead of a context switch, the CPU fools the OS into thinking it can issue two threads of instructions simultaneously. Fetch is designed to switch between instruction memory locations on a turn system, so it starts work on one thread and then in the next cycle begins work on the other thread. It keeps 2 separate rename tables, one for each thread, and keeps track of which thread a given instruction belongs to. So essentially execute is the same, and even the reorder buffer is almost the same, except that it tracks which thread an op is running on. The tricky part is getting the front end to toggle correctly between the 2 regfiles and the 2 rename tables. Fetching from different threads of control is also tricky; I think some sort of queue is used.

      Fyi, hyperthreading is used on Intel because of the number of instructions in flight. During a context switch the processor interrupts, saves to the stack, and clears out the REGFILE, rename table, and the ROB, losing all the work accomplished that is not written back to the Regfile. On an AMD processor this is not a huge deal, but on the P4 it is a problem because the frequent context switches that occur on modern systems cause the Intel design to lose the advantage of having many instructions in flight. AMD could realize performance gains, just not as much, and at the cost of clockspeed.

      As for CMT, no, it is essentially hyperthreading, but could be a better, more costly, more effective design than Intel's simple design. Duplication of a pipeline is a multicore chip, which Sun is doing with Niagara.

    4. Re:-1, Redundant: Hyperthreading. by InvalidError · · Score: 2, Interesting

      Actually, Intel's research (before HT became reality) said that on average, the instruction decoder was issuing just under 2.5 instructions per tick out of a maximum of 3... so instruction decoder throughput in single-threaded mode is about 75% of maximum.

      On AMD's side, the decoder has quadruple outputs and IIRC, AMD's average is 3 out of 4 so again 75% from maximum.

      By adding SMT, Intel gave the P4 the potential to keep all instruction ports busy and AMD plans to do the same next year... a single-core A64 with SMT would be interesting but we will have to settle with dual-core dual-threaded A64s and P4s which should be interesting as well.

      How do AMD and Intel manage to get 75% single-threaded when we know they will be stalled by RAM? Simple, out-of-order execution - most CPUs can look 32-128 instructions ahead to find something to do while stalled, this is necessary to maximize single-thread performance and would become unnecessary if apps and CPUs became massively multi-threaded, which appears to be what Sun is gunning for.

      As far as concurrent SMT is concerned, I think four threads per CPU core will turn out to be the practical maximum for desktop chips. We will probably see this happen once the A64/P4/PM are upgraded to six execution ports, three or four years from now.

      The only reason Sun can think of doing a SMTx8/32 chip is because their CPUs run at ~1GHz. At higher speeds, they would not have the necessary timing margins to fit the extra logic to efficiently shuffle execution states between "reserve" and "active" threads.

    5. Re:-1, Redundant: Hyperthreading. by lachlan76 · · Score: 3, Informative
      Hardest part when trying to run things in parallel is figuring out what you can run in parallel. Example: two operations (pseudocode): c=a+b and d=c+e. These two cannot be run in parallel, since you need the result of a+b before you can start c+e.

      Not at all, because you can add d+e. For example:
      No multithreading:
      add a,b
      add d,a
      add d,e

      With multithreading:
      First thread:
      add a,b
      ;Make sure other thread is finished
      add a,d

      Second thread:
      add d,e
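
A minimal Python sketch of the split described in this reply, assuming the goal is the total a+b+d+e (the reading the reply takes). The names `parallel_sum` and `add` are made up for illustration. The two partial sums are independent, so they can run on separate threads; the final add has to wait for both.

```python
import threading

def parallel_sum(a, b, d, e):
    """Compute (a + b) + (d + e) with the two partial sums on separate threads."""
    partial = {}

    def add(key, x, y):
        # Each thread writes its own key, so no lock is needed here.
        partial[key] = x + y

    t1 = threading.Thread(target=add, args=("ab", a, b))
    t2 = threading.Thread(target=add, args=("de", d, e))
    t1.start()
    t2.start()
    # "Make sure other thread is finished": join both before combining.
    t1.join()
    t2.join()
    return partial["ab"] + partial["de"]

print(parallel_sum(1, 2, 3, 4))  # 10
```

The last addition still depends on both partial results, which is the same kind of dependency the parent post was pointing at: it can be moved around, but not removed.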
  2. it means a lot by Anonymous Coward · · Score: 4, Informative

    I am a developer, mainly in C, and I did a lot of programming on QNX4 with multi-threading (even if the QNX4 implementation is not *really* threads), now I am doing it in Precise/MQX.
    Multi-threading comes with synchronization, semaphore, mutex, etc, once you know how to deal with them, it's easy.

    1. Re:it means a lot by BigSven · · Score: 2, Informative

      The CVS version of GIMP has Altivec support. That makes it two applications already ;)

    2. Re:it means a lot by Waffle+Iron · · Score: 5, Insightful
      Multi-threading comes with synchronization, semaphore, mutex, etc, once you know how to deal with them, it's easy.

      I know how to deal with them. It may seem easy at first, but it's actually very hard. Your program can run for days before a thread synchronization bug surfaces and it finally deadlocks. And since it's timing dependent, you can't reproduce it.

      In principle there are rules to follow to avoid deadlocks and race conditions, but since they need to be manually enforced, there's always potential for error. At least with memory access bugs the hardware often shows you a segfault; with synchronization problems you usually don't even get that.
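
One of those rules, sketched in Python under some assumptions: always take locks in a single global order. `Account` and `transfer` are hypothetical names, and the sketch assumes the two accounts are distinct and have unique names.

```python
import threading

class Account:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Rule: always acquire locks in one global order (here: by account name),
    # so two opposite transfers can never each hold one lock and wait on the other.
    # Assumes src is not dst; the lock is not reentrant.
    first, second = sorted((src, dst), key=lambda acct: acct.name)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

a = Account("a", 100)
b = Account("b", 100)
threads = [threading.Thread(target=transfer, args=args)
           for args in [(a, b, 10), (b, a, 5)] * 5]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a.balance, b.balance)  # 75 125
```

Nothing enforces the ordering rule except the discipline inside `transfer`, which is exactly the "manually enforced" problem: one code path that grabs the locks in the other order brings the deadlock back.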

      I've learned over the years that preemptive multithreading should be used only as a last resort, and even then, it's best to put exactly one synchronization point in the entire app. Self-contained tasks should be dispatched from that point and deliver their results back with little or no interaction with the other threads.

      The worst thing you can do is randomly sprinkle a bunch of semaphores, mutexes, etc. all over your app.
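
A rough Python sketch of the "one synchronization point" style, assuming a thread pool stands in for the dispatcher. `word_count` and `count_all` are illustrative names, not from any particular codebase.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(text):
    # Self-contained task: touches only its argument, shares nothing.
    return len(text.split())

def count_all(documents, workers=4):
    # The executor is the single synchronization point: tasks go out from
    # here and results come back here; workers never talk to each other.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(word_count, documents))

print(count_all(["one two", "three", "four five six"]))  # [2, 1, 3]
```

All the locking lives inside the pool's internal queue, so the application code itself contains no semaphores or mutexes to get wrong.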

    3. Re:it means a lot by fitten · · Score: 2, Interesting

      That's fine for producer/consumer type problems, but there are other types of problems that don't lend themselves to that model.

      I've been programming multithreaded code for a while, too, and giant locking (which is what you describe) is not very efficient much of the time for what I've done in the past. Linux and Solaris had this type of architecture for the kernel at one time and they've long since evolved away from that.

      In short, how you use threads really depends on what you are trying to do. Hammering all multi-threaded programming into this one model may not be efficient or easy. That model does serve nicely for a number of tasks, but not all.

    4. Re:it means a lot by leonmergen · · Score: 4, Interesting

      I've learned over the years that preemptive multithreading should be used only as a last resort, and even then, it's best to put exactly one synchronization point in the entire app. Self-contained tasks should be dispatched from that point and deliver their results back with little or no interaction with the other threads.

      Exactly, and that's where design patterns come into play... many of these problems have been formally described in patterns you can follow to avoid this; with thread synchronization, you can use the Half-Sync/Half-Async pattern for example, and you can make a task an Active Object so it can deliver its own results...
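
A minimal Active Object sketch in Python, under the assumption that one private worker thread plus a request queue is enough to show the idea. `ActiveCounter` is a made-up example class; creating `Future` objects by hand is done here only to keep the sketch short.

```python
import queue
import threading
from concurrent.futures import Future

class ActiveCounter:
    """Active Object sketch: all state is touched only by the object's own thread."""

    def __init__(self):
        self._requests = queue.Queue()
        self._total = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:          # shutdown sentinel
                break
            amount, future = item
            self._total += amount     # no lock: only this thread mutates _total
            future.set_result(self._total)

    def add(self, amount):
        # Callers enqueue a request and get a Future back immediately.
        future = Future()
        self._requests.put((amount, future))
        return future

    def stop(self):
        self._requests.put(None)
        self._thread.join()

counter = ActiveCounter()
futures = [counter.add(n) for n in (1, 2, 3)]
print([f.result() for f in futures])  # [1, 3, 6]
counter.stop()
```

The caller never sees a lock: requests are serialized by the queue, and each result is delivered through its own future.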

      Multi-threaded programming is hard, very hard; but you're not the only one who thinks it's hard, and many researchers have formally described a bunch of rules you can follow... if you follow these rules, you often enough eliminate most of the more complicated problems.

      --
      - Leon Mergen
      http://www.solatis.com
    5. Re:it means a lot by moonbender · · Score: 2, Informative

      I'm not an Apple geek, but from what I read here, OS X itself makes use of AltiVec everywhere it makes sense. That's one application everyone will run 100% of the time. Also, Apple's libraries many/some applications use are optimised for AltiVec. From the sounds of it, AltiVec is used more than its x86 counterparts.

      --
      Switch back to Slashdot's D1 system.
    6. Re:it means a lot by Anonymous Coward · · Score: 2, Informative

      Recent versions of the GNU Compiler Collection, IBM Visual Age Compiler and other compilers provide intrinsics to access AltiVec instructions directly from C and C++ programs.

    7. Re:it means a lot by guitaristx · · Score: 5, Insightful

      As far as threading is concerned, one of the few languages I've dealt with that makes mutexes, semaphores, etc. easy to deal with is Java. Most other languages bury the stuff too deep into the proprietary APIs to make them useful. Consider multithreading in win32. We need better programming languages before we can ever start reaping the benefits of good multithreading hardware.

      Furthermore, we need to get rid of lazy programming. I'm tired of watching people write slow, lazy, inefficient (in terms of both memory space AND speed) code, and justify its existence with "it'll run fast on the new über-hyper-monkey-quadruple-bucky processors." Too many times, the problem is that you've got slow code running in every thread. If the code wasn't so damned lazy, programmers would care more about nifty new hardware. We're not even coming close to using our current hardware to capacity. I've got a 1.2GHz processor with 1024Mb of RAM, and my box chugs opening an M$ Word doc?! WTF?!

      <soapbox>
      Most programming in the world is very similar to the universal statu$ symbol in the U.S.A. - a big gas-guzzling SUV. It's not like Jane the Soccer Mom really needs 300hp to haul her kids and groceries around town. Similarly, we have lots of lazy code out there that doesn't do much of anything but consume resources and pollute the environment. A nifty new processor feature won't be noticed in the computing world because it won't get used anyway, just like Jane the Soccer Mom wouldn't notice 100 more horsepower. </soapbox>

      --
      I pity the foo that isn't metasyntactic
    8. Re:it means a lot by Homology · · Score: 2, Insightful
      As far as threading is concerned, one of the few languages I've dealt with that makes mutexes, semaphores, etc. easy to deal with is Java. Most other languages bury the stuff too deep into the proprietary APIs to make them useful. Consider multithreading in win32 [microsoft.com]. We need better programming languages before we can ever start reaping the benefits of good multithreading hardware.

      Pure bullshit.

    9. Re:it means a lot by MORTAR_COMBAT! · · Score: 3, Informative

      exactly, and unlike Altivec, there are no "special instructions" to get benefits from Niagara -- just synchronization, deadlock, and such parallel processing issues which most enterprise software is already aware of.

      (to dumb it down: no new opcodes, existing software will benefit if it can, break if it was poorly written to begin with.)

      --
      MORTAR COMBAT!
    10. Re:it means a lot by Gauchito · · Score: 2, Insightful

      Another problem with multi-threading is that nothing is a black box anymore (not like anything really is, anyway). Once you start worrying about sharing statics and globals, you need to consider all the accesses done by objects you bring in from other libraries, which means you need to check the source to see if, for example, it uses a static cache (with no locking). Then, you need to dig in, find out why you're getting seg faults or corrupted memory, track down where else you're using this class (could be, for example, inside another object entirely, which could again be from a different library), synchronize with that other thread that you thought was completely unrelated (i.e., used the same class, but had its own instance of it), rinse, and repeat.

      If you are using third party or homegrown (but not yours) libraries inside your multithreaded program, pretty soon you'll realize that not only do you need to know what and how they are accessing, but you also need to keep close tabs on what changes are done in future releases (again, keeping track of implementation details in those release). Your quick and highly parallelized threading program just became a maintenance nightmare.

    11. Re:it means a lot by fupeg · · Score: 3, Interesting
      As far as threading is concerned, one of the few languages I've dealt with that makes mutexes, semaphores, etc. easy to deal with is Java
      Umm, ok. Java has always made synchronization easy to use. It's never been particularly straightforward, because of Java's interpretive nature and all the wonderful JIT liberties allowed for JVMs. Just look at all the confusion around double-checked locking. JDK 1.5 is the first version of Java to formally expose semaphores. Now they are "easy" to use just like synchronization. Verdict is still out on how easy they are to understand.
      Furthermore, we need to get rid of lazy programming.
      Oh brother, here we go again. Let me guess, you could probably write a multi-threaded database server that supported fully ATOMIC operations and transactionality, would only need 4K of memory, and would be blazingly fast on a 486SX machine, right? Over-optimization pundits are the worst, even worse than design pattern pundits. This has been discussed many times before. Fast, buggy code has zero value.
  3. Multithreading? by PopeAlien · · Score: 4, Funny

    I dont mean to look a gift horse in the mouth..

    ..but wouldn't it be even better if it was hyper-multi-threading?

  4. stackless.. by joeldg · · Score: 4, Interesting

    this makes me wonder what the effect would be on something like stackless python?
    the whole state pickling concept is pretty cool, and kind of throws threads all over..

  5. Nothing new. by bigtallmofo · · Score: 3, Interesting

    This is Sun's Niagara Design. The more I learn about it, the more I think that it's nothing that exciting.

    From the lack of non-Sun-supplied buzz regarding this technology, it would appear that many people aren't finding it very exciting.

    --
    I'm a big tall mofo.
    1. Re:Nothing new. by zenslug · · Score: 3, Interesting

      The tech is actually pretty good, although it really depends on your application. If you want to run something single-threaded, then the Niagara chip is not going to impress you at all. The speed of the chip is not where its power is. Understand that the name is rather appropriate (i.e. like a river/waterfall): it is not very fast comparatively, but it can handle large volumes very well. Think massively multithreaded uses.

    2. Re:Nothing new. by SunFan · · Score: 3, Interesting


      What's not exciting about a 32-way single board computer? You don't have to program for it any differently than a 32-way SMP mainframe. Solaris does the rest for you.

      --
      -- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
    3. Re:Nothing new. by g0_p · · Score: 3, Informative

      Though in theory the Niagara design is another CMT implementation, it's the implementation that is the crux here. CMT theory has been worked on in academia for 6-8 years, I think.

      Here is a very informative article on the Niagara design.

      For the lazy some main points from the article.
      - The Pentium 4 is a single core dual threaded CMT implementation. The Niagara has 8 cores and each core is capable of executing 4 threads.
      - Depending on the model of the application that is executing, a programmer can choose to either utilize it as a single process with multiple threads each mapped on to a hardware thread or as multiple processes mapped to hardware threads. Apart from this, individual cores can also be assigned to an individual process, adding one more level of flexibility.
      - Sharing data between threads on the same core is an L1 read and is extremely fast. Sharing data among threads on separate cores is an L2 read (since L2 is shared among cores)
      - The new chip provides a lot of flexibility in terms of how the programmer wants to allocate hardware threads across software processes or threads. But it looks like programming on it will be difficult unless the operating system provides very good support for it.

  6. Same thing SMP and such has meant by Soong · · Score: 4, Insightful

    It means we're going to have to learn to program in parallel. We're going to have to parallelize our data processing and we're going to have to learn synchronization and locking methods.

    This is nothing new. The decreasing returns and impending limits of single-threaded processing have been looming for a long time now.

    --
    Start Running Better Polls
    1. Re:Same thing SMP and such has meant by SunFan · · Score: 2, Insightful

      It means we're going to have to learn to program in parallel.

      Not really. If you've been using SMP servers, what's different about SMP on a chip? Even if you only have a few dozen Apache processes running, Solaris will schedule them onto Niagara just like if you had lots of separate CPUs.

      I don't think this is as big a change as people think. The main advantage will be a super-efficient CPU (50 to 60 watts, IIRC) but with the performance of many regular CPUs (hundreds of watts).

      --
      -- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
    2. Re:Same thing SMP and such has meant by Bastian · · Score: 2, Insightful

      I imagine that multithreading is a situation where OOP finally begins to really shine, as the amount of code factoring involved would make it much easier to keep track of when and where you need to be frotzing with synchronization and locking.

      I also imagine that if you can try to line up thread boundaries with object boundaries, the task of avoiding race conditions becomes almost trivial.

      But then, I haven't done much serious multithreaded programming, so maybe I am missing the point. Someone set me straight.

    3. Re:Same thing SMP and such has meant by Homology · · Score: 2, Insightful
      Solaris treats each process as a single thread in the kernel. It makes little difference from a scheduling point of view whether you have a 32-thread application or a 32-process application, except the latter might consume more memory.

      With threads you have to synchronize access to common data that resides in the same memory address space. With processes you don't have to do this as they have their own copy of the data at fork.

  7. Re:i dont use multithreading by pclminion · · Score: 4, Insightful
    anything i write usually maxes out the processor at 100% for days at a time (i deal with huge data conversions) so yeah i'd also like to know: what does it mean to me?

    Well, if your data conversions are independent, multithreading might be of benefit to you if you have a hyperthreading processor.

    And are you sure you are maxing the processor? Surely you have to wait for disk or network, at least some of the time. If more than 10% or so (number pulled from ass but based on empirical observations) of your time is spent waiting for latent devices, you can benefit from multithreading even on a plain vanilla single CPU system with no hyperthreading.
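
A quick way to see the effect, sketched in Python with `time.sleep` standing in for the disk or network wait (an assumption; real I/O will behave less tidily). `fake_io`, `run_sequential` and `run_threaded` are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    time.sleep(delay)   # stands in for a disk or network wait; the CPU is idle here
    return delay

def run_sequential(delays):
    start = time.perf_counter()
    for d in delays:
        fake_io(d)
    return time.perf_counter() - start

def run_threaded(delays):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(fake_io, delays))
    return time.perf_counter() - start

delays = [0.05] * 4
print(f"sequential: {run_sequential(delays):.2f}s, threaded: {run_threaded(delays):.2f}s")
```

The waits overlap in the threaded run, so the gain shows up even on one CPU; no extra hardware threads are needed for that part.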

  8. Re:How is this different by wezelboy · · Score: 2, Informative

    Hyperthreading makes a single processor appear as multiple processors to the OS. The OS still has to do all of the loading and storing yadayada associated with threading. From what I gather, CMT handles the threading overhead in hardware for faster context switches. Sort of reminiscent of register windows on the SPARC chip.

  9. INKEY and TSRs by Anonymous Coward · · Score: 3, Funny

    Can I still use INKEY in my basic programs? Will multi-threading make it more efficient? Can I actually run a second program on my DOS PC without having to force it as a TSR?

  10. Re:i dont use multithreading by Fulcrum+of+Evil · · Score: 2, Insightful

    Well, if your data conversions are independent, multithreading might be of benefit to you if you have a hyperthreading processor.

    Unless the two execution states overflow your L1 cache, in which case a HT CPU could run slower.

    --
    "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
  11. marketing handwave by klossner · · Score: 2, Insightful
    "throughput has become more important than absolute speed in the enterprise"
    I've been seeing this quote in press releases for three decades. It has always meant "we can't compete on performance so we're going to explain why performance isn't important anymore." The few times my management bought that story, they came to regret it.
  12. Re:i dont use multithreading by darkain · · Score: 2, Informative

    it's a nice theory and all, but I'm not too sure about it. if done correctly on a single threaded system, if one thread is in a wait state waiting on disc activity, then the CPU should jump threads and handle other tasks in the meantime. there is more than enough RAM and CPU cache on modern computers to make this quite effective. Also, isn't this what DMA channels are for? Wasn't the purpose of a DMA channel to move data from one location to another while the CPU is performing other tasks? .... this is actually getting back to programming at the hardware level of a 386, it's nothing new.

  13. Thruput ... by foobsr · · Score: 2, Funny

    Throughput computing maximizes the throughput per processor and per system. So a processor with multiple cores will be able to increase the throughput by the number of cores per processor. This increase in performance comes at a lower cost, fewer systems, reduced power consumption, and lower maintenance and administration, with increase in reliability due to fewer systems. (from TFA, emphasis mine)

    So it seems they invented a way to linearly scale performance. WOW! But maybe I misunderstood and the thing is over my head.

    CC.

    --
    TaijiQuan (Huang, 5 loosenings)
  14. Efficiency and latency are mutual tradeoffs by squarooticus · · Score: 3, Interesting

    Not sure I buy that this "increases a processor's efficiency as wait latencies are minimized". It seems to me that decreasing latency reduces efficiency because you spend a greater percentage of your cycles changing state (overhead) instead of doing useful work. This is why realtime OS'es aren't the norm: they reduce latencies to critical maximums, but at the cost of overall throughput.

    --
    [ home ]
    1. Re:Efficiency and latency are mutual tradeoffs by farnz · · Score: 2, Informative

      A big webserver or database server is a highly parallel, memory-latency-bound system; each request is an individual thread, and in most database and web servers, locks are finegrained enough to allow many requests to proceed in parallel, subject to them being able to retrieve the data from RAM or disk in a timely fashion.

    2. Re:Efficiency and latency are mutual tradeoffs by farnz · · Score: 2, Insightful
      A processor's wait latency is the time it spends doing absolutely nothing while it waits for an external device to catch up. If your RAM latency is around 100 cycles, and context switching costs you 100 cycles, you're right in saying that efficiency goes down. On the other hand, if each context switch costs you 10 cycles, you can context switch nine times before you've started to lose efficiency.

      Sun are putting in hardware to ensure that context switches are fast (possibly even one or two cycles); hopefully, this will result in the context switches costing less than waiting for memory accesses, and speed up the throughput of the system as a whole. So, benchmarking one thread of execution will show a slow system, whereas a group will hopefully show a big speedup.
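
The arithmetic can be written down as a toy model in Python. The cycle counts are the illustrative figures from this comment, not measurements, and the function names are made up.

```python
def stall_cost(ram_latency, switch_cost):
    """Cycles lost per memory stall if the core may switch threads instead of waiting.

    Toy model: the core picks whichever is cheaper, waiting or switching.
    """
    return min(ram_latency, switch_cost)

def break_even_switches(ram_latency, switch_cost):
    # How many switches cost as much as sitting out one full memory stall.
    return ram_latency // switch_cost

for switch_cost in (100, 10, 2):
    print(switch_cost, stall_cost(100, switch_cost), break_even_switches(100, switch_cost))
# 100 100 1
# 10 10 10
# 2 2 50
```

With a 100-cycle stall, ten 10-cycle switches break even, so anything up to nine is a net win; at one or two cycles per switch the stall is almost entirely hidden.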

    3. Re:Efficiency and latency are mutual tradeoffs by Relic+of+the+Future · · Score: 2, Informative
      Actually, that's the whole point of this technology: there is no expensive context switch between threads. The processor goes along, issuing instructions from several threads, and when it gets a cache miss for one of the threads, it just keeps chugging along, issuing instructions from the other threads.

      Skimming the article, it doesn't even seem this processor bothers with out-of-order execution or register renaming; if it stalls, it just starts issuing from a different thread.

      --
      Those who fail to understand communication protocols, are doomed to repeat them over port 80.
  15. What DOES it mean to me? by pla · · Score: 5, Insightful

    It means "Difficult to reproduce bugs".

    It worries me how many people just say "it means faster programs and doesn't take much more work". That mindset leads to lazy programmers who A - Can't optimize to save their jobs; and B - Don't actually understand what multithreading really does.

    If you consider it easy, you've either just thrown great big global locks on most of your code, in which case your code doesn't actually parallelize well; or you've written what I refer to in my first sentence - bugs that take an immense effort just to reproduce, never mind track down and fix.

    1. Re:What DOES it mean to me? by bjsyd70 · · Score: 2

      If the tasks are unrelated and share no data, then you can get them to run in parallel with only a small decrease in reliability/increase in cost. This is a typical case for a web application serving multiple independent clients (you are reading your mail, and I am reading mine).

  16. Not exactly the same. by 1010011010 · · Score: 5, Informative

    1.3 Simultaneous Multi-Threading

    Simultaneous multi-threading [15],[16],[17] uses hardware threads layered on top of a core to execute instructions from multiple threads. The hardware threads consist of all the different registers to keep track of a thread execution state. These hardware threads are also called logical processors. The logical processors can process instructions from multiple software thread streams simultaneously on a core, as compared to a CMP processor with hardware threads where instructions from only one thread are processed on a core.

    SMT processors have a L1 cache per logical processor while the L2 and L3 cache is usually shared. The L2 cache is usually on the processor with the L3 off the processor. SMT processors usually have logic for ILP as well as TLP. The core is not only usually multi-issue for a single thread, but can simultaneously process multiple streams of instructions from multiple software threads.

    1.4 Chip Multi-Threading

    Chip multi-threading encompasses the techniques of CMP, CMP with hardware threads, and SMT to improve the instructions processed per cycle. To increase the number of instructions processed per cycle, CMT uses TLP [8] (as in Figure 6) as well as ILP (see Figure 5). ILP exploits parallelism within a single thread using compiler and processor technology to simultaneously execute independent instructions from a single thread. There is a limit to the ILP [1],[12],[18] that can be found and executed within a single thread. TLP can be used to improve on ILP by executing parallel tasks from multiple threads simultaneously [18],[19].

    --
    Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
  17. Hyperthreading by Dominic_Mazzoni · · Score: 2, Interesting

    As many others have already pointed out, Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about.

    I was skeptical at first, and read some of those articles showing that some applications could actually run slower. But then I tried it for myself, and I have to admit I've been impressed. My main box is a dual-Xeon, each with Hyperthreading turned on. It appears to Linux as if I have four independent CPUs. A few numerical tasks saturate the processors if I have just two of them running in parallel, but several tasks do fine with four or more copies. My favorite is "make -j 4" - starting four gcc processes in parallel works surprisingly well. How long does it take you to compile the Linux kernel?

  18. In most cases by Z00L00K · · Score: 5, Informative
    multithreading hardware will not mean much, but in some cases it may mean a lot for performance. (By "most cases" I mean users running Word/Excel/PowerPoint and the like.)

    The real issue is how large each thread can be (in the matter of memory) before it has to access data that is external to the thread. It may mean a lot for gamers running close to reality games and also for those that are doing massive calculations.

    The important thing is that developers have to be aware of the possibilities and limitations of this technology. Otherwise it would be like throwing a V8 into a T-Ford. It is possible, but you would never be able to utilize the full power.

    Another thing is that today's programming languages are limited. C (and C++) are advanced macro assemblers (not really bad, but they demand a lot of the programmer). Java has thread support, but it's still the programmer (in most cases) who has to decide. Java is not very efficient either, though that of course depends on the platform it's running on and on general optimizations. C# is Microsoft's bastard of Java and C++ with the same drawbacks as Java.

    There are other languages, but most of them are either too obscure (like Erlang or Prolog) or too unknown.

    The point is that a compiler should be able to break out separate threads and/or processes whenever possible to improve performance. It is of course necessary for the programmer to hint to the compiler where it may do this and where it shouldn't, but otherwise the programmer should stay blissfully unaware of the details. The details may depend on the actual system where the application is running: if the system is busy serving a bunch of users, then splitting the application into a bunch of threads is not really what you want, but if you are running alone (or almost alone), then the application should be permitted to allocate more resources. The key is that the allocation has to be dynamic.
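
    As a rough illustration of that kind of dynamic allocation, here is a minimal Python sketch that sizes a worker pool from the CPU count and the current load. The function name and the "idle CPUs" heuristic are assumptions made up for this example, not something the comment or any compiler prescribes.

```python
import os


def pick_worker_count(cpu_count, load_average, minimum=1):
    """Return how many worker threads to start, given the current load.

    Hypothetical heuristic: use roughly the number of CPUs that are
    not already busy, but never fewer than `minimum`.
    """
    idle_cpus = cpu_count - int(round(load_average))
    return max(minimum, idle_cpus)


def current_worker_count():
    """Apply the heuristic to the machine this runs on."""
    cpus = os.cpu_count() or 1
    try:
        load = os.getloadavg()[0]  # 1-minute load average (Unix only)
    except (AttributeError, OSError):
        load = 0.0  # no load figure available: assume an idle machine
    return pick_worker_count(cpus, load)
```

    Calling current_worker_count() just before starting a batch of work would let the same binary use many threads on a quiet box and few on a busy multi-user one.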

    Anybody knowing of any better languages?

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
  19. Comment removed by account_deleted · · Score: 2, Informative

    Comment removed based on user account deletion

  20. CSP and libthread by CondeZer0 · · Score: 4, Informative

    This is what it means for me: http://www.cs.bell-labs.com/who/rsc/thread/

    Also see Brian W. Kernighan's "A Descent into Limbo" and Dennis M. Ritchie's "The Limbo Programming Language".

    And of course Hoare's classic: Communicating Sequential Processes.

    Now you can enjoy the power and beauty of the CSP model in Linux and other Unixes thanks to plan9port including libthread and Inferno; yes, it's all Open Source.

    --
    "When in doubt, use brute force." Ken Thompson
  21. Old idea, but there are many ways to implement by jd · · Score: 2, Informative
    I designed pretty much the same concept in the OpenCore "FLOOP" project I'm working on. You just have an array of registers, such that each thread's state is maintained and is directly referrable.


    Actually, the "best" way to implement the design is to split the thread state from the processing elements, then use locking on the elements. If two threads use independent processor elements, they should be simultaneously executable.


    By having many instances of the more common processing elements, you would have many of the benefits of "multi-core" (in that you'd have parallel execution in the general case) but the design would be much simpler because you're working at the element level, not the core level.


    Yes, none of this is really any different from hyperthreading, multi-core, or any other parallel schemes. All parallel schemes work in essentially the same way, because they all need to preserve states and lock resources.


    Personally, I think REAL Parallel Processing CPUs that can handle multiple threads efficiently are already well-enough understood, they just have to become reasonably mainstream.


    For myself, I am much more interested in AMD's HyperTransport bus technology, which looks like it could supplant most of the other bus designs out there.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  22. My J2EE Application will FLY by PHAEDRU5 · · Score: 3, Insightful

    Since I mostly work on J2EE stuff, I let the container take care of the threading for me. The one exception is J2EE Connector Architecture (JCA) bits that use the work manager. Even there, however, most of my work is simply putting a thin JCA layer in place between the outside world and the J2EE stack.

    For me, these new chips simply mean increased performance for deployed apps, without any modification to the app code.

    Beauty!

    --
    668: Neighbour of the Beast
  23. This is just Multi-core processing... by mzito · · Score: 5, Interesting

    CMT is nothing more than multi-core processors. Sun is using the marketing idea of CMT to hide the fact that the UltraSparc IV is nothing more than two UltraSparc III cores on one chip.

    One way to look at this is Sun maximizing their existing engineering efforts. However, by marketing it as some revolutionary feature advance, they're implying that they've done something new and exciting, as opposed to something that IBM is already doing and AMD and Intel are working on.

    Beyond that, Sun and Fujitsu have a co-manufacturing and R&D deal now, confirming something those in the enterprise space have been saying for a long time - Fujitsu was making better Sun servers than Sun.

    Plus Sun killed plans for the UltraSparc V, leaving only the Niagara. They have the Opteron line pushing up from below, and rapidly evaporating sales at the high end. They're resorting to marketing gibberish to add new features to the product line, while simultaneously offloading R&D and manufacturing to a partner.

    Remind me again why Sun is in the hardware business?

    Thanks,
    Matt

    --
    me@mzi.to
    1. Re:This is just Multi-core processing... by mzito · · Score: 4, Informative

      And actually, this makes me so grumpy that I forgot the whole other piece.

      Despite the fact that Sun markets the UltraSparc IV as a single processor, software licensors like BEA and Oracle require that you license their software PER CORE. This means that a "4 processor" UltraSparc IV requires 8 processor licenses for Oracle or Weblogic.

      Sun never tells you this, and consequently a lot of people suddenly get tagged with additional licenses if they get audited. BEYOND that, Sun tells people that they can "double their performance" by replacing all of their UltraSparc IIIs with UltraSparc IVs, not explaining that they are doubling their performance because they're doubling the number of processors, AND that doing that upgrade can put them on the hook for literally hundreds of thousands of dollars in software cost.

      We've seen a number of companies get bitten by that, and it is downright disingenuous of Sun.

      Thanks,
      Matt

      --
      me@mzi.to
    2. Re:This is just Multi-core processing... by philipgar · · Score: 4, Informative

      You miss one of the major points in the article, and that is that CMT is not really about the Ultra IV being a fully CMT processor. This is about the Niagara chip. The Niagara chip is truly a CMT processor.

      The reason this is so is because it functions as both a chip multi-processor and as a multi-threaded core (although I think I'd consider their multi-threaded cores to be fine-grained multi-threading rather than SMT, but that's a different story altogether). While IBM's POWER5 offers these same advantages (dual core, 2-way SMT cores), this is 4 threads per processor and not overly impressive.

      The Niagara chip, in comparison to IBM's (and upcoming Intel dual-core/SMT) designs, is based on the assumption that at higher clock speeds the CPU is rarely fully utilized (while the P4 can retire up to 3 instructions per cycle, many apps, particularly data-intensive apps, have an IPC of less than 1). The chip contains 8 cores with 4 threads being executed on each core. This means 32 threads can run concurrently. Sure, no single thread will run as fast as it would on a NetBurst, Athlon 64, or POWER chip, but the combined throughput is enormous. Assuming each thread runs at ~1/4 the speed of its counterpart, that still gives us the equivalent of 8 full-speed threads on a single chip. This is enormous, and will have a major impact on database design (I'm currently doing research on SMT's effect on database algorithms) and the payoffs can be great (as can standard prefetching).
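
      The back-of-the-envelope arithmetic above can be written down as a tiny Python sketch. The 1/4 relative speed is the guess stated in the comment, not a measured figure, and the function name is made up for illustration.

```python
def aggregate_throughput(cores, threads_per_core, relative_thread_speed):
    """Return (hardware threads, throughput in 'fast-thread equivalents').

    relative_thread_speed is the speed of one hardware thread relative
    to a single thread on a fast conventional core (1.0 = same speed).
    """
    hardware_threads = cores * threads_per_core
    return hardware_threads, hardware_threads * relative_thread_speed
```

      With 8 cores, 4 threads per core and a relative speed of 0.25, this gives 32 hardware threads and the throughput of 8 full-speed threads, matching the estimate in the text.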

      I wouldn't recommend writing off CMT as a marketing buzzword etc. The era of throughput computing is upon us; let's just hope Oracle and the other per-processor vendors change their licensing to something that correlates with TPC performance or some other metric that still has meaning. Otherwise companies are better off with a couple of massively parallel single-core chips that cost a whole lot more and draw a whole lot more power for the performance they produce.

      Phil

    3. Re:This is just Multi-core processing... by 0xABADCODA · · Score: 3, Informative

      Sun's CMT is modeled after Tera's MTA architecture (now named Cray again), which trades memory latency for throughput. Basically, in MTA (multithreaded architecture) each of 128 processor threads issues a few memory fetch instructions and waits for the memory to arrive (dozens to hundreds of cycles). This happens for every thread, so the effect is that memory fetches and execution time are decoupled... in other words, time=max(execution,fetch) vs. time=execution+fetch on normal processors. This also makes having a pipeline irrelevant, so no effort is wasted on branch prediction.
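
      The two timing formulas can be compared with a couple of lines of Python. This is only a sketch of the idealised model in the comment; the cycle counts in the usage note are hypothetical.

```python
def serial_time(execution_cycles, fetch_cycles):
    """Conventional model: the core stalls while memory is fetched."""
    return execution_cycles + fetch_cycles


def overlapped_time(execution_cycles, fetch_cycles):
    """Idealised multithreaded model: other threads execute during the
    fetch, so total time is bounded by the slower of the two."""
    return max(execution_cycles, fetch_cycles)
```

      For example, with a made-up 100 cycles of execution and 300 cycles of fetch, serial_time gives 400 cycles while overlapped_time gives 300, so the gain depends on how evenly the two are balanced.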

      That's great for scientific apps since they are massively parallel... Sun has taken the same idea and scaled it down to 4 overlapping threads so normal applications can benefit. While it can be used to run 4 separate process threads at a time, at least the MTA's is fine-grained so that what really happens is that the compiler changes a for (;;i++) loop into four for (;;i+=4) loops and runs them in parallel.
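
      The loop transformation described here can be sketched by hand in Python. This only shows the shape of the rewrite (one loop with stride 1 becomes four interleaved loops with stride 4); it is not how the MTA or Sun compilers work, and in CPython the threads will not actually run the arithmetic in parallel because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def strided_sum(data, start, stride):
    """Sum data[start], data[start + stride], data[start + 2*stride], ..."""
    total = 0
    for i in range(start, len(data), stride):
        total += data[i]
    return total


def parallel_sum(data, ways=4):
    """Split one loop over `data` into `ways` interleaved loops."""
    with ThreadPoolExecutor(max_workers=ways) as pool:
        partials = pool.map(
            strided_sum,
            [data] * ways,   # every loop sees the same array
            range(ways),     # starting index 0, 1, ..., ways - 1
            [ways] * ways,   # each loop steps by `ways`
        )
        return sum(partials)
```

      Because the four loops touch disjoint indices, no locking is needed; the only coordination is adding up the partial results at the end.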

      This technology done right means a massive performance boost (as in like 25-50%) while also simplifying the processor. Contrast that with Hyperthreading, which complicates the processor and only gets ~5-8% benefit on average... it's mostly designed to minimize context switch times.

    4. Re:This is just Multi-core processing... by mzito · · Score: 2, Insightful


      Oracle's been talking about reworking their licensing for a long time, and I agree licensing by core is sub-optimal. However, Oracle is being forthright that they charge by core, while Sun is _hiding_ the fact the USIV _is_ a multi-core processor.

      Sure, Oracle are the ones charging per processor core, but Sun is the company that is selling this upgrade as a painless, cost-effective way to upgrade their infrastructure. I firmly believe they are being negligent in not warning customers that this is a multi-core architecture - if you go to Sun's site and look at how it's sold, they pitch it as one processor, one core.

      Imagine you're a customer - you spend $100k on Sun's new processors as a "painless" 1-1 upgrade, and suddenly find out that the first 100k has put you on the hook for 150k in new licenses. Wouldn't you feel like you'd been misled?

      Thanks,
      Matt

      --
      me@mzi.to
  24. way to get it wrong by CaptainPinko · · Score: 5, Insightful

    As many others have already pointed out, Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about.

    As many others know, you know exactly nothing about what you are talking about. HT has basically two sets of registers so that during a cache miss which would cause a bubble the chip switches to the other set so it doesn't sit idle. Sun's chips on the other hand actually have multiple cores physically doing work at the same time. In fact, were it not for Intel's hideously flawed NetBurst architecture, the hideous hack that is HyperThreading would not provide any performance increase at all (in fact it doesn't so much provide an increase as negate a decrease...). For evidence consider how many Pentium Ms have HT on them... Now I may not be fully correct but I didn't volunteer a comment; I only posted to prevent the misinformation of others. You'll find more on ArsTechnica. I'd link to the article but I can't find anything on their redesigned site.

    --
    Your CPU is not doing anything else, at least do something.
    1. Re:way to get it wrong by at_18 · · Score: 2, Interesting

      As many others know, you know exactly nothing about what you are talking about.

      Dude, you don't know anything either. P4's hyperthreading is a two-threads implementation of Simultaneous multithreading. Niagara is an 8-way multiprocessor on a chip, and each processor has four-way simultaneous multithreading, exactly like the P4, just with more threads.

      Regarding the number of concurrent threads, it's basically equivalent to a 16-way Xeon server with hyperthreading enabled, but with much faster inter-processor communication (since it's all inside the same chip), and of course much lower cost, heat dissipation, etc.

    2. Re:way to get it wrong by Lemming+Mark · · Score: 2, Informative
      HT has basically two sets of registers so that during a cache miss which would cause a bubble the chip switches to the other set so it doesn't sit idle.

      I'm sorry but that's not correct. What you refer to is known as "Fine Grained Multithreading", or the product name "Superthreading".

      Intel's product "hyperthreading" is also known as "simultaneous multithreading" and is able to run multiple instruction streams simultaneously in order to maximise use of the functional units when they are not saturated by a single instruction stream. This is in addition to avoiding complete stalls on cache misses.

  25. Re:Hyperthreading by PitaBred · · Score: 2, Interesting

    Try make -j5 or -j6. Tends to have better results than the -j4 on my dual Xeon rig. And yes, I have benchmarked it.

  26. Re:i dont use multithreading by fitten · · Score: 2, Interesting

    Cool, we did a bunch of research back in the mid 90s using MPI and published some papers about threaded communications and the like inside of MPI implementations. Also, it was common practice back on the i860 paragons with two or three processors per node to devote one of the CPUs totally to communications while the other cranked away.

    Also, be careful that you take the working set into consideration. Suppose you had one processor with 1M L2 cache but your problem needed 1.5M data to work on. It runs at around main memory speeds. However, take two such processors and (if you can) divide the data in half, each processor can now fit all of its data inside L2, which runs at L2 speeds. You can see superlinear speedups that way too.

    However, what you are saying is pretty much right on... communication overhead is almost all integer work so if you have an FPU compute thread going on and have the communications offloaded to a thread, those two things should play quite nicely on hyperthreaded Intel parts. This is even cheaper than the other past solutions of burning an entire CPU for communications while the other does computation.

  27. Re:i dont use multithreading by ComputerSlicer23 · · Score: 2, Informative
    if one thread is in a wait state waiting on disc activity, then the CPU should jump threads

    You're using the wrong word in there. Where you use the word "thread", you should be using the word "process" in UNIX parlance. What you are describing is "multi-tasking" in roughly a generic sense. It wasn't invented with the i386, try sometime in the 1960's (I'd have to crack out an OS book to be sure of the date).

    Threads are different than processes.

    Fundamentally, the standard definition of a thread is: "Separate control of CPU, with the same VM space". Essentially, it's two processes that have precisely the same memory mapped. (I'm sure there are lots of details I just glossed over, but essentially, that's it). One thing this leads to is blowing the L1 cache if you have several threads interacting on the same pieces of memory.

    Threads have lots of performance problems, but they also greatly simplify programming as you pay a lot less attention to the "shared memory" aspect of it. You add some locking, and then essentially multiple threads of execution can work on the same bits of memory.
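
    A minimal Python sketch of "add some locking, and multiple threads can work on the same bits of memory": several threads increment one shared counter, with a lock around each update. The function and variable names are invented for the example.

```python
import threading


def run_counter(num_threads=4, increments=10000):
    """Have several threads increment one shared counter under a lock."""
    counter = {"value": 0}   # memory shared by every thread
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:       # only one thread updates at a time
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

    Without the lock, the read-modify-write on the counter could interleave between threads and lose updates, which is exactly the kind of bug that shared memory makes easy to write.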

    However, you can do roughly what you described at the process level. Apache used to have a significant patch for "State Threads", that used various OS primitives to tell if an OS call would be blocking or not. If it wasn't going to be, it would make the call. If it was, it'd move on and see if there was more interesting work it could do rather than blocking.
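
    The "check whether the call would block, and only make it if it won't" idea can be sketched in Python with select() and a zero timeout. This is a generic illustration of the pattern and makes no claim about how the State Threads patch itself was implemented; read_if_ready is a made-up helper name.

```python
import select
import socket


def read_if_ready(sock, max_bytes=1024):
    """Read from `sock` only if doing so would not block.

    Returns the bytes read, or None if no data is waiting, in which
    case the caller is free to go and do other work instead.
    """
    readable, _, _ = select.select([sock], [], [], 0)  # 0 = poll, don't wait
    if readable:
        return sock.recv(max_bytes)
    return None
```

    A single process can loop over many sockets this way, servicing whichever ones are ready and never sitting idle inside a blocking read.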

    Threads are a huge performance win, because any time you have multiple independent tasks they can be performed in parallel, assuming you have enough CPU units. The problem is that most computing problems have some areas that can be worked on separately, and other areas where the threads have to sync up and work serially through some portions. At some point, the overhead of spawning threads and syncing makes it a losing proposition. However, for a lot of things, it's the obvious way to speed up performance (in GUI applications, it's nice to have one thread that works at keeping the screen updated while another is fetching data to display on the screen, thus avoiding applications that feel non-responsive because the window won't refresh for long periods of time while data is being fetched).
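
    The limit imposed by the serial portions is usually formalised as Amdahl's law, which the comment does not name but describes in spirit. A small Python sketch, ignoring thread spawn and sync overhead entirely:

```python
def amdahl_speedup(serial_fraction, processors):
    """Ideal speedup when `serial_fraction` of the work cannot be
    parallelised and the rest is spread evenly over `processors`."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

    If half of the work is serial, two processors give a speedup of only about 1.33, and no number of processors can ever push it past 2.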

    This sounds roughly like they are adding hardware support for threading, just like the TLB hardware got added to make VM run at a sane speed. Fundamentally, we have this cool stuff we do in software that sucks speedwise because the hardware is bad at X.

    Kirby

  28. Sun's new chips by Anonymous Coward · · Score: 3, Interesting

    Sun's upcoming "Niagara" chips are supposed to have eight cores, each core being able to execute four threads. So that allows up to 32 threads executing at once -- on one physical chip.

    And we're not talking about "HyperThreading" where one of the CPUs is virtual. It's a real execution unit.

    And Intel and AMD are talking about dual-cores?

    This should help save space and energy (both in the power needed to run the box, and in running the cooling system).

  29. I buy the throughput / speed argument, but... by PotatoHead · · Score: 3, Informative

    there are still some applications where raw CPU speed matters.

    We have been at the "throughput is good enough" point for several years. In truth, this is old news really. I've got IRIX servers doing lots of things plenty fast, clipping along at a brisk 400MHz. There is not much you can't do with that, particularly when running a nice NUMA box.

    I assume the same holds true for SUN gear. (I think their NUMA performance is a bit lower than the SGI, but I also don't think it matters for a lot of enterprise stuff.)

    One application I have running, NUMA style, is MCAD. It's cool in that I have one copy of the software serving about 25 users, running on a nice NUMA server that never breaks. Admin is almost zero, except for the little things that happen from time to time --mostly user related.

    However, I'm going to have to migrate this to a win32 platform. (And yes, it's gonna suck.) Why? The peak CPU power available to me is not enough for very large datasets and I cannot easily make the data portable for roaming users. (If there were more MCAD on Linux, I could do this, alas...)

    Love it or hate it, the hot running, inefficient Intel / AMD cpu delivers more peak compute than any high I/O UNIX platform does. And it's cheap.

    Sun is stating the obvious with the whole I/O thing, IMHO. In doing so, they avoid a core problem; namely, peak compute is not an option under commercial UNIX that needs to be. (And where it is, there are no applications, or the cost is just too high...)

    This is where Linux is really important. It runs on the fast CPU's, but also is plenty UNIXey to allow smart admins to capture the benefits multi-user computing can provide.

    Linux rocks, so does Solaris, IRIX, etc... The difference is that I can get IRIX & solaris applications.

    WISH THAT WOULD CHANGE FASTER THAN IT CURRENTLY IS.

  30. Re:Hyperthreading by SunFan · · Score: 4, Interesting


    "Intel has had Hyperthreading available in Pentium 4 and Xeon CPUs for a couple of years now, which does exactly what the article is talking about"

    You are wrong. Period. Sun's CMT is several independent CPU cores on the same die with a huge bandwidth interconnect on-die. Intel's Hyperthreading is a gimmicky technology that has a very small real-world impact on performance.

    And your personal "benchmarks" cite no numbers. I be trolled!

    --
    -- Microsoft is the most expensive commodity operating system and office suite vendor in the marketplace.
  31. what MT means to developers by fred+fleenblat · · Score: 3, Interesting

    While conceptually unrelated, I put threads into the same mental category as untyped pointers. They are extremely powerful, but a complete PITA to debug if anything goes wrong, even more so if you are maintaining someone else's void* or pthread_create filled application.

    What I've always done is code extremely defensively:
    1. make the various threads data-independent enough to be free-running and only co-ordinate at the start and finish of a thread's activity. If necessary, re-architect everything in sight to make this possible.
    2. when interaction is required, get a nice big coarse-grained lock and do everything that needs to be done and get it over with. profile it; there's a good chance it'll be over with quickly enough that it won't erase gains from parallelism or at least you can see what's taking so long and move it outside the lock.
    3. do TONS of load testing with lots of big files and random data. thread-related bugs can often hide for years in your code. Unlike divide by zero or null pointer references, a thread bug won't necessarily give any kind of hardware fault or exception. You have to go hunt for the bugs, they won't just pop up and say hi here i am.
    4. If you have multiple people of various technical abilities working on the code, you should add a grep/sed script to your makefile to check for accidental introduction of mt-unsafe library calls (strtok, ctime, etc). Flag new monitors and locks for review. Warn about dumb things like using static or global variables.
    5. Last trick is to use a layer to allow your program to be compiled for fork/wait, pthread_create/pthread_join, or just plain old co-routine execution (esp if there is a socket you can set to non-blocking). In addition to being able to test your code for correctness in various situations, you also have a baseline to see if the multithreading is an actual improvement.
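
    Point 4 could be done with grep/sed as suggested; the same check is sketched here in Python so it can be tested. The list of function names is illustrative only (all five are non-reentrant C library calls), and find_mt_unsafe_calls is a hypothetical helper, not an existing tool.

```python
import re

# Illustrative, incomplete list of non-reentrant C library functions.
MT_UNSAFE_CALLS = ("strtok", "ctime", "asctime", "gmtime", "localtime")

# Match the bare name followed by "(", so strtok_r( and ctime_r( pass.
_PATTERN = re.compile(r"\b(" + "|".join(MT_UNSAFE_CALLS) + r")\s*\(")


def find_mt_unsafe_calls(source_text):
    """Return a list of (line_number, function_name) for each suspect call."""
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits
```

    A makefile rule could run this over every changed source file and fail the build when the list is non-empty. It is a plain text match, so it will also flag names inside comments and strings.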

    With the obvious exceptions for embarrassingly parallel algorithms, I've found that humdrum client/server or middleware stuff:
    (a) gets only marginal gains from multithreading
    (b) you have to work for it--profiling and tuning are still required to get top-notch performance
    (c) efficient scaling beyond a handful of threads is the exception not the rule. If you have more threads than CPU's, it's a simple fact that some of them are going to be waiting and then your scaling is done.

  32. Re:Wrong. by pclminion · · Score: 2, Insightful
    Are you being purposefully dense?

    When a person says something, the intended meaning is not ambiguous (unless you are a poet), although the words used to describe that meaning may be.

    In this case it was intended to mean "What does it mean" and absolutely nothing else, your grammatical writhings notwithstanding.

  33. Like hell it is by turgid · · Score: 5, Informative
    From the lack of non-Sun-supplied buzz regarding this technology, it would appear that many people aren't finding it very exciting.

    More like none of Sun's competitors have anything which comes remotely close.

    Notice how nearly a year after Sun announced this, intel finally admitted that clock frequency (i.e. gigahertz) isn't everything and that they'd be bringing out dual core processors?

    Niagara has 8 cores each capable of 0-clock cycle latency switching between 4 different thread contexts.

    Who else has working hardware and an OS to go that can do this?

  34. Complementary concepts by WebMink · · Score: 2, Insightful

    Since the Pentium 4 according to Intel, but it's not a good question as that's Intel's trademarked term for their two-thread implementation of simultaneous multithreading:

    Simultaneous multithreading allows multiple threads to execute different instructions in the same clock cycle, using the execution units that the first thread left spare.

    By contrast, Niagara is implementing Chip-level multiprocessing:

    CMP is SMP implemented on a single VLSI integrated circuit. Multiple processor cores (multicore) typically share a common second- or third-level cache and interconnect.

    In other words, Niagara implements in hardware, at greater scale, what Pentium 4 offers as an emulation feature. In theory one could SMP on top of CMP chipsets for even greater throughput. If you find the Sun article too hard, the Wikipedia references I have cited will probably prove much easier to understand.

    1. Re:Complementary concepts by WebMink · · Score: 2, Insightful
      Actually the wikipedia article doesn't say that. It says:
      Sun Microsystems, in contrast, considers its UltraSPARC IV to be a multi-threaded rather than multi-processor chip. Intel agrees with Sun. This is not an idle debate, because software is often more expensive when licensed for more processors.

      Sun refers to the architecture as Chip-level Multi-Threading (CMT) and according to the white paper, while there are indeed multiple cores, each can also multi-thread:

      Sun's CMT processors will also have multiple cores on a single piece of silicon, with each core being able to process multiple threads, as shown in Figure 1.5. As a result, a single CMT processor will be able to process tens of threads simultaneously, exponentially increasing the amount of data processed each second.

      Cache also seems to have been considered:

      Shared chip resources such as large amounts of cache are designed to speed communications between cores to streamline parallel processing of threads.

      So while breathless enthusiasm may not be in order, a certain level of optimism seems warranted :-)

  35. Ada by Zygfryd · · Score: 2, Informative

    In fact I do know a better language, Ada95/2005.
    It's simply meant for threading and unconventional compiler optimizations (through the enforcement of constraints), while still being imperative and having a familiar syntax. And it's meant to be compiled unlike Java.
    Here's a site about Ada and here's another one.
    A good (alas not perfect) Ada95 compiler is included in GCC 3.4.

    So aye, we are ready for the CMT systems.

  36. Games in general would *LOVE* this if done right. by phorm · · Score: 3, Interesting

    Actually, when you think about it an improved threading model would actually strongly benefit well-programmed games. Why? Because there are a lot of semi-related processes occurring. Sound, graphics, physics, etc etc... they're all part of the game but work in very different ways.

    Now if you're working with a multithreaded CPU, one processor can be handling your CPU-bound graphics work (much of this is handed off to the video card anyhow), another can be doing sound/surround mixing, etc.

    In an FPS with complicated AI, you could theoretically hand that off to CPU #2 while #1 is handling different things. Your graphics engine might not have ugly-mofo-alien #235 onscreen to render, but meanwhile he's watching you and looking for a boulder that will offer him good cover to snipe you from instead of just sitting like a drone waiting for a computer-accurate headshot.

    Now let's say that PCs are going multi-CPU. Maybe you don't need a single superpowerful processor, just a videocard and a few lower-powered processors. Processor #1 is handing off the environmental data, #2 is prepping it for rendering and shovelling your GPU full of vertices, #3 is playing pinpoint surround for that cricket chirping behind the rock on your far left, and #4 is doing AI for ugly alien mofo #287.

    When I think about how games are advancing a lot can come down to interprocess communications and/or bandwidth limitations. The GPU still handles much of the video stuff so your CPU isn't really a bottleneck there in many cases, but as internet connections speed up then you're going to have MMORPGs, FPS's, and more chock full of "actors" that make up sight, sound, physics, and AI that could very well benefit from more CPU's rather than extra ticks on your overclocked single processor.

    After all, eye-candy is only a part of realism. True realism is also very much about a multitude of things happening at once.

  37. Re:Hyperthreading by BigZaphod · · Score: 2, Insightful

    Make a comment and ask a question and get marked as troll.

    Go figure.

  38. Re:FTA/RAIABSF instead of FT/RAID? by SuiteSisterMary · · Score: 2, Insightful

    Ah, but when you have one physical 'chip' that actually consists of four processor cores, you *can* do four simultaneous tasks on one processor.

    The advantage over good old fashioned SMP? Well, probably the interconnect is way faster, and if the cores all share some cache or something, sibling threads should see some benefit.

    --
    Vintage computer games and RPG books available. Email me if you're interested.
  39. HyperRAM Technology! by MrNybbles · · Score: 3, Interesting

    I think the most interesting part of the article was when it said "Processor speed has increased many times -- it doubles every two years, while memory is still very slow, doubling every six years."

    So maybe it would be more efficient for people to stop screwing around with new processor design ideas for a while and put a little effort in doubling the speed of memory access (and I don't mean by using level whatever caches). Selling motherboards with a faster memory bus would be easy, just give it a cool sounding name kind of like Sega's "Blast Processing". Let's call it "HyperRAM Technology!"

    --
    Losing faith in humanity one person at a time.
  40. Paper on multithreading by Richard+W.M.+Jones · · Score: 2, Interesting
    It's not a particularly new idea. I wrote a pretty detailed paper at university about multithreading. You can read it here:

    http://www.annexia.org/tmp/multithreading.ps

    Rich.

  41. sounds hyper by pyrrho · · Score: 2, Funny

    all you need is the ability to run processes... which I do right here.... on this abacus...

    --

    -pyrrho

  42. Bigger version of an existing idea. by Bruce+Perens · · Score: 2, Informative
    My Pentium 4 processor has 2 threads. Linux treats them as 2 processors, and makes full use of them. Yes, it's cool to have 8 cores and 4 threads per core. But this is all about price/performance. An 8-core chip that shares the cache, VM infrastructure, and memory interface between all cores is going to work best for CPU-intensive tasks that are not also I/O or memory-intensive and can be partitioned into multiple threads easily. Not photorealistic rendering, for example, that requires too much data. And it won't handle separate-process loads as well as 8 blades would. So, watch all of those parameters: price, power, memory bandwidth, how cache and VM are done, PC board real-estate, etc. How they are combined will tell you whether that chip is a win or lose for your application.

    Bruce

    1. Re:Bigger version of an existing idea. by Bruce+Perens · · Score: 2, Informative
      It's designed for efficient, high-density, highly multi-threaded blade applications..

      Um. The whole point of blades is that they don't have the expensive interconnect. So, here's an architecture with the interconnect in the chip, which would make it cheaper. But it's still not clear to me that a big system built around one die that uses transistors really efficiently is going to be less expensive than eight smaller systems that don't use their transistors as efficiently.

      Bruce

  43. Re:really....? by Bruce+Perens · · Score: 2, Informative
    I thought Intel's HT processors still only had one execution unit. They just have two fetch and decode processes and fast context switching between them.

    I wasn't even assuming they have that much. The minimum you need to make this trick work is two independent contexts. That means two copies of all kernel-visible control and data registers. You would probably not need to save internal microstate unless you need it to restart a long-running instruction.

    Anything else on top of that is optimization.

    Bruce
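
    To make the "two independent contexts" point concrete, here is a toy C model. Everything in it (the `toy_cpu` and `hw_context` names, the 8-register file, the fake "instruction") is made up for illustration and is not how any real Intel or Sun part is built. It only shows that with two copies of the software-visible registers, switching threads means flipping a selector, with no save or restore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CONTEXTS 2   /* two hardware threads, as in the comment above */
#define NUM_REGS     8   /* hypothetical, tiny register file */

/* One full copy of the software-visible state for a hardware thread. */
typedef struct {
    uint32_t regs[NUM_REGS];
    uint32_t pc;
} hw_context;

/* The shared "core": one execution path, two register contexts. */
typedef struct {
    hw_context ctx[NUM_CONTEXTS];
    int active;              /* which context the core is running */
} toy_cpu;

/* "Execute" one fake instruction on the active context:
   bump r0 and advance the program counter by one 4-byte instruction. */
void toy_step(toy_cpu *cpu)
{
    assert(cpu != NULL);
    hw_context *c = &cpu->ctx[cpu->active];
    c->regs[0] += 1;
    c->pc += 4;
}

/* Switch hardware threads: only the selector changes, nothing is copied. */
void toy_switch(toy_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->active = (cpu->active + 1) % NUM_CONTEXTS;
}
```

    Stepping context 0 twice, switching, and stepping once leaves context 0 untouched by the second thread's work, which is all the OS needs in order to see "two processors".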

  44. Bleah. FUD. by Cryptnotic · · Score: 2, Informative
    #pragma omp parallel for private(sum) reduction(+: sum)
    Bleah. FYI, I'm pretty sure GCC will reject this. Even the newest versions.

    I just tested it with GCC 2.95.3, 3.2.1, 3.3, and 3.4.2, and it works fine. Of course, GCC is just ignoring the #pragma. I didn't know about OpenMP before this, but it does look like a good way to "optimize later" and have your code still compile with gcc. And you don't have to write and maintain two different versions separated by #ifdef, #else, #endif.

    --
    My other first post is car post.
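
    For anyone who wants to try the "optimize later" idea from the comment above, here is a minimal sketch of a reduction loop in C. The function name `sum_array` is made up for the example. It also drops the `private(sum)` clause from the quoted pragma: as far as I can tell, OpenMP does not allow the same variable in both `private` and `reduction`, and `reduction(+: sum)` already gives each thread its own partial sum. A compiler without OpenMP support just ignores the pragma and runs the loop serially, so the same source builds either way.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the first n elements of data.
   With OpenMP enabled, the loop iterations are divided among threads and
   the per-thread partial sums are added together by the reduction clause.
   Without OpenMP, the pragma is ignored and this is a plain serial loop. */
double sum_array(const double *data, int n)
{
    assert(n >= 0 && (data != NULL || n == 0));

    double sum = 0.0;
    int i;

#pragma omp parallel for reduction(+: sum)
    for (i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

    The nice part is the one noted above: there is a single version of the loop to maintain, with no #ifdef split between a threaded and a serial build.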
  45. Re:Hyperthreading by owlstead · · Score: 2, Funny

    This is not college. Slashdot does not start with "there are no stupid questions". There are, you asked one, AND it was already more covered than the genitals in a tiroller soft sex movie.

  46. Re:really....? by ckaminski · · Score: 3, Interesting

    The only thing Intel's hyperthreading buys you, and what most simultaneous multithreading implementations buy you, is a solution to the cache miss problem. If your pipeline stalls, you simply execute the next thread in the list until you get the data you need.

    Now, what some sophisticated designs do, and what I'd expected the P4 to do, is turn the extra parallel execution units into independent ones, so you could issue 2 or 3 instructions simultaneously and forgo all the branch prediction, etc.

    Turns out that the P4's 20-stage pipeline needed help. SMT/Hyperthreading was it.
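
    The "if your pipeline stalls, run the next thread" idea is easy to play with in a toy simulation. The sketch below is only a model of that description, not of the real P4: the function name `busy_cycles`, the fixed run length between misses, and the fixed miss latency are all assumptions picked for illustration. Each thread computes for `run_len` cycles, then waits `miss_latency` cycles on memory. The core stays on the current thread while it is ready and otherwise picks the next ready one.

```c
#include <assert.h>

#define MAX_THREADS 16

/* Count how many of total_cycles the execution unit spends doing work
   when nthreads hardware threads share it and the core switches to
   another ready thread whenever the current one is waiting on memory. */
int busy_cycles(int nthreads, int run_len, int miss_latency, int total_cycles)
{
    int ready_at[MAX_THREADS] = {0};  /* first cycle each thread can run again */
    int run_left[MAX_THREADS];        /* compute cycles left before next miss */
    int cur = 0;                      /* thread that ran most recently */
    int busy = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    assert(run_len >= 1 && miss_latency >= 0);

    for (int t = 0; t < nthreads; t++)
        run_left[t] = run_len;

    for (int now = 0; now < total_cycles; now++) {
        /* Prefer the current thread, then the others in round-robin order. */
        for (int k = 0; k < nthreads; k++) {
            int t = (cur + k) % nthreads;
            if (ready_at[t] <= now) {
                cur = t;
                busy++;
                if (--run_left[t] == 0) {
                    /* Thread hits a cache miss after this cycle. */
                    ready_at[t] = now + 1 + miss_latency;
                    run_left[t] = run_len;
                }
                break;
            }
        }
        /* If no thread was ready, the cycle is simply lost. */
    }
    return busy;
}
```

    With 2 compute cycles per miss and a 6-cycle miss, one thread keeps the unit busy 20 cycles out of 80, two threads 40, and four threads all 80. Throughput goes up even though no single thread gets any faster, which is the whole pitch of the article.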