Slashdot Mirror


Emergence of SMT

yellow writes "SMT, or Simultaneous Multithreading, is a concept that is rapidly gaining adherents in the microprocessor area. It essentially allows for a single processor with multi-processor capabilities in both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism). When comparing SMT vs. dual or multiprocessor performance data, it is important to compare apples to apples, and understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup. This is the discussion topic of a new feature on HWC."


  1. SMT is really the reaction to the death of PC's by heroine · · Score: 2

    The processor world right now is 99% driven by embedded systems. You need to put multiple cores on the same die to boost embedded performance since there's not enough room for multiple processors. Now that PC's are dead we're seeing a spike in SMT press releases even though the technology has been floating around forever.

    The fact that Intel's last processor release was a "mobile processor", Intel's future x86 vapormap includes pure SMT chips, and Compaq's future vapormap includes an SMT alpha shows how important that size reduction is.

  2. Re:Multi Processors under Win9x by mattdm · · Score: 2
    Or the Linux 2.4 kernel.

  3. Time for some education on computer architecture.. by slothbait · · Score: 5

    > SMT? Blow it out your ass.

    Rather unnecessary, don't you think? I don't think you would be quite so hostile to SMT if you had a better understanding of it.

    Time to do some informing...

    > Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?

    This statement doesn't make much sense. I think you mean that a CPU which spent the die area on additional functional units as opposed to SMT support could achieve a greater MIPS value. This is true, but only for theoretical MIPS. Simply adding more functional units to modern chips would *not* improve actual performance. Explanation follows...

    > The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.

    So you are stating that implementing SMT comes at a cost in die area? Of course it does, but the important point is that using that die area instead to add more conventional execution units would *not* increase the performance of the processor. Why, you ask? There is a limited amount of instruction level parallelism available in a single thread of program execution. Current wide-issue superscalars get something on the order of 2-3 dispatches per clock, despite the fact that they have the *capability* to issue far more. The processor simply cannot find enough independent instructions to keep its functional units full. If memory serves, the Athlon is a 9-issue core. You could add functional units up to 12-issue or more, but your actual dispatch rate would still be around 2-3 per clock. While your theoretical performance would increase, your actual performance would remain stagnant.

    So, current software does not exhibit enough parallelism to keep the functional units in even current processors busy. SMT proposes to increase available parallelism by issuing instructions from *multiple* threads at once. Instructions from different threads are guaranteed to be independent, so if you have n threads running at once, your number of available instructions for dispatch each clock is improved about n times. Of course, this method has a cost in complexity and area -- now the CPU has to have knowledge of threads, and keep a process table on die. However, provided many threads are run at once, this *greatly* increases the utilization of the processor's resources, and thus the performance of the part.
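
    A toy model of the argument above, not a description of any real core: assume each thread can offer at most a fixed number of independent instructions per cycle, and the core can issue at most its width. The function name and the 2.5 figure (picked from the 2-3 range quoted above) are illustrative assumptions.

```python
def issued_per_cycle(width, ilp_per_thread, n_threads):
    """Instructions dispatched per cycle in a deliberately crude model.

    width          -- issue slots the core has each cycle
    ilp_per_thread -- independent instructions one thread can offer per cycle
    n_threads      -- hardware threads feeding the issue logic (1 = no SMT)
    """
    return min(width, ilp_per_thread * n_threads)


if __name__ == "__main__":
    # Widening a single-threaded core from 9 to 12 slots changes nothing,
    # while adding threads to the 9-wide core fills more of its slots.
    for width, threads in [(9, 1), (12, 1), (9, 2), (9, 4)]:
        print(width, threads, issued_per_cycle(width, 2.5, threads))
```

    In this sketch the extra issue slots only pay off once more than one thread is supplying instructions, which is the point being made about theoretical versus actual performance.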

    > Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.

    ...this is a valid point. SMT particularly increases the burden on instruction fetch and cache, since it is pulling from several different streams at once. However, there are methods that can somewhat compensate for the contention of resources introduced. Now, you have multiple threads available at all times. So, when one thread stalls on a cache miss, the processor can dispatch a different thread to run while the cache miss on the first is being serviced. This effectively hides the latency of the cache miss since the processor is able to do useful work during the service. You see, it's all about keeping those functional units busy.
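
    The latency-hiding idea can be sketched with a small cycle-by-cycle simulation. This is a simplified switch-on-miss model under assumed numbers (10 cycles of work between misses, 40 cycles per miss); real SMT cores interleave at a finer grain, and busy_fraction is a made-up name.

```python
def busy_fraction(n_threads, run_cycles, miss_latency, total_cycles=10_000):
    """Fraction of cycles in which the core does useful work.

    Each thread computes for run_cycles, then misses the cache and cannot
    run again until miss_latency further cycles have passed. Every cycle the
    core runs the first thread that is ready, if there is one.
    """
    ready_at = [0] * n_threads        # cycle at which each thread may run again
    left = [run_cycles] * n_threads   # work cycles left before the next miss
    busy = 0
    for now in range(total_cycles):
        for t in range(n_threads):
            if ready_at[t] <= now:
                busy += 1
                left[t] -= 1
                if left[t] == 0:      # this thread just hit its cache miss
                    ready_at[t] = now + 1 + miss_latency
                    left[t] = run_cycles
                break                 # one thread per cycle in this model
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, busy_fraction(n, run_cycles=10, miss_latency=40))
```

    With these assumed numbers a single thread keeps the core busy a fifth of the time, and each extra thread fills part of the gap until the miss latency is fully covered.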

    > So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..

    If you believe this, then you should be pro-SMT. SMT doesn't address increasing potential instructions performed per second. Instead, it is an attempt to close the gap between *actual* performance and *theoretical* performance by keeping more of your processor busy.

    >you have to question the thinking behind such a modification.

  4. Re:SMT != ILP != multiple pipes. by Mr+Z · · Score: 2
    SMT is a relatively new idea that lets you easily boost the instruction-level parallelism, which in turn makes scheduling and issuing instructions *much* easier.

    That's true only if you have sufficient hardware registers (not to be confused with architectural registers), and your tasks aren't bottlenecked on memory. If you have truly independent tasks, then each will effectively see half as much cache at all levels of the hierarchy. This can get especially painful in L1I -- if you can't keep the CPU fed from both streams, oops! And, large register files can be a speed limiter in the architecture. (On the plus side, though, the hardware register files can be distinct between the threads, so it's not too bad. As I understand it, it's the unified register files that are the real problem.)

    One of the main attractive things I see in SMT is that you can effectively make the pipeline deeper on the architecture while completely hiding it. This is important on VLIW-style architectures that have an exposed pipeline. To make it deeper, you need to somehow hide the fact that it's deeper than the code thinks it is. One way is to add interlocks (gradually making stages of the pipeline protected, rather than exposed). Another is to interleave multiple threads, so that each thread sees the pipeline length it expects, but the actual pipeline is some factor larger.

    Of course, in this VLIW world, things can get tricky outside the CPU. Because you lack superscalar issue (that's what VLIW's about), the issue of stalls becomes a problem. "One-stall-all-stall" is an oft-mentioned SMT VLIW technique, and with it, you really need to make sure you aren't bottlenecked on memory before you go down the SMT path. "One-stall-all-stall" means if one SMT thread stalls, all threads stall... As I understand it, it's the "cheap" way to maintain the VLIW state in an SMT VLIW machine, but it also amplifies any memory system bottlenecks you might have.
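
    A rough way to see how "one-stall-all-stall" amplifies memory stalls, assuming stalls are fixed per-cycle events that do not shift when the machine is held up (a simplification). The function name and the stall patterns are invented for illustration.

```python
def useful_thread_cycles(stalls, one_stall_all_stall):
    """Count thread-cycles of progress.

    stalls[t][c] is True when thread t would be stalled in cycle c.
    With one_stall_all_stall, no thread advances in a cycle where any
    thread is stalled; otherwise each unstalled thread advances.
    """
    n_cycles = len(stalls[0])
    total = 0
    for c in range(n_cycles):
        stalled = [thread[c] for thread in stalls]
        if one_stall_all_stall:
            if not any(stalled):
                total += len(stalls)
        else:
            total += stalled.count(False)
    return total


# Two threads, eight cycles, each stalled for two (non-overlapping) cycles.
example = [
    [True, True, False, False, False, False, False, False],
    [False, False, False, False, True, True, False, False],
]
independent = useful_thread_cycles(example, one_stall_all_stall=False)
lockstep = useful_thread_cycles(example, one_stall_all_stall=True)
```

    In this example the lockstep policy loses every cycle in which either thread is waiting on memory, so non-overlapping stalls cost twice as much as they would with independent stalling.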

    --Joe "Mr. VLIW"
    --
  5. Re:Back to the Future by artdodge · · Score: 2

    Some would argue that EPIC (the ISA for IA64 chips like Itanium) is a fundamental change in hardware design - it combined VLIW (an old idea) with explicit encoding of inter-opcode dependency, among other things. That kind of explicit "helper" data for internal ILP engines could end up proving valuable, if the compiler technology can keep up.

  6. CISC, RISC, and VLIW. It's very simple. by landley · · Score: 2
    The bottleneck isn't inside the processor, we can clock multiply those suckers into the stratosphere. A modern Pentium or Athlon is going ten times the bus speed, and that can be increased easily. As long as you're executing out of L1 cache inside the processor, clock multiplying is a big win and better design can go hang.

    The BOTTLENECK is the memory bus. Refilling the cache from a hose that's 1/10th the speed of the processor. That's why CISC lasted so long in the first place, and is still with us today. CISC has variable length instructions. If you can express an instruction in 8 bits, you do so. 16 for the more complex ones, 32 bits for the really complex ones. So when you're sucking data into the cache 32 bits at a time, you can get 2 or 3 instructions in a 32 bit mouthful. (Or, in the case of pentiums, 64 bits to feed 2 cores, but the principle's the same.) You're optimizing for the real bottleneck with compressed instructions.
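
    The code-density claim is simple arithmetic: instructions delivered per fetch is the fetch width divided by the average instruction length. The instruction mix below is a made-up example, not measured data.

```python
def instructions_per_fetch(fetch_bits, mix):
    """Average instructions delivered per fetch.

    mix maps instruction length in bits -> fraction of the instruction stream.
    """
    avg_bits = sum(bits * frac for bits, frac in mix.items())
    return fetch_bits / avg_bits


# Hypothetical variable-length mix: half 8-bit, 30% 16-bit, 20% 32-bit.
variable_length = instructions_per_fetch(32, {8: 0.5, 16: 0.3, 32: 0.2})
# Fixed 32-bit encoding for comparison.
fixed_length = instructions_per_fetch(32, {32: 1.0})
```

    Under that assumed mix a 32-bit fetch carries a little over two instructions, against exactly one for the fixed-length encoding.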

    The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts. But Sparc, PowerPC, and even Alpha haven't displaced Intel because the real bottleneck is the memory bus, and bigger instructions aren't necessarily a win. (That and Intel translates Cisc to Risc inside the cache, and pipelines stuff.)

    VLIW as Itanium picked it up sucks so badly because the real bottleneck is sucking data from main memory, and now they want 192 bits of it per clock! For only three instructions, and on average at least one will probably be a NOP. Crusoe has a MUCH better idea, sucking compressed CISC instructions in and converting them to VLIW in the cache (like Pentium and friends do for CISC to RISC).

    This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs. They don't deal with the real problem, the memory bus de-optimization by reverting to full-sized instructions all the time.

    Rob

  7. Re:Too Many TLAs [ot] by wik · · Score: 3

    This is perhaps one of the most useful sites in today's world of technobabble: www.acronymfinder.com. It lists 19 different meanings for "SMT", none of which are Simultaneous MultiThreading! :-)

    --
    / \
    \ / ASCII ribbon campaign for peace
    x
    / \
  8. "Bollocks" ? by Christopher+Thomas · · Score: 2

    IMHO, SMT is a load. Modern microprocessors are mostly cache-starved. SMT puts two processors on the wrong side of the L1$, aggravating the cache bandwidth problem. Worse, the two processors in SMT degrade referential locality, further degrading the performance of the cache.

    You overlook a couple of very important factors.

    First of all, it would cost you almost no extra silicon or latency to have duplicate L1 caches, and to add a selection bit to the addresses sent out on memory operations.

    Secondly, technologies like SMT help _save_ you when you have a cache miss, because you still have an instruction stream that can execute while one thread's waiting for data.

    1. Re:"Bollocks" ? by Christopher+Thomas · · Score: 2

      My arguments are coming from the "memory wall" perspective of system performance. CPU cycles are no longer the problem: the problem is getting enough data to the CPU core.

      And this depends entirely on your workload. Many tasks are memory-bound - and many are not. Generally, anything that can fit in the on-die caches will be CPU bound (for most cases). This still covers a wide range of useful problems.

      The gap between memory speed and CPU speed is caused by DRAM latency and system bus speed, neither of which are issues for on-die caches. If clock speed increases and die sizes stay the same, propagation latency will become an issue, but SMT is great for alleviating _that_, too; as long as throughput scales with clock speed, you can tolerate higher latency by interleaving requests from different threads.

      In summary, I think that memory bottleneck problems aren't as severe as you make them out to be. Yes, they're very relevant for programs that work with large data sets, but that by no means covers all tasks we want computers to perform.

    2. Re:"Bollocks" ? by Christopher+Thomas · · Score: 5

      It's not true that doubling L1$ and adding a selection bit costs you nothing. In fact, the size of L1$ is rather limited, and cutting size in half substantially increases the miss rate. It is also fairly expensive to add selection bits.

      Um, no.

      Most of your die is taken up by the _L2_ cache. You have plenty of space to add more L1 cache. The reason you usually don't is that a larger L1 cache served by the same set of address lines has longer latency. Two independent duplicates of an L1 cache will behave identically to the original L1 cache.

      Performing the selection adds latency, but this can be masked because you know the value of the selection bit long before you know the value of the address to fetch.

      In fact, you'd almost certainly _reduce_ the cache load compared to a single-threaded processor capable of issuing the same number of loads per clock, because they'd be hitting different caches, and you wouldn't have to multiport.
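
      A minimal sketch of the duplicated-L1 idea, assuming a direct-mapped cache and using the thread id as the selection bit. The class, the run helper, and the sizes are hypothetical choices for illustration, not taken from any shipping design.

```python
class DirectMappedCache:
    """Tiny direct-mapped cache that only tracks tags, hits and misses."""

    def __init__(self, n_lines, line_bytes=32):
        self.n_lines = n_lines
        self.line_bytes = line_bytes
        self.tags = [None] * n_lines
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line_bytes
        index = block % self.n_lines
        tag = block // self.n_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag      # fill the line on a miss
        self.misses += 1
        return False


def run(caches, trace):
    """trace is a list of (thread_id, address); thread_id selects the cache."""
    for thread_id, addr in trace:
        caches[thread_id % len(caches)].access(addr)
    return sum(c.misses for c in caches)


# Two threads hammering addresses that map to the same cache line.
trace = [(0, 0), (1, 64 * 32)] * 8
shared_misses = run([DirectMappedCache(64)], trace)
split_misses = run([DirectMappedCache(64), DirectMappedCache(64)], trace)
```

      With one shared cache the two threads evict each other on every access; with a cache per thread, each behaves as it would on a single-threaded core.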

      SMT also doesn't save you from cache miss latency. Out-of-order instruction issue saves you from that.

      SMT, in any sane design, is used on an OOO core. An OOO core won't save you if your next set of instructions has a true dependence on the value being fetched from memory. SMT gives you a second thread with no data dependence on the stalled load, and hence plenty of instructions in the window that you can execute while waiting.

      I'm having trouble seeing where your arguments are coming from. As far as most of the core's concerned, there's still only one (interleaved) instruction stream, just with less data dependence in it. This is scheduled and dispatched as usual.

    3. Re:"Bollocks" ? by barracg8 · · Score: 2
      • SMT also doesn't save you from cache miss latency.
      Please enlighten us.
      I'm sat here working on the software side of an SMT project. This is exactly where SMT offers benefits. The processor switches threads on a cache miss. A 4-way SMT scheme can offer > 2x performance for 2x die size. An SMT processor cannot reduce the cache latency experienced by a particular thread of execution, but it can reduce the amount of time that execution units sit idle.
    4. Re:"Bollocks" ? by maraist · · Score: 2

      Put up a CPU monitor and see how much time your machine spends CPU-bound: in most cases, the CPU is very rarely fully loaded, and actually spends much of its time waiting for disk.

      Disk-IO should never be a problem. If it is, then you need to alter the system. If your work-station is disk-bound, then you simply add more memory, if your high-end server is disk bound, then you put in a very expensive RAID.

      Typical applications are Office suites, video games, web servers and databases. Well a game should never have to hit the disk except for level loads / movies. Office should only have a hit the first time you start it up(and nowadays that's reduced by boot-time prefetching). web-servers should _always_ contain enough memory to serve the majority of the web-site. And professional Databases will typically demand multiple expensive drives.

      Most other operations only require a moderate use of IO, which is further reduced by OS level caching. It's not hard at all to get 2 or more CPU-load doing things like compiles (which are obviously very IO bound). The CPU is more than taxed in such circumstances.

      All levels could be both larger and faster
      There is most certainly a boost from memory latency and bandwidth enhancement, but I don't agree that these are the universal bottleneck.. Many applications are nicely optimized to fit within the half meg of L2 still commonly found. This applies to video games, heavy-duty compression tools, etc. For example, I got 99% performance boost on a dual-processor when doing MP3 encoding. Obviously main memory / disk-IO wasn't the bottle-neck.

      I think Lx$ becomes a bottle-neck when you context-switch often; thereby flushing the $.

      Unfortunately economically sound SMT isn't going to be as fast as SMP because you can't have fully redundant and optimized L2 (which will affect large-data-set applications). What I see happening is the use of thread-independent register sets and L0 cache (for the x86 processors at least), then having a large number of ports on the shared L1, and finally a minimally ported, though larger than current on-die L2. There would possibly be a very large off-die L3 (typically targeted at 2Meg).

      It won't help all applications (Apache 1.x or Postgres 7 certainly won't improve), but more and more Solaris and Windows applications could definitely benefit (heck, even the traditional win-benchmark Quake 3 is MT aided).

      What I see as ideal is SMT/P interleaved memory. You use 2-way SMT to take up the slack where ILP bottles up. Then you have a second separate chip (possibly on the same cartridge, sharing an external L3 $ ). In this way, you have a minimal increase in complexity of the core, and you optionally sell more cores (kind of like the 3DFX's VSA-100 mentality.. Make the core simple and scalable). So long as you have a large enough $ and a minimal number of loading processes, putting all the cores on the same system-memory-bus shouldn't be a problem (thus alleviating the complexity of the EV8 point-to-point bus). I don't believe AMD's P2P architecture is going to be worth the added cost, delays, and complexity. Not to mention, main-stream memory can't handle the additional BW.

      --
      -Michael
  9. SMT != ILP != multiple pipes. by Christopher+Thomas · · Score: 5
    I thought that the major processor companies had been working with multiple execution pipelines for years now. Doesn't that fall under the category of ILP?

    You might want to doublecheck the terms you're using:

    • "ILP" is "instruction-level parallelism". It's not a physical part of the chip - it's a quality of the instruction stream. ILP is the number of instructions (usually average) that could theoretically be executed at one time, without violating data relationships within the program. Modern processors _can_ execute multiple instructions per clock because the ILP of most programs is greater than one (i.e. there are usually multiple instructions that can be executed without violating data or control dependencies).

    • "Multiple pipes" is part of the hardware that allows processors to issue multiple instructions per clock. As the name implies, this represents multiple hardware units that are capable of performing operations independently of each other.

    • "SMT" is "Simultaneous Multithreading". Remember how back under ILP, I said that the number of instructions that can be issued per clock depends on the parallelism of the program being run? SMT boosts the parallelism by running two threads at the same time and interleaving their instructions (more or less). As the instructions from different threads usually don't care what the other threads are doing, this gives you many more instructions that can be executed at the same time (assuming you have enough hardware to execute them).



    Multiple pipes are a relatively old idea. Ditto instruction-level parallelism, which is one of the analytical quantities used to judge how well multiple pipes will work in a given situation. SMT is a relatively new idea that lets you easily boost the instruction-level parallelism, which in turn makes scheduling and issuing instructions *much* easier.
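
    The definition of ILP above can be made concrete: count the instructions and divide by the length of the longest chain of true data dependences. The sketch below assumes unit latency, unlimited execution units, and that register renaming removes everything except read-after-write dependences; the instruction format and register names are invented.

```python
def ilp(instrs):
    """Average ILP of an instruction stream (must be non-empty).

    instrs is a list of (dest_reg, [source_regs]) in program order.
    """
    ready = {}   # register -> cycle at which its latest value is available
    depth = 0    # length of the longest dependence chain seen so far
    for dest, srcs in instrs:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = start + 1
        depth = max(depth, start + 1)
    return len(instrs) / depth


# A thread that is one long dependence chain: ILP of 1.
thread_a = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
# A second thread with the same shape but its own registers.
thread_b = [("s1", []), ("s2", ["s1"]), ("s3", ["s2"]), ("s4", ["s3"])]
# Interleave the two streams instruction by instruction, as SMT roughly does.
interleaved = [ins for pair in zip(thread_a, thread_b) for ins in pair]
```

    Each thread alone has an ILP of 1 here, while the interleaved stream has an ILP of 2, because nothing in one thread depends on the other.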
  10. Back to the Future by Detritus · · Score: 3

    One of my favorite computer architectures is the CDC 6000 series. It had a Peripheral Processor (PP) that did all of the system I/O. The main CPU crunched numbers while the PP dealt with the outside world. The cool thing about the design of the PP was that it appeared to be 10 independent processors, even though it only had one ALU, instruction decoder etc. This was accomplished by a "barrel" of 10 sets of CPU registers and memory banks. The PP would rotate the barrel every time an instruction was fetched and executed, turning one physical CPU into 10 virtual CPUs. This meant that the PP could simultaneously execute 10 different programs without having 10 hardware CPUs. I've often wished there was a microprocessor that could do this. It would be great for embedded real-time systems and I/O controllers. Each I/O device and/or subsystem could have its own virtual CPU, that would never get swiped by other tasks or I/O interrupts.
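
    The barrel scheme described above can be sketched in a few lines. This is only an illustration of the rotation idea: the two-operation instruction set (add, mul) and the Barrel class are invented and have nothing to do with the real PP instruction set.

```python
class Barrel:
    """One shared execution unit rotating over several register sets."""

    def __init__(self, programs):
        # One register set (program counter + accumulator) per virtual CPU.
        self.contexts = [{"pc": 0, "acc": 0, "prog": p} for p in programs]
        self.slot = 0

    def step(self):
        """Execute one instruction for the current slot, then rotate."""
        ctx = self.contexts[self.slot]
        if ctx["pc"] < len(ctx["prog"]):
            op, arg = ctx["prog"][ctx["pc"]]
            if op == "add":
                ctx["acc"] += arg
            elif op == "mul":
                ctx["acc"] *= arg
            ctx["pc"] += 1
        self.slot = (self.slot + 1) % len(self.contexts)

    def run(self, steps):
        for _ in range(steps):
            self.step()
        return [c["acc"] for c in self.contexts]


# Ten virtual CPUs, each running its own two-instruction program.
programs = [[("add", i), ("mul", 2)] for i in range(10)]
results = Barrel(programs).run(20)
```

    Each virtual CPU gets exactly one instruction slot in every rotation, so no program can take time away from another, which is the property that makes the scheme attractive for I/O work.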

    --
    Mea navis aericumbens anguillis abundat
    1. Re:Back to the Future by Raven667 · · Score: 2

      *Sigh* . . . It seems that no one really does much research into CPU and computer system design anymore. All the major architectures are pretty homogenized, they are either full RISC machines (probably a tad bloated with too many instructions) or RISC machines emulating a CISC instruction set (x86). RISC seems the last fundamental change in hardware design, the processors get smaller and faster but not much really changes.

      One thing that I like, at least as a concept, is the IBM S/390. The IBM Mainframe and its customers have been living in a pretty insular world over the last 20 years and the hardware/software that runs on this beast is just, different. Some things are goofy, IIRC it boots off an emulated punch card reader (!!), or nifty, like the magic migrating storage system.

      --
      -- Remember: Wherever you go, there you are!
  11. Re:high-class SMP on x86? but *why?* by Raven667 · · Score: 2

    While in many senses you are right, I think you are pointing to the wrong issue. It is not something inherent in the x86 arch that causes problems in scaling, it is mostly Intel's SMP bus design. Having all the CPU's share a single, shared, bus between each other and system memory is the bottleneck. I mean, look at the Athlon, it isn't riding on an Intel designed bus, it rides on a DEC designed EV6, originally made for the Alpha.

    While Beowulf is a nifty technology, it does not solve all the scaling problems as you might think. Beowulf clusters are only useful for a specific subset of available problems, stuff that can be easily split up and sent to many, semi-independent, processing nodes. Beowulf clusters are generally connected together with 1Gb or 100Mb Ethernet which does not have high bandwidth or low latency compared to the CPU-Memory bus in even the cheapest computers. I would take a single 128 CPU box over 64 dual proc boxes connected via 1Gb Ethernet (or even Myrinet) any day.

    --
    -- Remember: Wherever you go, there you are!
  12. Re:CISC, RISC, and VLIW. It's very simple. by untulis · · Score: 2

    The BOTTLENECK is the memory bus.

    OK.

    That's why CISC lasted so long in the first place, and is still with us today.

    No way. CISC (i.e. x86) lasted so long because of duopoly action and backward compatibility. In fact, like you said, CISC is dead because ever since Pentiums, x86 chips have been RISC on the inside and CISC to the outside world (to varying degrees).

    The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts.

    Or n at a time. Any OOO RISC processor these days worth its snot decodes 4 ops/clock, some are at 6 or 8. (If it can't retire that fast, it doesn't really matter...)

    Alpha haven't displaced Intel because the real bottleneck is the memory bus

    Really? For scientific computing, which is where you have really big datasets and memory bandwidth is key? I don't think you see x86 there very much. You see DEC, IBM, Sun and HP. Who are all, surprise, surprise, RISC-based hardware vendors. Many RISC chips (Alpha, POWER, PowerPC) have long since passed x86 in sheer performance, especially on FP. Intel has definitely won in price/performance, but I would argue that's more due to volume than anything else.

    This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs.

    Umm, Alpha EV8 uses SMT. Not VLIW. Itanium is VLIW-like. Doesn't use SMT. Example no worky.

    Not saying that the concept is wrong, that SMT as a concept might alleviate some of the performance issues with superfluous instructions in a VLIW instruction stream. But that's sort of the point of VLIW, to let the compiler, rather than OOO hardware, figure out how to best use the available functional units as much as possible. It puts NOPs in to keep the instruction stream balanced so the decoder can work in a predictable way just like in RISC.

  13. Re:SMT ... by Big+Jojo · · Score: 2

    Say what? Most applications can't fill a deep pipe, even out-of-order and with aggressive prefetching. The ways this stuff wins include having two (or maybe more!) instruction streams to crunch, and switching away from the one that's now blocking on a memory access. Prefetch on the other surely completed already ... The P4 is a good example of a pipeline that's too long.

    And by the way, why has this taken so long to arrive? It's still not something I can purchase yet, and I first heard of it back in 1992. There's something fishy.

  14. SMT? Blow it out your ass. by ikekrull · · Score: 2

    Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?

    The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.

    Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.

    So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..

    It's like putting a 700 cubic inch supercharged W16 engine constructed from 3 straight-8 blocks into a VW Kombi van.

    Sure, it'll theoretically go pretty fast, but when it's parked by the side of the road 340 days out of the year and only ever driven by a bunch of hippies who are too stoned to see the road properly at 20 kmph, you have to question the thinking behind such a modification.

    --
    I gots ta ding a ding dang my dang a long ling long
    1. Re:SMT? Blow it out your ass. by ikekrull · · Score: 2

      Sure, but you need to use some of the parts from the third engine block to construct the more complex W-16 configuration, which would theoretically provide more usable power than, say, a straight-24.

      --
      I gots ta ding a ding dang my dang a long ling long
  15. Re:A nice article about SMT on the Alpha by be-fan · · Score: 2

    Yikes. The EV8 will dissipate 250watts! That's more than my monitor! Of course, watts are good ;) I want one (or four, this does SMP right?)

    --
    A deep unwavering belief is a sure sign you're missing something...
  16. Re:Sun's already done it by barracg8 · · Score: 2
    This is not SMT. To quote the website.
    • It includes 2 tightly coupled processor units
    In SMT, you have multiple threads of execution (eg. multiple PCs) feeding one CPU.
  17. SMT ... by taniwha · · Score: 2
    Surface Mount Technology .... of course all us chip weenies think packaging when you use that TLA.

    Seriously though, threading like that is kind of at odds with today's very long pipelines (basically the cost of a thread switch can be very high if you have to fill a deep pipe). With heavily out-of-order systems this can be less of a problem .... but you're still stuck with the problem that if you're using a larger percentage of the CPU's real clocks then you're going to put more pressure on shared resources like caches and TLBs - larger L1s/TLBs are going to potentially hit CPU cycle time and of course these days L2 can take a large percentage of your die size (after all the goal here is to get more useful clocks/area)

  18. Dean Tullsen's papers by Seenhere · · Score: 4
    To dig into this a bit deeper, here's a link to some research papers by the guy who invented SMT (it was the topic of his PhD thesis back in 1996).

    For your bedtime reading, y'all.

    --Seen

    --
    "I used to be a dilettante. Then I thought I'd try something else for a while."
  19. SMT Explained by Carnage4Life · · Score: 3

    For those with a technical bent who were disappointed by the lack of information on SMT in the linked article, here are some better resources:

    Introduction to Simultaneous Multi-threading from UMass .

    Quick Quiz on SMT.

    Caches for Simultaneous Multithreaded Processors: An Introduction

  20. high-class SMP on x86? but *why?* by The_Messenger · · Score: 2
    When comparing SMT vs. dual or multiprocessor performance data, it is important to compare apples to apples, and understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup.
    With all due respect for my Intel-loving friends here on Slashdot, I feel that you'll never see much of the real advantages of SMP on x86 boxes anyway. Even if it were possible for Joe Average Computer Scientist to obtain an Intel box with more than eight CPUs, the severe bandwidth limitations of the x86 architecture become apparent with even four Pentium IIIs. Only machines with high-bandwidth architectures (most notably those from IBM's mainframe division, SUN, and SGI) are able to compensate for the exponential growth in processing overhead with more than eight CPUs, and it's no coincidence that these companies are where you turn to when you need a machine with between eight and sixty-four CPUs.

    My feelings shall be vindicated when SMP Athlon machines become readily available. Their comparatively minor bandwidth advantage will let them blow similarly-clocked Intel boxes out of the water.

    Personally, I feel that the best way to scale x86 to supercomputing levels is through clustering, such as is offered by the venerable Beowulf for GNU/Linux. GNU/Linux, for better or worse, is continuing to grow in popularity, and I would like to see commercial software vendors try releasing Beowulf-enabled software for Linux. Imagine being able to buy Oracle for Beowulf! Okay, poor example; Oracle is memory-intensive rather than CPU-intensive, and an RDBMS is one application which is so dependent on a fast disk and good caching that the advantages pale in comparison to the potential problems. What would really be cool are Beowulf ports of statistical analysis and 3D-rendering software. Oooh, yeah... after all, The Matrix and Titanic have both proven the effectiveness of free x86 Unix-workalikes in render farms... I believe that those two movies respectively used FreeBSD and GNU/Linux.

    --
    I like to watch.

  21. Re:CISC, RISC, and VLIW. It's very simple. by lamontg · · Score: 2

    The point of SMT is that if one thread gets a cache stall because it has to hit main memory, then another execution thread has its instructions loaded into the CPU. SMT is actually one way to help reduce the CPU-memory bottleneck.

  22. Re:Bollocks by lamontg · · Score: 3
    I'd suggest you read this paper, this paper, or this paper.

    Pay particular note to the fact that you can take an existing superscalar chip and add SMT for only about a 10% chip real estate premium, while it should be able to double throughput. That's a lot better than trying to double throughput by adding another CPU to a machine or by adding two cores to a CPU.
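
    Taking the two figures quoted above at face value (a 10% area premium for roughly double the throughput), the comparison with a second core is a one-line calculation. The perfect 2x scaling assumed for the dual-core case is generous, and the function name is made up.

```python
def throughput_per_area(speedup, area_factor):
    """Throughput per unit of die area, relative to the baseline core."""
    return speedup / area_factor


# SMT: about 2x throughput for 1.10x the area (figures quoted above).
smt = throughput_per_area(speedup=2.0, area_factor=1.10)
# Second core: at best 2x throughput for 2x the area (assumes perfect scaling).
dual_core = throughput_per_area(speedup=2.0, area_factor=2.00)
```

    On those assumptions SMT delivers roughly 1.8 times the throughput per unit area of the baseline, while the second core leaves that ratio unchanged.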

    Also note that it isn't recommended to run processes with different address spaces simultaneously on the processor because that would thrash the TLBs. It's only suggested that you let multithreaded apps (Oracle, perhaps future versions of Apache) load more than one thread into the processor at the same time.

  23. COPS, out-of-phase 6800's by dpilot · · Score: 2

    Back in the early-mid 70's people were taking two 6800 CPUs, wiring them out-of-phase, and essentially building tightly-coupled SMP systems. We didn't really have threading in those days, or else the correct OS could have made such a system SMT, instead.

    But someone else was designing another CPU, called the COPS. They looked at this well-known out-of-phase 6800 technique, and realized that their design basically used clock-up for fetch/decode, and clock-down for execute. During each half-cycle, half the CPU was sitting idle.

    So they doubled the registers, using the fetch/decode unit with one register-set during clock-down and the other register-set during clock-up. The execute unit worked in the converse fashion, alternating register sets. A dual CPU on a single chip for the cost of a second register set and a little control/arbitration logic. They didn't attempt any sophisticated contention-prevention, leaving that up to the software. This was mid-late 70's.
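
    The alternation described above can be sketched as a small Python model. This is a guess at the scheme from the description only, not taken from any COPS documentation: the names `schedule` and `utilization` are hypothetical, and the model ignores the startup half-cycle where the execute unit has nothing fetched yet.

```python
def schedule(num_contexts, half_cycles):
    """Return a list of (phase, fetch_ctx, exec_ctx) tuples, one per half-cycle.

    A context index of None means that unit is idle in that half-cycle.
    """
    if num_contexts not in (1, 2):
        raise ValueError("this toy model only covers 1 or 2 register sets")
    timeline = []
    for h in range(half_cycles):
        phase = "up" if h % 2 == 0 else "down"
        if num_contexts == 1:
            # Single register set: fetch/decode on clock-up, execute on
            # clock-down, so each unit idles every other half-cycle.
            fetch_ctx = 0 if phase == "up" else None
            exec_ctx = 0 if phase == "down" else None
        else:
            # Two register sets: the units swap contexts every half-cycle,
            # so neither unit is ever idle.
            fetch_ctx = h % 2
            exec_ctx = (h + 1) % 2
        timeline.append((phase, fetch_ctx, exec_ctx))
    return timeline


def utilization(timeline):
    """Fraction of unit-slots (two units per half-cycle) that are busy."""
    busy = sum((f is not None) + (e is not None) for _, f, e in timeline)
    return busy / (2 * len(timeline))


if __name__ == "__main__":
    print(utilization(schedule(1, 8)), utilization(schedule(2, 8)))
```

    In this model the two units never touch the same register set in the same half-cycle, which is why a second register set plus a little arbitration is all the extra hardware it needs.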

    With more modern software, COPS might have been the first SMT. I don't know the timeframe of the CDC6000, whether it beats mid-late 70's or not.

    --
    The living have better things to do than to continue hating the dead.
  24. strange by elegant7x · · Score: 2

    Well, maybe not by the time I get this up. Actually, win98 and ME don't support dual processors at all, so your second one will just be sitting on the motherboard turned off.

    As far as SMT goes, I think it's a good idea (well, obviously, why wouldn't it be). You really can only get so much out of instruction-level parallelism, and I've always thought that splitting CPU time up by thread rather than by instruction parallelism would be a lot more effective.

    Rate me on Picture-rate.com

    --

    "and dear god does this website suck now." -- CmdrTaco
  25. there is only default :P by elegant7x · · Score: 2

    Actually, every pic on there is 'default'. Right now, my pic is at http://picture-rate.com:8080/hello/viewpic.jsp?system=happy&owner=default&picture=i170, although that could change. I would link directly to it, but chad hasn't put in the ability to select which picture to rate; otherwise I would put that link in my sig.

    Rate me on Picture-rate.com

    --

    "and dear god does this website suck now." -- CmdrTaco
  26. Comment removed by account_deleted · · Score: 2

    Comment removed based on user account deletion

  27. Comment removed by account_deleted · · Score: 3

    Comment removed based on user account deletion

  28. As they say ... by Alien54 · · Score: 2
    Just peruse back issues of popular computer magazines and you will see that most new Intel processors were pigeonholed as "server solutions" until they inevitably migrated to the business and consumer PC markets. If the actual SMT performance data matches or exceeds the simulated results, then an SMT processor may become an increasingly attractive option for many buyers. SMT could also be the "shot in the arm" microprocessor companies are looking for. There is a malaise in the current market, along with growing consumer resistance to ever-increasing processor clock speeds. If the hardware and software portions of SMT come together, then the question of "why do I need a 1.5 GHz processor?" may be answered in very short order.

    Actually, while the first bit is true... let's face it, for most business applications we do not need faster machines until we have to deal with the bloat of the next round of software from the major vendors.

    Businesses could probably do very well on a single standardized set of software for a decade or more for most common functions. Many have done so, as a matter of fact. There are some businesses out there still running Win 3.1 apps.

    --
    "It is a greater offense to steal men's labor, than their clothes"
  29. Technology Imitates Life? by QuokkaNetGuru · · Score: 2

    Ah, at long last circuitry catches up with functionality that women have been laying claim to for aeons.

    --

    People who say it cannot be done should not interrupt those who are doing it.

  30. Too Many TLAs by QuokkaNetGuru · · Score: 2
    • SMT = Society for Music Theory
      (http://smt.ucsb.edu/smt-list/smt-main.html)
    • SMT = Surface Mount Technology
    • SMT = Simultaneous MultiThreading
    And (drumroll) of course ....

    the Finland Travel Bureau (http://www.smt.fi/)

    When will the madness end?
    --

    People who say it cannot be done should not interrupt those who are doing it.

  31. Multi Processors under Win9x by stretch_jc · · Score: 4
    why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup

    This is because Win9x does not support SMP, so even a dual 933MHz setup will be outperformed by a single 1GHz. A better comparison would be with Win2k or Linux, which will both actually use both CPUs.
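
    As a rough illustration of that point, here is a tiny Python sketch. The function `usable_mhz` is hypothetical, and adding clock rates together is only an idealized upper bound (real SMP scaling is never perfect); the one thing it is meant to show is that an OS without SMP support sees a single CPU no matter how many are installed.

```python
def usable_mhz(clock_mhz, num_cpus, os_supports_smp):
    """Idealized upper bound on aggregate clock the OS can put to work."""
    # Without SMP support the OS schedules onto one CPU; the rest sit idle.
    active_cpus = num_cpus if os_supports_smp else 1
    return clock_mhz * active_cpus


if __name__ == "__main__":
    # Win9x-style OS (no SMP): dual 933MHz vs. single 1GHz.
    print(usable_mhz(933, 2, False), usable_mhz(1000, 1, False))
    # SMP-capable OS: the second CPU actually counts.
    print(usable_mhz(933, 2, True))
```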