Emergence of SMT
yellow writes "SMT, or Simultaneous Multithreading, is a concept that is rapidly gaining adherents in the microprocessor arena. It essentially allows for a single processor with multi-processor capabilities in both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism). When comparing SMT vs. dual or multiprocessor performance data, it is important to compare apples to apples, and understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup. This is the discussion topic of a new feature on HWC."
The processor world right now is 99% driven by embedded systems. You need to put multiple cores on the same die to boost embedded performance since there's not enough room for multiple processors. Now that PCs are dead, we're seeing a spike in SMT press releases even though the technology has been floating around forever.
The fact that Intel's last processor release was a "mobile processor", Intel's future x86 vapormap includes pure SMT chips, and Compaq's future vapormap includes an SMT alpha shows how important that size reduction is.
--
> SMT? Blow it out your ass.
Rather unnecessary, don't you think? I don't think you would be quite so hostile to SMT if you had a better understanding of it.
Time to do some informing...
> Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?
This statement doesn't make much sense. I think you mean that a CPU which spent the die area on additional functional units as opposed to SMT support could achieve a greater MIPS value. This is true, but only for theoretical MIPS. Simply adding more functional units to modern chips would *not* improve actual performance. Explanation follows...
> The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.
So you are stating that implementing SMT comes at a cost in die area? Of course it does, but the important point is that using that die area instead to add more conventional execution units would *not* increase the performance of the processor. Why, you ask? There is a limited amount of instruction level parallelism available in a single thread of program execution. Current wide-issue superscalars get something on the order of 2.3 dispatches per clock, despite the fact that they have the *capability* to issue far more. The processor simply cannot find enough independent instructions to keep its functional units full. If memory serves, the Athlon is a 9-issue core. You could add functional units up to 12-issue or more, but your actual dispatch rate would still be around 2-3 per clock. While your theoretical performance would increase, your actual performance would remain stagnant.
So, current software does not exhibit enough parallelism to keep the functional units in even current processors busy. SMT proposes to increase available parallelism by issuing instructions from *multiple* threads at once. Instructions from different threads are guaranteed to be independent, so if you have n threads running at once, your number of available instructions for dispatch each clock is improved about n times. Of course, this method has a cost in complexity and area -- now the CPU has to have knowledge of threads, and keep a process table on die. However, provided many threads are run at once, this *greatly* increases the utilization of the processor's resources, and thus the performance of the part.
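To put rough numbers on the argument above, here is a toy model of issue-slot use in a single clock. It is only a sketch: the 9-wide issue figure is the comment's own recollection of the Athlon, the 2 ready instructions per thread is an assumed round number in the "2-3 per clock" range mentioned above, and the function name is made up for illustration.

```python
def issued_per_cycle(issue_width, ready_per_thread):
    """Instructions issued in one clock: all ready work, capped by issue width.

    ready_per_thread holds the number of independent, ready-to-issue
    instructions each hardware thread can offer this clock (assumed values).
    """
    return min(issue_width, sum(ready_per_thread))


ISSUE_WIDTH = 9        # assumption: the comment's recollection of the Athlon
READY_PER_THREAD = 2   # assumption: roughly the 2-3 dispatches per clock cited

# One thread: widening the core from 9 to 12 issue slots changes nothing.
single_9 = issued_per_cycle(ISSUE_WIDTH, [READY_PER_THREAD])
single_12 = issued_per_cycle(12, [READY_PER_THREAD])

# Four threads on the same 9-wide core: far more slots get filled.
smt_4 = issued_per_cycle(ISSUE_WIDTH, [READY_PER_THREAD] * 4)

# Five threads: now the issue width, not the software, is the limit.
smt_5 = issued_per_cycle(ISSUE_WIDTH, [READY_PER_THREAD] * 5)

print(single_9, single_12, smt_4, smt_5)
```

The model ignores fetch bandwidth, cache contention and the die area SMT itself costs, so it shows only the "more independent instructions to choose from" half of the trade-off.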
> Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.
...this is a valid point. SMT particularly increases the burden on instruction fetch and cache, since it is pulling from several different streams at once. However, there are methods that can somewhat compensate for the contention of resources introduced. Now, you have multiple threads available at all times. So, when one thread stalls on a cache miss, the processor can dispatch a different thread to run while the cache miss on the first is being serviced. This effectively hides the latency of the cache miss since the processor is able to do useful work during the service. You see, it's all about keeping those functional units busy.
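The latency-hiding effect can be sketched with a small cycle-by-cycle simulation. This is a simplified model, not a description of any real chip: it lets only one thread use the core per cycle (closer to switch-on-stall multithreading than full SMT), and the 10 cycles of work per 30-cycle miss are assumed numbers chosen to make the effect visible.

```python
def busy_cycles(n_threads, compute, miss_latency, total_cycles):
    """Count cycles in which the core does useful work.

    Each thread alternates `compute` cycles of work with a cache miss that
    takes `miss_latency` cycles to service. Each cycle, the first thread
    that is not waiting on a miss gets the core.
    """
    ready_at = [0] * n_threads           # cycle when each thread's miss is done
    remaining = [compute] * n_threads    # work left before the next miss
    busy = 0
    for cycle in range(total_cycles):
        for t in range(n_threads):
            if ready_at[t] <= cycle:
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:    # thread now misses in the cache
                    ready_at[t] = cycle + 1 + miss_latency
                    remaining[t] = compute
                break                    # only one thread runs per cycle
    return busy


TOTAL = 400
for threads in (1, 2, 4):
    busy = busy_cycles(threads, compute=10, miss_latency=30, total_cycles=TOTAL)
    print(threads, "thread(s):", busy / TOTAL, "utilization")
```

With these assumed numbers a single thread keeps the core busy a quarter of the time, and four threads are enough to cover the miss latency completely. Real threads also compete for the cache itself, which this model leaves out.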
> So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..
If you believe this, then you should be pro-SMT. SMT doesn't address increasing potential instructions performed per second. Instead, it is an attempt to close the gap between *actual* performance and *theoretical* performance by keeping more of your processor busy.
> you have to question the thinking behind such a modification.
That's true only if you have sufficient hardware registers (not to be confused with architectural registers), and your tasks aren't bottlenecked on memory. If you have truly independent tasks, then each will effectively see half as much cache at all levels of the hierarchy. This can get especially painful in L1I -- if you can't keep the CPU fed from both streams, oops! And, large register files can be a speed limiter in the architecture. (On the plus side, though, the hardware register files can be distinct between the threads, so it's not too bad. As I understand it, it's the unified register files that are the real problem.)
One of the main attractive things I see in SMT is that you can effectively make the pipeline deeper on the architecture while completely hiding it. This is important on VLIW-style architectures that have an exposed pipeline. To make it deeper, you need to somehow hide the fact that it's deeper than the code thinks it is. One way is to add interlocks (gradually making stages of the pipeline protected, rather than exposed). Another is to interleave multiple threads, so that each thread sees the pipeline length it expects, but the actual pipeline is some factor larger.
Of course, in this VLIW world, things can get tricky outside the CPU. Because you lack superscalar issue (that's what VLIW's about), the issue of stalls becomes a problem. "One-stall-all-stall" is an oft-mentioned SMT VLIW technique, and with it, you really need to make sure you aren't bottlenecked on memory before you go down the SMT path. "One-stall-all-stall" means if one SMT thread stalls, all threads stall... As I understand it, it's the "cheap" way to maintain the VLIW state in an SMT VLIW machine, but it also amplifies any memory system bottlenecks you might have.
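A rough way to see how "one-stall-all-stall" amplifies memory stalls is to count useful thread-cycles under both policies. The stall schedule below is invented purely for illustration, and the function name is hypothetical.

```python
def useful_slots(stall_schedule, one_stall_all_stall):
    """Count thread-cycles that do useful work.

    stall_schedule[c][t] is True if thread t is stalled in cycle c.
    With one-stall-all-stall, any stalled thread freezes every thread
    for that cycle; otherwise only the stalled thread loses the cycle.
    """
    useful = 0
    for cycle_stalls in stall_schedule:
        if one_stall_all_stall:
            if not any(cycle_stalls):
                useful += len(cycle_stalls)
        else:
            useful += sum(1 for stalled in cycle_stalls if not stalled)
    return useful


# Assumed schedule: two threads, four cycles, each thread stalls once.
schedule = [
    [False, False],
    [True, False],
    [False, True],
    [False, False],
]

independent = useful_slots(schedule, one_stall_all_stall=False)
lockstep = useful_slots(schedule, one_stall_all_stall=True)
print(independent, lockstep)
```

In this made-up schedule each thread stalls in only one cycle out of four, yet the lockstep machine loses half of its thread-cycles, which is the amplification described above.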
--Joe "Mr. VLIW"--
Program Intellivision!
Some would argue that EPIC (the ISA for IA64 chips like Itanium) is a fundamental change in hardware design - it combined VLIW (an old idea) with explicit encoding of inter-opcode dependency, among other things. That kind of explicit "helper" data for internal ILP engines could end up proving valuable, if the compiler technology can keep up.
The BOTTLENECK is the memory bus. Refilling the cache from a hose that's 1/10th the speed of the processor. That's why CISC lasted so long in the first place, and is still with us today. CISC has variable length instructions. If you can express an instruction in 8 bits, you do so. 16 for the more complex ones, 32 bits for the really complex ones. So when you're sucking data into the cache 32 bits at a time, you can get 2 or 3 instructions in a 32 bit mouthful. (Or, in the case of Pentiums, 64 bits to feed 2 cores, but the principle's the same.) You're optimizing for the real bottleneck with compressed instructions.
The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts. But Sparc, PowerPC, and even Alpha haven't displaced Intel because the real bottleneck is the memory bus, and bigger instructions aren't necessarily a win. (That and Intel translates CISC to RISC inside the cache, and pipelines stuff.)
VLIW as Itanium picked it up sucks so badly because the real bottleneck is sucking data from main memory, and now they want 192 bits of it per clock! For only three instructions, and on average at least one will probably be a NOP. Crusoe has a MUCH better idea, sucking compressed CISC instructions in and converting them to VLIW in the cache (like Pentium and friends do for CISC to RISC).
This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs. They don't deal with the real problem, the memory bus de-optimization by reverting to full-sized instructions all the time.
Rob
This is perhaps one of the most useful sites in today's world of technobabble: www.acronymfinder.com. It lists 19 different meanings for "SMT", none of which are Simultaneous MultiThreading! :-)
/ \
\ / ASCII ribbon campaign for peace
x
/ \
IMHO, SMT is a load. Modern microprocessors are mostly cache-starved. SMT puts two processors on the wrong side of the L1$, aggravating the cache bandwidth problem. Worse, the two processors in SMT degrade referential locality, further degrading the performance of the cache.
You overlook a couple of very important factors.
First of all, it would cost you almost no extra silicon or latency to have duplicate L1 caches, and to add a selection bit to the addresses sent out on memory operations.
Secondly, technologies like SMT help _save_ you when you have a cache miss, because you still have an instruction stream that can execute while one thread's waiting for data.
You might want to double-check the terms you're using:
Multiple pipes are a relatively old idea. Ditto instruction-level parallelism, which is one of the analytical quantities used to judge how well multiple pipes will work in a given situation. SMT is a relatively new idea that lets you easily boost the instruction-level parallelism, which in turn makes scheduling and issuing instructions *much* easier.
One of my favorite computer architectures is the CDC 6000 series. It had a Peripheral Processor (PP) that did all of the system I/O. The main CPU crunched numbers while the PP dealt with the outside world. The cool thing about the design of the PP was that it appeared to be 10 independent processors, even though it only had one ALU, instruction decoder etc. This was accomplished by a "barrel" of 10 sets of CPU registers and memory banks. The PP would rotate the barrel every time an instruction was fetched and executed, turning one physical CPU into 10 virtual CPUs. This meant that the PP could simultaneously execute 10 different programs without having 10 hardware CPUs. I've often wished there was a microprocessor that could do this. It would be great for embedded real-time systems and I/O controllers. Each I/O device and/or subsystem could have its own virtual CPU, that would never get swiped by other tasks or I/O interrupts.
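The barrel idea can be sketched in a few lines. This is a toy model, not the actual CDC design: the "instruction set" is just a list of numbers to add to an accumulator, and all names are hypothetical. What it does show is one shared execution loop stepping through 10 register contexts in strict rotation.

```python
def run_barrel(programs, rotations):
    """Run one instruction per context per rotation, in fixed order.

    programs: one list of integers per context; each integer stands in for
    an "add this to the accumulator" instruction (a made-up toy ISA).
    Returns the final contexts and the order in which slots executed.
    """
    contexts = [{"pc": 0, "acc": 0} for _ in programs]
    trace = []
    for _ in range(rotations):
        for slot, (ctx, program) in enumerate(zip(contexts, programs)):
            if ctx["pc"] < len(program):       # finished programs just idle
                ctx["acc"] += program[ctx["pc"]]
                ctx["pc"] += 1
                trace.append(slot)
    return contexts, trace


# Ten virtual processors, as in the description above; each adds its own
# slot number three times.
programs = [[slot] * 3 for slot in range(10)]
contexts, trace = run_barrel(programs, rotations=3)

print([ctx["acc"] for ctx in contexts])
print(trace[:10])
```

Because the rotation order is fixed, each virtual processor gets exactly one instruction every tenth slot no matter what the others are doing, which is the property that makes the scheme attractive for real-time I/O.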
Mea navis aericumbens anguillis abundat
While in many senses you are right, I think you are pointing to the wrong issue. It is not something inherent in the x86 arch that causes problems in scaling; it is mostly Intel's SMP bus design. Having all the CPUs share a single bus between each other and system memory is the bottleneck. I mean, look at the Athlon: it isn't riding on an Intel designed bus, it rides on a DEC designed EV6, originally made for the Alpha.
While Beowulf is a nifty technology, it does not solve all the scaling problems as you might think. Beowulf clusters are only useful for a specific subset of available problems, stuff that can be easily split up and sent to many, semi-independent, processing nodes. Beowulf clusters are generally connected together with 1Gb or 100Mb Ethernet which does not have high bandwidth or low latency compared to the CPU-Memory bus in even the cheapest computers. I would take a single 128 CPU box over 64 dual proc boxes connected via 1Gb Ethernet (or even Myrinet) any day.
-- Remember: Wherever you go, there you are!
> The BOTTLENECK is the memory bus.
OK.
> That's why CISC lasted so long in the first place, and is still with us today.
No way. CISC (i.e. x86) lasted so long because of duopoly action and backward compatibility. In fact, like you said, CISC is dead because ever since Pentiums, x86 chips have been RISC on the inside and CISC to the outside world (to varying degrees).
> The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts.
Or n at a time. Any OOO RISC processor these days worth its snot decodes 4 ops/clock, some are at 6 or 8. (If it can't retire that fast, it doesn't really matter...)
> Alpha haven't displaced Intel because the real bottleneck is the memory bus
Really? For scientific computing, which is where you have really big datasets and memory bandwidth is key? I don't think you see x86 there very much. You see DEC, IBM, Sun and HP. Who are all, surprise, surprise, RISC-based hardware vendors. Many RISC chips (Alpha, POWER, PowerPC) have long since passed x86 in sheer performance, especially on FP. Intel has definitely won in price/performance, but I would argue that's more due to volume than anything else.
> This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs.
Umm, Alpha EV8 uses SMT. Not VLIW. Itanium is VLIW-like. Doesn't use SMT. Example no worky.
Not saying that the concept is wrong, that SMT as a concept might alleviate some of the performance issues with superfluous instructions in a VLIW instruction stream. But that's sort of the point of VLIW, to let the compiler, rather than OOO hardware, figure out how to best use the available functional units as much as possible. It puts NOPs in to keep the instruction stream balanced so the decoder can work in a predictable way just like in RISC.
Say what? Most applications can't fill a deep pipe, even out-of-order and with aggressive prefetching. The ways this stuff wins include having two (or maybe more!) instruction streams to crunch, and switching away from the one that's now blocking on a memory access. Prefetch on the other surely completed already ...
The P4 is a good example of a pipeline that's too long.
And by the way, why has this taken so long to arrive? It's still not something I can purchase yet, and I first heard of it back in 1992. There's something fishy.
Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?
The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.
Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.
So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..
It's like putting a 700 cubic inch supercharged W16 engine constructed from 3 straight-8 blocks into a VW Kombi van.
Sure, it'll theoretically go pretty fast, but when it's parked by the side of the road 340 days out of the year and only ever driven by a bunch of hippies who are too stoned to see the road properly at 20 km/h, you have to question the thinking behind such a modification.
I gots ta ding a ding dang my dang a long ling long
Yikes. The EV8 will dissipate 250 watts! That's more than my monitor! Of course, watts are good ;) I want one (or four, this does SMP, right?)
A deep unwavering belief is a sure sign you're missing something...
- It includes 2 tightly coupled processor units
In SMT, you have multiple threads of execution (e.g. multiple PCs) feeding one CPU. Seriously though, threading like that is kind of at odds with today's very long pipelines (basically, the cost of a thread switch can be very high if you have to fill a deep pipe). With heavily out-of-order systems this can be less of a problem, but you're still stuck with the problem that if you're using a larger percentage of the CPU's real clocks, then you're going to put more pressure on shared resources like caches and TLBs. Larger L1s/TLBs are going to potentially hit CPU cycle time, and of course these days L2 can take a large percentage of your die size (after all, the goal here is to get more useful clocks/area).
For your bedtime reading, y'all.
--Seen
"I used to be a dilettante. Then I thought I'd try something else for a while."
For those with a technical bent who were disappointed by the lack of information on SMT in the linked article, here are some better resources:
Introduction to Simultaneous Multi-threading from UMass.
Quick Quiz on SMT.
Caches for Simultaneous Multithreaded Processors: An Introduction
My feelings shall be vindicated when SMP Athlon machines become readily available. Their comparatively minor bandwidth advantage will let them blow similarly-clocked Intel boxes out of the water.
Personally, I feel that the best way to scale x86 to supercomputing levels is through clustering, such as is offered by the venerable Beowulf for GNU/Linux. GNU/Linux, for better or worse, is continuing to grow in popularity, and I would like to see commercial software vendors try releasing Beowulf-enabled software for Linux. Imagine being able to buy Oracle for Beowulf! Okay, poor example; Oracle is memory-intensive rather than CPU-intensive, and an RDBMS is one application which is so dependent on a fast disk and good caching that the advantages pale in comparison to the potential problems. What would really be cool are Beowulf ports of statistical analysis and 3D-rendering software. Oooh, yeah... after all, The Matrix and Titanic have both proven the effectiveness of free x86 Unix-workalikes in render farms... I believe that those two movies respectively used FreeBSD and GNU/Linux.
--
I like to watch.
The point of SMT is that if one thread gets a cache stall because it has to hit main memory, then another execution thread has its instructions loaded into the CPU. SMT is actually one way to help reduce the CPU-memory bottleneck.
Pay particular note to the fact that you can take an existing superscalar chip and add SMT for only about a 10% chip real estate premium, while it should be able to double throughput. That's a lot better than trying to double throughput by adding another CPU to a machine or by adding two cores to a CPU.
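As back-of-the-envelope arithmetic, the claim above can be expressed as throughput per unit of die area. The inputs are the comment's own figures (about 10% extra area, up to 2x throughput); treating a second core as exactly 2x the area with a perfect 2x speedup is an added assumption that flatters the dual-core case.

```python
def throughput_per_area(speedup, area_factor):
    """Relative throughput per unit of die area, baseline chip = 1.0."""
    return speedup / area_factor


# Figures taken from the comment above: ~10% more area, up to 2x throughput.
smt = throughput_per_area(speedup=2.0, area_factor=1.10)

# Assumption: a second core doubles the area and perfectly doubles throughput.
dual_core = throughput_per_area(speedup=2.0, area_factor=2.0)

print(round(smt, 2), dual_core)
```

Even under that generous assumption for the second core, the SMT design comes out at roughly 1.8 times the throughput per unit area, and any shortfall from the best-case 2x speedup would shrink that gap.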
Also note that it isn't recommended to run processes with different address spaces simultaneously on the processor because that would thrash the TLBs. It's only suggested that you let multithreaded apps (Oracle, perhaps future versions of Apache) load more than one thread into the processor at the same time.
Back in the early-mid 70's people were taking two 6800 CPUs, wiring them out-of-phase, and essentially building tightly-coupled SMP systems. We didn't really have threading in those days, or else the correct OS could have made such a system SMT, instead.
But someone else was designing another CPU, called the COPS. They looked at this well-known out-of-phase 6800 technique, and realized that their design basically used clock-up for fetch/decode, and clock-down for execute. During each half-cycle, half the CPU was sitting idle.
So they doubled the registers, using the fetch/decode unit with one register-set during clock-down and the other register-set during clock-up. The execute unit worked in the converse fashion, alternating register sets. A dual CPU on a single chip for the cost of a second register set and a little control/arbitration logic. They didn't attempt any sophisticated contention-prevention, leaving that up to the software. This was mid-late 70's.
With more modern software, COPS might have been the first SMT. I don't know the timeframe of the CDC 6000, whether it beats the mid-late 70's or not.
The living have better things to do than to continue hating the dead.
Well, maybe not by the time I get this up. Actually, Win98 and ME don't support dual processors at all, so your second one will just be sitting on the motherboard turned off.
As far as SMT goes, I think it's a good idea (well, obviously, why wouldn't it be). You really can only get so much out of instruction level parallelism, and I've always thought that splitting CPU time up by thread rather than by instruction parallelism would be a lot more effective.
Rate me on Picture-rate.com
"and dear god does this website suck now." -- CmdrTaco
Actually, every pic on there is 'default'. Right now, my pic is at http://picture-rate.com:8080/hello/viewpic.jsp?system=happy&owner=default&picture=i170, although that could change. I would link directly to it, but chad hasn't put in the ability to select which picture to rate; otherwise I would put that link in my sig.
Actually, while the first bit is true... let's face it, for most business applications we do not need faster machines until we have to deal with the bloat of the next round of software from the major vendors.
Businesses could probably do very well on a single standardized set of software for a decade or more for most common functions. Many have done so, as a matter of fact. There are some businesses out there still running Win 3.1 apps.
"It is a greater offense to steal men's labor, than their clothes"
Ah, at long last circuitry catches up with functionality that women have been laying claim to for aeons.
People who say it cannot be done should not interrupt those who are doing it.
- SMT = Society for Music Theory
- SMT = Surface Mount Technology
- SMT = Simultaneous MultiThreading
And (drumroll) of course (http://smt.ucsb.edu/smt-list/smt-main.html)
the Finland Travel Bureau (http://www.smt.fi/)
When will the madness end?
People who say it cannot be done should not interrupt those who are doing it.
This is because Win9x does not support SMP, so even a dual 933MHz setup will be outperformed by a single 1GHz. A better comparison would be with Win2k or Linux, which will both actually use both CPUs.