Emergence of SMT

Posted by Hemos on Wednesday March 14, 2001 @05:45PM from the interesting-approach-to-things dept.

yellow writes "SMT, or Simultaneous Multithreading, is a concept that is rapidly gaining adherents in the microprocessor area. It essentially allows for a single processor with multi-processor capabilities in both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism). When comparing SMT vs. dual or multiprocessor performance data, it is important to compare apples to apples, and understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup. This is the discussion topic of a new feature on HWC."

Time for some education on computer architecture.. by slothbait · 2001-03-14 14:18 · Score: 5

> SMT? Blow it out your ass.

Rather unnecessary, don't you think? I don't think you would be quite so hostile to SMT if you had a better understanding of it.

Time to do some informing...

> Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?

This statement doesn't make much sense. I think you mean that a CPU which spent the die area on additional functional units as opposed to SMT support could achieve a greater MIPS value. This is true, but only for theoretical MIPS. Simply adding more functional units to modern chips would *not* improve actual performance. Explanation follows...

> The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.

So you are stating that implementing SMT comes at a cost in die area? Of course it does, but the important point is that using that die area instead to add more conventional execution units would *not* increase the performance of the processor. Why, you ask? There is a limited amount of instruction level parallelism available in a single thread of program execution. Current wide-issue superscalars get something on the order of 2.3 dispatches per clock, despite the fact that they have the *capability* to issue far more. The processor simply cannot find enough independent instructions to keep its functional units full. If memory serves, the Athlon is a 9-issue core. You could add functional units up to 12-issue or more, but your actual dispatch rate would still be around 2-3 per clock. While your theoretical performance would increase, your actual performance would remain stagnant.
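A toy model can make the "wider core, same dispatch rate" point concrete. The sketch below is not a model of the Athlon or any real core: it assumes unit latency, an unlimited scheduling window, and a made-up `program` built from three interleaved dependency chains, so no more than three instructions are ever independent at once. The function name `cycles_to_issue` is hypothetical.

```python
def cycles_to_issue(deps, width):
    """Cycles needed to issue every instruction on a core of the given width.

    deps[i] lists the indices of earlier instructions whose results
    instruction i needs. Every instruction takes one cycle (an assumption).
    """
    done = set()                       # instructions issued in earlier cycles
    pending = list(range(len(deps)))
    cycles = 0
    while pending:
        cycles += 1
        # Oldest-first pick of instructions whose producers have all finished.
        ready = [i for i in pending if all(d in done for d in deps[i])]
        issued = ready[:width]
        done.update(issued)
        pending = [i for i in pending if i not in done]
    return cycles


# Hypothetical thread: three interleaved dependency chains, so at most
# three instructions are ever independent at the same time.
program = [[i - 3] if i >= 3 else [] for i in range(30)]

for width in (1, 2, 3, 9, 12):
    cycles = cycles_to_issue(program, width)
    print(f"width {width:2d}: {cycles} cycles, {len(program) / cycles:.2f} per clock")
```

In this model the cycle count stops improving once the width reaches the parallelism available in the thread; going from 9-wide to 12-wide changes nothing.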

So, current software does not exhibit enough parallelism to keep the functional units in even current processors busy. SMT proposes to increase available parallelism by issuing instructions from *multiple* threads at once. Instructions from different threads are guaranteed to be independent, so if you have n threads running at once, your number of available instructions for dispatch each clock is improved about n times. Of course, this method has a cost in complexity and area -- now the CPU has to have knowledge of threads, and keep a process table on die. However, provided many threads are run at once, this *greatly* increases the utilization of the processor's resources, and thus the performance of the part.
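The "about n times" claim can be sketched with a very coarse throughput model. Everything here is an assumption for illustration: each thread is taken to offer at most `per_thread_ilp` independent instructions per clock, the core has `width` issue slots, threads never stall, and lower-numbered threads get first pick of the slots. `smt_ipc` is a hypothetical name, not anything from a real simulator.

```python
def smt_ipc(num_threads, per_thread_ilp, width, instrs_per_thread=1200):
    """Average instructions issued per clock across all threads."""
    remaining = [instrs_per_thread] * num_threads
    cycles = 0
    while any(remaining):
        cycles += 1
        slots = width                  # issue slots left in this cycle
        for t in range(num_threads):
            # A thread can fill at most per_thread_ilp slots per cycle,
            # and only as many as the core still has free.
            take = min(per_thread_ilp, remaining[t], slots)
            remaining[t] -= take
            slots -= take
    return num_threads * instrs_per_thread / cycles


# Assumed numbers: a 9-wide core and threads that each sustain 2 per clock.
for n in (1, 2, 4, 8):
    print(f"{n} thread(s): {smt_ipc(n, per_thread_ilp=2, width=9):.2f} per clock")
```

With these assumed numbers, throughput scales with the thread count until the issue width becomes the limit, which is the point at which extra functional units would start to pay off again.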

> Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.

...this is a valid point. SMT particularly increases the burden on instruction fetch and cache, since it is pulling from several different streams at once. However, there are methods that can somewhat compensate for the contention of resources introduced. Now, you have multiple threads available at all times. So, when one thread stalls on a cache miss, the processor can dispatch a different thread to run while the cache miss on the first is being serviced. This effectively hides the latency of the cache miss since the processor is able to do useful work during the service. You see, it's all about keeping those functional units busy.
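The latency-hiding argument can be sketched with a deliberately simple model: a single-issue core that runs one ready thread per cycle and picks another whenever the current one is waiting on a miss. The parameters `run_cycles` and `miss_latency` are made-up numbers, and a real SMT core issues from several threads in the same cycle rather than switching between them, so treat this only as an illustration of the utilization effect.

```python
def utilization(num_threads, run_cycles, miss_latency, total_cycles=10_000):
    """Fraction of cycles in which the core does useful work.

    Each thread computes for run_cycles cycles, then stalls on a cache
    miss for miss_latency cycles, and repeats (an assumed pattern).
    """
    ready_at = [0] * num_threads       # cycle at which each thread may run again
    progress = [0] * num_threads       # cycles run since the thread's last miss
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                busy += 1
                progress[t] += 1
                if progress[t] == run_cycles:
                    # The thread misses in the cache and sits out the latency.
                    progress[t] = 0
                    ready_at[t] = cycle + 1 + miss_latency
                break                  # one thread per cycle in this model
    return busy / total_cycles


for n in (1, 2, 4):
    print(f"{n} thread(s): {utilization(n, run_cycles=10, miss_latency=30):.0%} busy")
```

With one thread the core idles for the whole miss; with enough threads there is always someone ready to run, and the miss latency disappears from the utilization figure even though each individual miss takes just as long.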

> So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..

If you believe this, then you should be pro-SMT. SMT doesn't address increasing potential instructions performed per second. Instead, it is an attempt to close the gap between *actual* performance and *theoretical* performance by keeping more of your processor busy.

> you have to question the thinking behind such a modification.

SMT != ILP != multiple pipes. by Christopher Thomas · 2001-03-14 13:16 · Score: 5

> I thought that the major processor companies had been working with multiple execution pipelines for years now. Doesn't that fall under the category of ILP?

You might want to doublecheck the terms you're using:
- "ILP" is "instruction-level parallelism". It's not a physical part of the chip - it's a quality of the instruction stream. ILP is the number of instructions (usually average) that could theoretically be executed at one time, without violating data relationships within the program. Modern processors _can_ execute multiple instructions per clock because the ILP of most programs is greater than one (i.e. there are usually multiple instructions that can be executed without violating data or control dependencies).
- "Multiple pipes" is part of the hardware that allows processors to issue multiple instructions per clock. As the name implies, this represents multiple hardware units that are capable of performing operations independently of each other.
- "SMT" is "Simultaneous Multithreading". Remember how back under ILP, I said that the number of instructions that can be issued per clock depends on the parallelism of the program being run? SMT boosts the parallelism by running two threads at the same time and interleaving their instructions (more or less). As the instructions from different threads usually don't care what the other threads are doing, this gives you many more instructions that can be executed at the same time (assuming you have enough hardware to execute them).
Multiple pipes are a relatively old idea. Ditto instruction-level parallelism, which is one of the analytical quantities used to judge how well multiple pipes will work in a given situation. SMT is a relatively new idea that lets you easily boost the instruction-level parallelism, which in turn makes scheduling and issuing instructions *much* easier.
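To make "ILP is a quality of the instruction stream" concrete, here is a minimal sketch that measures it as instruction count divided by the length of the longest dependency chain. That definition, the unit-latency assumption, and the names `ilp` and `run_together` are illustrative choices, not taken from the comment above. The tiny four-instruction `thread` is hypothetical.

```python
def ilp(deps):
    """Average parallelism of a non-empty instruction stream.

    deps[i] lists the earlier instructions that instruction i depends on.
    The result is instruction count divided by the longest dependency
    chain, assuming every instruction takes one cycle.
    """
    depth = []
    for producers in deps:
        depth.append(1 + max((depth[p] for p in producers), default=0))
    return len(deps) / max(depth)


def run_together(a, b):
    """Combine two independent threads into one pool of instructions.

    Instructions of b are appended after a, with their dependency
    indices shifted so they still point at b's own instructions.
    """
    shift = len(a)
    return a + [[p + shift for p in producers] for producers in b]


# Hypothetical thread: a = x + y; b = z * w; c = a + b; d = c * 2
thread = [[], [], [0, 1], [2]]

print(ilp(thread))                         # one thread on its own
print(ilp(run_together(thread, thread)))   # two such threads side by side
```

The dependency chain is no longer when a second independent thread is added, but there are twice as many instructions to spread along it, so the measured parallelism doubles.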

Re:"Bollocks" ? by Christopher Thomas · 2001-03-14 13:26 · Score: 5

> It's not true that doubling L1$ and adding a selection bit costs you nothing. In fact, the size of L1$ is rather limited, and cutting size in half substantially increases the miss rate. It is also fairly expensive to add selection bits.

Um, no.

Most of your die is taken up by the _L2_ cache. You have plenty of space to add more L1 cache. The reason you usually don't is that a larger L1 cache served by the same set of address lines has longer latency. Two independent duplicates of an L1 cache will behave identically to the original L1 cache.

Performing the selection adds latency, but this can be masked because you know the value of the selection bit long before you know the value of the address to fetch.

In fact, you'd almost certainly _reduce_ the cache load compared to a single-threaded processor capable of issuing the same number of loads per clock, because they'd be hitting different caches, and you wouldn't have to multiport.
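As a rough illustration of the duplicated-cache idea, the sketch below models each L1 as a toy direct-mapped cache and uses the thread ID as the selection bit. The cache size, the access traces, and the names `DirectMappedCache` and `count_hits` are all assumptions made up for the example; the traces are contrived so that the two threads collide in a shared cache, which is an added contrast and not something claimed above.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks one tag per line."""

    def __init__(self, num_lines):
        self.tags = [None] * num_lines

    def access(self, line_addr):
        index = line_addr % len(self.tags)
        hit = self.tags[index] == line_addr
        self.tags[index] = line_addr       # fill the line on a miss
        return hit


def count_hits(cache_for_thread, trace):
    """Replay (thread_id, line_addr) pairs and count hits per thread."""
    hits = {}
    for thread_id, line_addr in trace:
        hit = cache_for_thread(thread_id).access(line_addr)
        hits[thread_id] = hits.get(thread_id, 0) + hit
    return hits


# Contrived traces: both threads loop over four lines that map to the
# same cache indices in an 8-line cache.
t0 = [line for _ in range(5) for line in range(0, 4)]
t1 = [line for _ in range(5) for line in range(8, 12)]
interleaved = [pair for a, b in zip(t0, t1) for pair in ((0, a), (1, b))]

single = DirectMappedCache(8)
hits_alone = count_hits(lambda tid: single, [(0, a) for a in t0])

shared = DirectMappedCache(8)
hits_shared = count_hits(lambda tid: shared, interleaved)

banks = [DirectMappedCache(8), DirectMappedCache(8)]   # thread ID selects a bank
hits_banked = count_hits(lambda tid: banks[tid], interleaved)

print(hits_alone, hits_shared, hits_banked)
```

In this model thread 0 sees exactly the same hits with its own bank as it does running alone on the original cache, because the other thread never touches its bank.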

> SMT also doesn't save you from cache miss latency. Out-of-order instruction issue saves you from that.

SMT, in any sane design, is used on an OOO core. An OOO core won't save you if your next set of instructions has a true dependence on the value being fetched from memory. SMT gives you a second thread with no data dependence on the stalled load, and hence plenty of instructions in the window that you can execute while waiting.

I'm having trouble seeing where your arguments are coming from. As far as most of the core's concerned, there's still only one (interleaved) instruction stream, just with less data dependence in it. This is scheduled and dispatched as usual.