Is SMT In Your Future?
Dean Kent writes "Simultaneous MultiThreading is a method of getting greater throughput from a processor by essentially implementing multi-tasking within a single CPU. Information on Compaq's plans for SMT in the EV8 can be found in this article and this article. Also, there is some speculation that Intel's Foster CPU (based upon the Willamette core) will also have SMT and that the P4 may even have the circuitry already included, as discussed briefly in forums."
Actually, that's pretty much what the Pentium Pro (ergo p2, p3, celeron, celeron2 and the p4) do - only there it's done using "virtual registers" which means that the register "eax" can map to a completely different physical register if the instruction scheduler needs it to.
For example, you could write your code like this:
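mov   ebx, Pointer           ; ebx = address stored in Pointer
movzx ecx, word ptr [ebx]    ; ecx = 16-bit offset found there, zero-extended
mov   eax, [ebx+ecx]         ; eax = dword at that address plus the offset
mov   Pointer2, eax          ; store the result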
(now I'm pretty sure that's not the best way to do it - it's just an example, ok?)
Now, if you have another multi-instruction operation after this and it's going to use any of the registers used above, the CPU will see in the decoding phase that "a-ha! eax has received a value that doesn't depend on itself (i.e. a completely new value)" and will assign a different physical register to "eax" until it's overwritten again. (This is also the reason why xor reg,reg is not the preferred way of clearing a register on the ppro and up.) Same for ebx and ecx and the other regs. By the time the CPU is finished decoding these instructions (this would take 1 and 1/3 cycles for ppro through p3 and 1 cycle for the p4, due to the 4-1-1-1 decoders), the reorder buffer (which receives the decoded instructions, also called micro-ops or uops) will have been filled up with previously decoded instructions and will be able to put as many uops into the execution "ports" as possible (3 per cycle in ppro through p3, not sure about the p4).
This, of course, assumes that the code is organised so that the decoders can feed the reorder buffer with more than 3 micro-ops per decoding cycle, so that there's something to reorder. But this will, for the most part, take care of that data-dependency problem.
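To make the renaming idea above a bit more concrete, here is a minimal sketch in Python. It is a toy model, not the actual P6 hardware: the rename() function, the "p0, p1, ..." physical register names and the four-instruction program are all made up for illustration, and a real CPU has a finite pool of physical registers that it recycles when instructions retire.

```python
# Toy register renamer: a hypothetical model, not the real P6 rename logic.
from itertools import count

def rename(instructions):
    """Give every register write a fresh physical register.

    Each instruction is (dest, sources). Sources are read through the
    current map; the destination then gets a brand-new physical register.
    """
    fresh = count()      # endless supply of physical register numbers (assumption)
    mapping = {}         # architectural name -> current physical name
    renamed = []
    for dest, sources in instructions:
        phys_sources = []
        for s in sources:
            if s not in mapping:               # value defined before this snippet
                mapping[s] = f"p{next(fresh)}"
            phys_sources.append(mapping[s])
        # A write starts a new value, so it gets a new physical register.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], tuple(phys_sources)))
    return renamed

# Two unrelated computations that both reuse the names "ebx" and "eax".
program = [
    ("ebx", ()),          # mov ebx, Pointer
    ("eax", ("ebx",)),    # mov eax, [ebx]
    ("ebx", ()),          # mov ebx, Other   (completely new value)
    ("eax", ("ebx",)),    # mov eax, [ebx]
]
renamed = rename(program)
for dest, srcs in renamed:
    print(dest, "<-", srcs)
```

After renaming, the second pair of instructions touches different physical registers than the first pair, so a scheduler would be free to run the two pairs in either order or side by side.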
Personally, I prefer explicit register setting (à la PowerPC, 32 int regs + 32 fp regs) so that the CPU won't have to schedule instructions for me...
(all this information, except for the p4 decoder uop-max series, comes from the excellent pentopt.txt file.)
The Tera MTA requires a compiler to multi-thread all processes. You only get 1 functional unit (and huge latencies == terrible speed) if your program can't be transformed by the compiler.
SMT, in contrast, can work on programs which can't be multi-threaded by a compiler. It works on "instruction level parallelism" (ILP). This is a much finer grain than parallelism that a compiler can find and exploit with another thread.
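A rough way to picture the point above is to count issue slots. The sketch below is a toy model under stated assumptions: the issue width of 4, the per-cycle "ready instruction" counts and the issued_per_cycle() helper are all invented for illustration, threads are served in fixed priority order, and instructions that miss a slot are not carried over to the next cycle.

```python
def issued_per_cycle(ready_by_thread, width):
    """Count how many issue slots get filled in each cycle.

    ready_by_thread holds one list per hardware thread; entry c is the
    number of independent instructions that thread has ready in cycle c.
    Threads are served in list order until the slots run out.
    """
    cycles = len(ready_by_thread[0])
    filled = []
    for c in range(cycles):
        free = width
        for ready in ready_by_thread:
            take = min(ready[c], free)   # a thread can only use free slots
            free -= take
        filled.append(width - free)
    return filled

WIDTH = 4                     # hypothetical issue width
thread_a = [2, 0, 1, 3]       # made-up ILP per cycle; 0 = stalled (e.g. cache miss)
thread_b = [1, 2, 0, 2]

alone = issued_per_cycle([thread_a], WIDTH)
smt = issued_per_cycle([thread_a, thread_b], WIDTH)
print("thread A alone:", alone, "=", sum(alone), "slots used")
print("A and B via SMT:", smt, "=", sum(smt), "slots used")
```

Neither thread was rewritten or split up by a compiler here. The second thread just soaks up slots the first one leaves idle, which is the sense in which SMT feeds on whatever instruction-level parallelism is already there.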
Having gone down the route of doing a paper design for an SMT I know that one of the real problems with SMT in traditionally piped CPUs (i.e. non-OO) is that with today's deep pipelining the cost of thread switches is really high - often to the point of being useless.
The alternative (SMP) is good for other reasons - you can potentially reduce the size of the synchronous clock domains on a die, and design time may be lower (build one and lay out 8). The downsides have to do with memory architectures (cross bars, buses, cache paths etc.).
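As a back-of-the-envelope illustration of that switch cost, assume (purely for the sake of argument) that every thread switch throws away one full pipeline's worth of cycles, and that a thread runs for a fixed number of useful cycles between switches. Both the penalty model and the numbers below are assumptions, not measurements from any real design.

```python
def useful_fraction(run_length, pipeline_depth):
    """Fraction of cycles doing useful work if each thread switch costs
    one full pipeline refill (an assumed, simplified penalty)."""
    return run_length / (run_length + pipeline_depth)

# Hypothetical numbers: 10 useful cycles between switches.
for depth in (5, 10, 20):
    print(f"depth {depth:2d}: {useful_fraction(10, depth):.2f} of cycles useful")
```

With the run length held fixed, every extra pipeline stage is pure overhead per switch, so in this model the deeper the pipe, the less there is to gain from switching threads.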
- Massively Parallel/Pipelined ala today's processor
- SMT
- Multiple simple core on-die
The MSC (forgot the real abbreviation, but that's what I'm going to call it) architecture had 4 simple, identical cores. Each core was somewhere between a Pentium and a K6 in terms of complexity--lean on scheduling logic, heavy on execution hardware--each with an independent, decent sized L1 cache. The MSC chip had a large on-die L2 cache, quad-ported or oct-ported, that all processing cores could access quickly and simultaneously, and a fat L3 cache to boot on-die. It also contained some special context caching mechanism. The cores are actually able to execute in different contexts as well, not just within the same context as with SMT. This opens up parallelization across more than one process.
One of the more interesting problems in a billion transistor chip is the wire delay. With processes so small that a billion transistors can be put on a moderate sized die, the clock rate is so high that the wire delay from one side of the chip to the other can be over 100 clock cycles! So locality of information becomes extremely important. With multiple, simple processing cores, all the logic for the pipeline is close together. The data is readily available in L1 cache. The scheduling logic has been mostly handled outside the cores; all they have to do is crunch numbers within their context as fast as possible. They don't have to worry about sending/receiving signals from very far on the chip and the resultant delay, so everything is local and fast.
Additionally, it's the least complex chip to design. Only one processing core needs to be designed and tested since it's duplicated 4 times. The core is much simpler than other designs. The scheduling logic is all much simpler and easier to test. Most of the die space is devoted to localized caches and execution units, not scheduling logic.
In the benchmarks the SMT and MSC processors vastly outperformed a conventional massively pipelined/parallel billion transistor processor. And the MSC performed 20+% better (on average) than the SMT processor.
On top of all that, to get the best performance from SMT processors you need very smart compilers that are able to find parallelizable code and generate the binary for such. With MSC this isn't a problem. It'll run multi-threaded code simultaneously, but it'll also run multiple processes or any combination of both processes and threads simultaneously without help from smart compilers.
Ryan Earl
Student of Computer Science
University of Texas