Is SMT In Your Future?
Dean Kent writes "Simultaneous MultiThreading is a method of getting greater throughput from a processor by essentially implementing multi-tasking within a single CPU. Information on Compaq's plans for SMT in the EV8 can be found in this article and this article. Also, there is some speculation that Intel's Foster CPU (based upon the Willamette core) will also have SMT and that the P4 may even have the circuitry already included, as discussed briefly in forums."
- Massively parallel/pipelined, à la today's processors
- SMT
- Multiple simple cores on-die
The MSC (forgot the real abbreviation, but that's what I'm going to call it) architecture had 4 simple, identical cores. Each core was somewhere between a Pentium and a K6 in terms of complexity (lean on scheduling logic, heavy on execution hardware), and each had an independent, decent-sized L1 cache. The MSC chip had a large on-die L2 cache, quad-ported or oct-ported, that all processing cores could access quickly and simultaneously, and a fat on-die L3 cache to boot. It also contained some special context caching mechanism. The cores could also execute in different contexts, not just within the same context as with SMT. This opens up parallelization across more than one process.
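To make the point about contexts concrete, here is a toy model, not a simulation of any real chip. It assumes four execution slots for both designs and models SMT the way this comment describes it (threads from a single context only). The function names and the workload are made up for illustration.

```python
from collections import defaultdict

# Assumption: four slots, matching the four cores of the MSC design described above.
SLOTS = 4

def msc_schedule(runnable):
    """Fill up to SLOTS cores with runnable (process, thread) pairs from any process."""
    return runnable[:SLOTS]

def same_context_schedule(runnable):
    """Fill up to SLOTS thread slots, but only with threads of one process
    (the process with the most runnable threads)."""
    by_process = defaultdict(list)
    for process, thread in runnable:
        by_process[process].append((process, thread))
    if not by_process:
        return []
    best = max(by_process.values(), key=len)
    return best[:SLOTS]

# Hypothetical workload: three processes, one of them with two threads.
runnable = [("editor", 0), ("compiler", 0), ("compiler", 1), ("mp3", 0)]
print(len(msc_schedule(runnable)), "slots busy with cross-process scheduling")
print(len(same_context_schedule(runnable)), "slots busy with same-context scheduling")
```

With a mixed workload like this one, the cross-process scheduler keeps every slot busy, while the same-context one is limited by how many threads a single process happens to have.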
One of the more interesting problems in a billion transistor chip is the wire delay. With processes so small that a billion transistors can be put on a moderate-sized die, the clock rate is so high that the wire delay from one side of the chip to the other can be over 100 clock cycles! So locality of information becomes extremely important. With multiple, simple processing cores, all the logic for the pipeline is close together. The data is readily available in L1 cache. The scheduling logic has mostly been handled outside the cores, so all they have to do is crunch numbers within their context as fast as possible. They don't have to worry about sending or receiving signals from far across the chip and the resulting delay, so everything is local and fast.
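The arithmetic behind a figure like "100 clock cycles" is just signal travel time divided by the clock period. Here is a rough sketch. The distance, per-millimeter delay, and clock rate are made-up illustrative values, not numbers from the presentation, and real wire delay does not scale this simply with length.

```python
def wire_delay_cycles(distance_mm, delay_ps_per_mm, clock_ghz):
    """Clock cycles that elapse while a signal crosses distance_mm of wire.

    Assumes delay grows linearly with distance (a simplification).
    """
    delay_ps = distance_mm * delay_ps_per_mm   # total travel time in picoseconds
    period_ps = 1000.0 / clock_ghz             # one clock period in picoseconds
    return delay_ps / period_ps

# Hypothetical numbers, chosen only to show the scale of the effect.
cross_chip = wire_delay_cycles(distance_mm=20, delay_ps_per_mm=500, clock_ghz=10)
within_core = wire_delay_cycles(distance_mm=1, delay_ps_per_mm=500, clock_ghz=10)
print(f"across the die: {cross_chip:.0f} cycles, within a core: {within_core:.0f} cycles")
```

Under these assumptions a cross-die signal costs twenty times as many cycles as one that stays inside a small core, which is the argument for keeping each core's logic and data local.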
Additionally, it's the least complex chip to design. Only one processing core needs to be designed and tested, since it's duplicated 4 times. The core is much simpler than in the other designs. The scheduling logic is also much simpler and easier to test. Most of the die space is devoted to localized caches and execution units, not scheduling logic.
In the benchmarks, the SMT and MSC processors vastly outperformed a conventional massively pipelined/parallel billion transistor processor. And the MSC performed 20+% better (on average) than the SMT processor.
On top of all that, to get the best performance from SMT processors you need very smart compilers that are able to find parallelizable code and generate the binary for it. With MSC this isn't a problem. It'll run multi-threaded code simultaneously, but it'll also run multiple processes, or any combination of processes and threads, simultaneously without help from smart compilers.
Ryan Earl
Student of Computer Science
University of Texas