Is SMT In Your Future?

Intel IXP1200 already does this (kinda) by profesor · 2000-12-28 03:46 · Score: 3

Intel's IXP1200 network processor already does something like this. It's a very spiffy little processor - one of these running at 166MHz can route IP packets at a Gb/s. Though you need to explicitly code your microcode this way, it's definitely not hidden from you like the Alpha chip will hide it.

Sounds like it may be time to make a new benchmark to cover multi-threaded processors.

Re:Just one question... why? by Greg+Lindahl · 2000-12-28 03:25 · Score: 3

Just because benchmarks are single threaded doesn't mean that they can't benefit from multiple execution units. Typical chips today (Pentium III, Alpha) have a lot more than 1 execution unit, and get a benefit from it most of the time.

The benefit of SMT over N smaller cpus is flexibility: A program that can use the entire chip at once is damn fast, or several programs can share it.

IBM's SMT by roca · 2000-12-28 05:03 · Score: 3

I've been told by an architecture grad student friend of mine, who should know, that IBM has an AS/400 system using a PPC-like core that does SMT.

Working on an architecture that does this... by barracg8 · 2000-12-28 06:35 · Score: 3

I'm involved in a project involving an SMT (well, more CMT) processor design. We do not yet have any silicon, but are getting good results in simulation.

We have a simulator that can be set to simulate a processor with any number of closely coupled cores, and any number of threads per core. We get good results at an 8 core * 4 threads setup (total up to 32 way parallel).

Using some basic automatic parallelization on a piece of code designed to run in a single thread, we have generated up to a 26X speedup, 8 core * 4 threads versus 1 core * 1 thread.
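For a sense of scale, the 26X figure can be turned into a parallel efficiency and, if one assumes Amdahl's law applies (an assumption made here for illustration, not something the post reports), into an implied serial fraction. The function names are made up for this sketch.

```python
def parallel_efficiency(speedup, ways):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / ways


def amdahl_serial_fraction(speedup, ways):
    """Serial fraction s that would produce this speedup under Amdahl's law.

    Amdahl: speedup = 1 / (s + (1 - s) / ways), solved here for s.
    """
    return (ways / speedup - 1) / (ways - 1)


ways = 8 * 4      # 8 cores * 4 threads per core
speedup = 26.0    # figure quoted above

print(f"efficiency: {parallel_efficiency(speedup, ways):.1%}")
print(f"implied serial fraction: {amdahl_serial_fraction(speedup, ways):.2%}")
```

That works out to roughly 81% of linear scaling and, under the Amdahl assumption, a serial fraction below 1%.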

The advantage of SMT over a normal processor is that it makes use of clock cycles that would otherwise be wasted, e.g. waiting for the cache to fill. If your architecture spends half of its time stalled, and you can make use of these cycles by adding SMT, then you can increase your processor performance very efficiently.

SMT basically requires you to duplicate all of the processor's registers n times (n = #threads), + a little extra hardware ('little', relative to duplicating the entire core). So for ((1 * core) + (2 * registers) + SMT hardware) you are getting the performance of ((2 * core) + (2 * registers)). Good bang per buck ratio, when you count up the transistors.

But SMT naturally gives you diminishing returns for each thread you add - the whole point is that each new thread is using up wasted cycles - and once you reach ~4 threads there are very few cycles left over. At this point, if you have room left over on the die, you may as well start thinking about SMP on the same die.
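The diminishing returns can be sketched with a toy model. Assume, purely for illustration, that each thread is stalled a fixed fraction of the time, independently of the others, and that the core does useful work whenever at least one thread is ready. Neither assumption comes from the simulator described above, and real threads also contend for caches and execution units.

```python
def utilization(stall_fraction, threads):
    """Fraction of cycles in which at least one thread is ready to run.

    Toy model: each thread is stalled independently with probability
    stall_fraction, so the core idles only when every thread is stalled.
    """
    return 1.0 - stall_fraction ** threads


stall = 0.5  # "half of its time stalled", as in the example above
previous = 0.0
for n in range(1, 9):
    u = utilization(stall, n)
    print(f"{n} threads: {u:.1%} busy, gain from last thread {u - previous:.1%}")
    previous = u
```

Under these assumptions four threads already keep the core busy about 94% of the time, so a fifth thread can add only a few percent.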

Surprised the article didn't mention SMT & AMD. Check out this link.

Re:Just one question... why? by jmaessen · 2000-12-28 05:22 · Score: 3

A lot of people are assuming that multiple processors can be put on the same die for "equal or less cost". This simply isn't true.

Sharing the cache is hard

Cache is the vast majority of chip area in a modern processor; as others have pointed out, it's obvious that multiple processors should share a cache. However, this is difficult. The problem is that every load/store unit from every processor must share the same cache bandwidth.

Thus, for a 2-way chip with only a shared cache, the memory bandwidth available to each processor (to the cache, the best possible case) is cut in half.

We can work around this by using various tricks up to and including multiported caches---but most of these tricks increase latency (lowering maximum clock speed) or require much more circuitry in the caches (we were sharing the cache because it was so big, remember?).

It makes much more sense to share the circuitry that feeds into the cache.

Those are the superscalar execution units! Thus, SMT.

Utilization

Instead of keeping half the execution units busy, we attempt to keep them all busy. Extrapolating very roughly from Figure 2 we can expect to issue about half as many instructions as we have issue slots (actually less if we have a lot of execution units). The basic idea is we can cut the number of empty issue slots in half each time we add a new thread. Further, instructions from separate threads do not need to be checked for resource overlaps---this circuitry is the main source of complexity in a modern processor.

What's happening now has been predicted for a long time. The extra resources (a bigger register set, TLB, extra fetch units) required for multithreading are now cheaper than the extra resources you'd need (mostly pipeline overlap logic) to get a similar increase in single-threaded performance.

SMT easier than SMP?

Moving thread parallelism into the processor is actually easier for the compiler and programmer; the weak memory models implied by cache coherence models aren't an issue when threads share exactly the same memory subsystem.

To get an idea for how hard it is to really understand weak memory models, consider Java (which actually tries to explain the problem to programmers---in every other language you're on your own). Numerous examples of code in the JDK and elsewhere contain an idiom---double-checked locking---which is wrong on weakly-ordered architectures. What's this mean? Your "portable" Java code will break mysteriously when you move it to a fast SMP. Alternatively, you will need to run your code in a special "cripple mode" which is extremely slow.

From a programmer's perspective, SMT (as opposed to SMP) architectures will be a godsend.

Just one question... why? by El · 2000-12-28 03:22 · Score: 3

Rather than run multiple simultaneous threads on a single massively complicated CPU with 8 instruction units, why not simply put 8 very simple CPUs on the same die (at equal or less cost) and just run SMP? Why is SMT considered a "win", when most benchmarks are single-threaded anyway? Seems like we're moving in the direction of complexity for complexity's sake here...

--

"Freedom means freedom for everybody" -- Dick Cheney

Re:Just one question... why? by john@iastate.edu · 2000-12-28 03:29 · Score: 3

Right, but the major stumbling block to just throwing more execution units at a CPU and letting the CPU and/or compiler schedule them is that after about 4 units or so you run out of work you can schedule from a single thread (the old "every fifth instruction is a branch" bugaboo).
So DEC's idea was, hell, grab some work from some other thread and do that.
Pretty cool, IMO.

--
Shut up, be happy. The conveniences you demanded are now mandatory. -- Jello Biafra

Re:Just one question... why? by aanantha · 2000-12-28 04:41 · Score: 3

Having a 4-way SMT single CPU is a lot cheaper than 4 separate processors. Basically, you can think of it as a bridge between SMP and single CPU. There aren't enough applications out there that utilize SMP enough for most people to want to spend money on multiple processors. And because so few people use multi-processors, there haven't been enough application developers willing to make their code multithreaded. A typical catch-22 situation. But supporting multiple register sets on a single CPU doesn't cost all that much, and there are already multiple functional units on superscalar processors.

So that brings us to a second reason. Wide-issue superscalar processors end up using very little of that issue width most of the time. You just can't get enough parallelism out of single threaded applications. SMT offers the ability to use that wasted issue width by scheduling different threads onto the wasted functional units.

A third benefit to SMT is that it drives the industry in the right direction. Writing code to take advantage of SMT is basically the same as for SMP. You want to find ways to break your application into separate threads. If SMT becomes a common feature on CPUs, then perhaps we'll have lots more SMP-favorable code. There will also be greater incentive to write efficient parallelizing compilers. CMP is more efficient than SMT for high levels of parallelism. So in future, people will probably be moving from SMT to CMP.

It's true that SMT doesn't help with standard single-threaded benchmarks. That's probably what's delayed the industry in adopting it. But the industry is finding out that it's running out of ways to speed up processors. Increasing clock rate isn't enough because your memory latency becomes a greater bottleneck. So increased parallelism becomes more and more crucial.

How is this different from Tera MTA ? by bmajik · 2000-12-28 04:02 · Score: 3

Many of you may not be familiar with TERA, the Seattle-based supercomputer company that bought CRAY from SGI and then renamed themselves CRAY for marketing reasons.

Tera's home-brew supercomputers used what they called the TERA MTA - Multi-threaded architecture processors. You could get a 4 proc MTA machine that would significantly outperform much larger supercomputers.

Essentially the MTA cpu has knowledge of 128 virtual threads of execution inside of it. AFAICT, the point of the MTA design, and apparently of this one, is to minimize the penalty for branches, context switches, etc, wherever possible by putting fine grained execution knowledge in the CPU itself.

Given that superscaling has reached its limit and superpipelining is getting nastier and nastier, this might be a good way to go. Apparently Tera gets great numbers with their MTA stuff.

--
My opinions are my own, and do not necessarily represent those of my employer.

Re:How is this different from Tera MTA ? by Greg+Lindahl · 2000-12-28 04:43 · Score: 4

The Tera MTA requires a compiler to multi-thread all processes. You only get 1 functional unit (and huge latencies == terrible speed) if your program can't be transformed by the compiler.

SMT, in contrast, can work on programs which can't be multi-threaded by a compiler. It works on "instruction level parallelism" (ILP). This is a much finer grain than parallelism that a compiler can find and exploit with another thread.

Other similar concepts by rjh3 · 2000-12-28 04:48 · Score: 3

There have been a variety of real world experiences with multi-threaded CPUs. Two of the more interesting are:

The Denelcor HEP. Only a few were made, and this dates way back to 1985, but it was a really neat multi-threaded CPU. It ran a variety of Unix, and had some reasonable extensions to adapt Fortran (even now probably the most popular number crunching language) to the multi-threaded CPU world.

The Alewife project at MIT. A variety of interesting ideas. To my knowledge, nothing past the prototypes was ever finished. The concepts of operation are fun to examine.

These are an interesting complement to the SMP approach.

250Watts!!!! by pallen · 2000-12-28 03:52 · Score: 3

I know it's slightly OT but it says in the article that each of these babies will consume 250 Watts. That's obscene. People run 8 processor boxes of these as well, so 2 kW just for the processors. You could heat a swimming pool with that.
--------
Make something idiot proof and someone will make a better idiot.

Yes, definitely... by Mtgman · 2000-12-28 03:36 · Score: 3

I intend for SMT to be an integral part of my daily life from here on. I just can't spend a day without thinking of all those beautiful babes doing all those naughty things... Oh, and you've done your usual crappy job of editing, there's supposed to be a U in that word.

Steven

--
-- I have marked myself unwilling to moderate-- I don't have other accounts to artificially inflate the karma of

Re:Finally, a chip you can *really* fry eggs on! : by muck1969 · 2000-12-28 06:25 · Score: 3

Add a heatsink as a grill and give it a 10 degree tilt with a fat collection tray ... get a half-witted aging sports star to endorse it and a funky name ...

" ... grandmas want them! college kids want them! ... " etc.

--
m.mmm..myyy ... sssissxxxtthh bbboottle offf mmmmmoouunnnttain ddeeewww.. in thhe pppassst ffffif

Re:Time to kick a friend's ass by tietokone-olmi · 2000-12-28 04:13 · Score: 4

Actually, that's pretty much what the Pentium Pro (ergo p2, p3, celeron, celeron2 and the p4) do - only there it's done using "virtual registers" which means that the register "eax" can map to a completely different physical register if the instruction scheduler needs it to.

For example, you could write your code like this:

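mov ebx, Pointer           ; ebx = address stored in Pointer
movzx ecx, word ptr [ebx]  ; ecx = 16-bit value at that address, zero-extended
mov eax, [ebx+ecx]         ; load the dword at base + offset
mov Pointer2, eax          ; store the result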

(now I'm pretty sure that's not the best way to do it - it's just an example, ok?)
Now, if you have another multi-instruction operation after this and it's going to use any of the registers used above, the CPU will see in the decoding phase that "a-ha! eax has received a value that doesn't depend on itself (i.e. a completely new value)" and will assign a different physical register to "eax" until it's overwritten again. (this is also the reason why xor reg,reg is not the preferred way of clearing a register on the ppro and up.) Same for ebx and ecx and the other regs. By the time the CPU is finished decoding these instructions (this would take 1 and 1/3 cycles for ppro through p3 and 1 cycle for the p4 (due to the 4-1-1-1 decoders)), the reorder buffer (that receives the decoded instructions, also called micro-ops or uops) will have been filled up with previously decoded instructions and will be able to put as many uops into the execution "ports" as possible (3 per cycle in ppro through p3, not sure about the p4).

This, of course, assumes that the code is organised so that the decoders can feed the reorder buffer with more than 3 micro-ops per decoding cycle, so that there's something to reorder. But this will, for the most part, take care of that data-dependency problem.
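To make the renaming idea concrete, here's a toy sketch in Python. It is not how the PPro's renaming hardware actually works internally; the class name, the p0/p1 naming and the (dest, sources) instruction tuples are all invented for the example. It only shows the principle: every write to an architectural register gets a fresh physical register, while reads go through whatever the current mapping is.

```python
import itertools


class Renamer:
    """Toy register renamer: architectural name -> physical register."""

    def __init__(self):
        self._fresh = (f"p{i}" for i in itertools.count())
        self.table = {}  # current mapping, e.g. {"eax": "p3"}

    def _lookup(self, reg):
        # A register that is read before any write still needs a physical home.
        if reg not in self.table:
            self.table[reg] = next(self._fresh)
        return self.table[reg]

    def rename(self, dest, sources):
        """Rename one instruction given as (dest, [source registers])."""
        # Sources are read through the existing mapping first...
        phys_sources = [self._lookup(s) for s in sources]
        # ...then the destination gets a brand-new physical register.
        self.table[dest] = next(self._fresh)
        return self.table[dest], phys_sources


r = Renamer()
program = [
    ("ebx", []),              # load a pointer into ebx
    ("ecx", ["ebx"]),         # load an offset through ebx
    ("eax", ["ebx", "ecx"]),  # load using both
    ("eax", []),              # later, unrelated code overwrites eax
    ("edx", ["eax"]),         # reads only the *new* eax
]
for dest, srcs in program:
    print(dest, srcs, "->", r.rename(dest, srcs))
```

The two writes to eax land in different physical registers, so the later code doesn't have to wait for the earlier load to finish with "eax".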

Personally, I prefer explicit register setting (a'la PowerPC, 32 int regs + 32 fp regs) so that the CPU won't have to schedule instructions for me...

(all this information, except for the p4 decoder uop-max series, comes from the excellent pentopt.txt file.)

Because .... by taniwha · 2000-12-28 04:28 · Score: 4

the problem you're trying to solve is the long latencies to main memory and the fact that the CPU is idle for long periods when it has to wait for them. Basically, if you've gone to the trouble of building a cool OO (out-of-order) CPU with register renaming, scoreboards etc etc then setting it up with an extra PC and the hardware to manage an extra thread is (theoretically) relatively easy - doing it for something like an x86 with state up the wazoo is probably rather harder.

Having gone down the route of doing a paper design for an SMT I know that one of the real problems with SMT in traditionally piped CPUs (i.e. non-OO) is that with today's deep pipelining the cost of thread switches is really high - often to the point of being useless.

The alternative (SMP) is good for other reasons - you can potentially reduce the size of synchronous clock domains on a die - design time may be lower (build one and lay out 8). The downsides have to do with memory architectures (cross bars, buses, cache paths etc).

Billion Transistor Chips by Heretic2 · 2000-12-28 05:06 · Score: 5

I read a very good paper while taking Systems Architecture at UT by Dr. Berger that he wrote while he was at Wisconsin. They simulated three different billion transistor architectures:

Massively Parallel/Pipelined ala today's processor
SMT
Multiple simple core on-die

The MSC (forgot the real abbreviation, but that's what I'm going to call it) architecture had 4 simple, identical cores. Each core was somewhere between a Pentium and a K6 in terms of complexity--lean on scheduling logic, heavy on execution hardware--each with an independent, decent sized L1 cache. The MSC chip had a large on-die L2 cache, quad-ported or oct-ported, that all processing cores could access quickly and simultaneously, and a fat L3 cache to boot on-die. It also contained some special context caching mechanism.

The cores are actually able to execute in different contexts as well, not just within the same context as with SMT. This opens up parallelization across more than one process.

One of the more interesting problems in a billion transistor chip is the wire delay. With processes so small that a billion transistors can be put on a moderate sized die, the clock rate is so high that the wire delay from one side of the chip to the other can be over 100 clock cycles! So locality of information becomes extremely important. With multiple, simple processing cores, all the logic for the pipeline is close together. The data is readily available in L1 cache. The scheduling logic has been mostly handled outside the cores, all they have to do is crunch numbers within their context as fast as possible. They don't have to worry about sending/receiving signals from very far on the chip and the resultant delay, so everything is local and fast.

Additionally, it's the least complex chip to design. Only one processing core needs to be designed and tested since it's duplicated 4 times. The core is much simpler than other designs. The scheduling logic is all much simpler and easier to test. Most of the die space is devoted to localized caches and execution units, not scheduling logic.

In the benchmarks the SMT and MSC processors vastly outperformed a conventional massively pipelined/parallel billion transistor processor. And the MSC performed 20+% better (on average) than the SMT processor.

On top of all that, to get the best performance from SMT processors you need very smart compilers that are able to find parallelizable code and generate the binary for such. With MSC this isn't a problem. It'll run multi-threaded code simultaneously, but it'll also run multiple processes or any combination of both processes and threads simultaneously without help from smart compilers.

Ryan Earl
Student of Computer Science
University of Texas
