A couple of years back the network in our department started to act very sluggishly, with random connection losses, etc. After several days of router reboots and tracking the problem down, the IT guys found that some student was running a hacked-up version of routed on his or her Linux box (presumably as part of a research project). This version was, shall we say, not playing nice with the rest of the routers.
You're comparing one SMT CPU to two regular CPUs (on the same chip or not.)
Yes, but I'm assuming equal (or as near equal as is practically possible) execution resources.
So what's SMT good for? It doesn't cost nearly as much as two whole CPUs.
See, now that's where I disagree. In terms of raw transistors you are right. But you are forgetting the cost of design and slippage in time-to-market. It is less costly to design and fabricate an MP design from previously designed cores than it is to take such a core and modify it for SMT.
Realize that I am not saying this should never be done. Just that more evaluation is needed. Some of that evaluation will involve silicon, probably in the consumer market.
but relative to a single non-SMT CPU on the same amount of silicon, I think it's almost always a win.
If you look at raw transistor count, I might agree with you. It depends greatly on the architecture and the expense of the duplicated vs. shared resources of the two designs (decode logic, etc.). SMT has a cycle time impact and you have to balance that against the extra transistors required for a CMP.
You add more of everything else, and share the execution units.
This is an argument I've never understood. The execution units are such a tiny, tiny part of the die that I don't see much benefit in sharing them. Sharing the decode/O-O-O logic seems more beneficial, but even an SMT requires more of that (in terms of bandwidth).
Decode is pretty much trivial on anything but an IA32.
Sure about that? The POWER architecture is pretty complex. Even on a MIPS-like machine, the rename and dependency logic complexity rises rapidly with fetch/issue width. As does the wakeup logic with larger instruction windows.
ILP and scheduling have everything to do with cycle time. You can trade off one to get the other, and have a CPU that gets the job done in the same amount of time.
Time = instructions * CPI * clock period
So raising the clock speed (smaller clock period) has the same effect as decreasing the average number of cycles per instruction (more ILP).
My point is that raising ILP with SMT can decrease the clock speed (increase the cycle time). So you get more ILP but everything runs more slowly. A CMP with an equivalent number of contexts should get the same ILP without the extra cycle time penalty (ignoring messaging and coherence overhead). SMT does not make any one thread run faster. It increases throughput, which is exactly what a CMP does.
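To put rough numbers on that, here is a minimal sketch in Python using the Time = instructions * CPI * clock period relation quoted above. Every number in it (the instruction count, the perfectly halved aggregate CPI, the 10% cycle time penalty) is an assumption picked for illustration, not a measurement of any real design.

```python
def exec_time(instructions, cpi, clock_period_ns):
    """Time = instructions * CPI * clock period (result in ns)."""
    return instructions * cpi * clock_period_ns

# Hypothetical numbers, chosen only to illustrate the trade-off.
INSTRUCTIONS = 1_000_000      # instructions per thread
BASE_CPI = 1.0                # cycles per instruction on a single-thread core
BASE_PERIOD_NS = 1.0          # clock period of that core

# CMP: two threads, one core each, running in parallel at the base clock.
# Coherence and messaging overhead are ignored, as in the text.
cmp_time = exec_time(INSTRUCTIONS, BASE_CPI, BASE_PERIOD_NS)

# SMT: both threads share one core. Assume the best case for SMT, where the
# aggregate CPI halves (twice the work per cycle), but the wider shared
# structures stretch the clock period by an assumed 10%.
smt_time = exec_time(2 * INSTRUCTIONS, BASE_CPI / 2, BASE_PERIOD_NS * 1.10)

print(f"CMP: {cmp_time:.0f} ns, SMT: {smt_time:.0f} ns")
```

Under these assumptions the SMT finishes the same two threads about 10% later than the CMP, purely because of the clock. Change the penalty or the overlap and the answer moves, which is the point: it has to be evaluated, not assumed.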
Pipelining is not a panacea, either. There is a limit to how deep you want to make your pipe. This is one of the reasons good branch prediction is so important -- it allows a longer pipe.
Remember also that cycle time is pretty much the only thing companies have to market, everything else (i.e. number of threads) being equal. Is this unfortunate? Clearly. But it is reality and engineers need to deal with it when making design decisions. Which reminds me to plug the excellent book The Soul of a New Machine by Tracy Kidder, a fascinating account of Data General's race to kill the VAX. There's a bit of discussion devoted to market times and perfect designs.
Two small windows on two threads will find more instructions to run than one large window on one thread.
A CMP has two small(er) windows for two threads. An SMT has one big window for two threads. Two smaller windows should run faster. Whether they find more ILP is an open question. SMT does have the advantage that it can trade off window space, etc. between threads. I think this is more difficult to do than most people realize due to the challenges with fetch policy.
In any event, I am extremely curious to see what happens with the SMT chips coming out. Let's sit back and see if they make it! :)
Actually, unless you want to take a moderate speed hit from recycling external bus protocols internally, you'll have quite a bit of design work on your hands building the internal communications bus for a multi-core system.
I don't know exactly what you mean by "recycling external bus protocols." Certainly you'd want to design the core interface to be efficient in a single-die environment, but it seems to me that the fundamental protocols (cache coherence, etc.) are the same. But I'm not an MP expert, so you probably have more insight on this than I do.
The fetch/decode hardware is a straight duplication of the existing hardware - it doesn't take up any more space than for two duplicated cores.
Not true. The SMT has to worry about fetch policy. This is not a trivial problem to solve. Starvation is a real concern here. Two independent cores don't need to worry about the fetch stream of one interfering with that of the other.
Decode is not a trivial problem, either. IA32 has big problems with this. My (and others') guess is that's why a trace cache was put on the P4. It's a decode cache!
Then there are problems with a large, multi-ported L1 instruction cache. Or interleaved fetch, which gets back to the first problem.
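To make the fetch policy concern concrete, here is a small Python sketch of two ways an SMT front end might pick which thread to fetch from each cycle. Both functions and their names are hypothetical. The first is a naive fixed-priority policy; the second is a count-based heuristic of the general kind discussed in the SMT literature (favor the thread with the fewest instructions already in flight). Neither is claimed to be what any shipping processor does.

```python
def pick_fixed_priority(in_flight, blocked):
    """Always fetch from the lowest-numbered thread that is not blocked.

    in_flight: dict mapping thread id -> instructions currently queued
    blocked:   set of thread ids that cannot fetch this cycle
    """
    candidates = [t for t in sorted(in_flight) if t not in blocked]
    return candidates[0] if candidates else None


def pick_fewest_in_flight(in_flight, blocked):
    """Fetch from the runnable thread with the fewest queued instructions.

    Ties are broken by thread id so the result is deterministic.
    """
    candidates = [t for t in in_flight if t not in blocked]
    if not candidates:
        return None
    return min(candidates, key=lambda t: (in_flight[t], t))


# Thread 0 has clogged the queue; thread 1 has almost nothing in flight.
queue_state = {0: 12, 1: 3}
print(pick_fixed_priority(queue_state, blocked=set()))    # keeps feeding thread 0
print(pick_fewest_in_flight(queue_state, blocked=set()))  # gives thread 1 a turn
```

With fixed priority, thread 1 only ever fetches when thread 0 is blocked, which is the starvation problem. The count-based policy avoids that here, but it needs per-thread counters and a comparison in the fetch path, and two independent cores need neither.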
The architectural register file is in two independent banks; again, no more space than you'd have normally.
Except for the additional routing logic and wiring.
Your physical register file is probably in the form of distributed reservation stations on the functional units; again, no more space than for duplicate cores.
First off, pet peeve of mine. This is not directed to you personally, but to the computer architecture community in general. It's function (or execution, etc.) unit, not functional unit. I would hope all our execution units are functional. :)
As for the physical file, distributing it implies a non-uniform register access time a la the 21264. It's not an impossible problem to handle but there is a penalty with such a large file. Register caching can help with this and it may not be a large concern in the end. More study is needed in this area.
If necessary, you can trade off bandwidth and latency when building the thing
Certainly. This is why engineering is fun. :) The point of my post is that SMT is not a guaranteed win. It might be beneficial in some situations, but not all. I'm not sure it's justified for a POWER-class machine.
Renaming hardware is manageable. You'd need the bandwidth anyways on a wide-issue single-thread processor with the same issue rate.
Ah, but there is no single-thread machine that has the bandwidth of the proposed SMT schemes. Building a 4-way machine is a challenge. 8-way should be much more challenging.
We're reaching the point of diminishing returns with cache size anyways, so you'll probably still have enough cache to effectively handle both threads.
That's not true on server-class machines. Capacity is still a problem. This is why we see superlinear speedup on MP systems. The extra cache on each chip makes the machine as a whole run faster than you would expect, given the number of processors. SMT takes this distributed cache and puts it in one big array. This will slow it down in one way or another.
Re. clock speed, again, you don't mind a _moderate_ amount of extra latency, because you have enough parallelism to reschedule around it.
Eh? ILP has nothing to do with cycle time, save the impact on cycle time that complex O-O-O hardware can have. One cannot "get around" a slow clock through scheduling. Pipelining can be used, but that has its own costs.
You also can get away with a smaller instruction window, because you won't have to work as hard to find independent instructions. This saves latency in the scheduler.
Hmm...maybe. But you argue above that any extra latency in the system can be masked by the O-O-O engine. To me, this implies a larger window. Eventually one thread or another is going to get backed up waiting on memory. When that happens you have to have room available to fetch from the other threads. Has anyone done any studies of instruction queue utilization in SMT? I'd like to know how often the queue is full and how big it has to be to sustain execution. I seem to recall some of the work out of U. Wash. doing this, but I can't find the reference at the moment.
All this is not to say that SMT is worthless. Far from it. In fact, the fast thread context switching allows some super-cool techniques not previously possible. I'm trying to temper the enthusiasm for SMT a bit. Think of me as a Devil's Advocate. :)
My impression is that the Power4 is going to have two cores, but I haven't been following it closely, so I could easily be wrong about that.
It has two cores for redundancy and error-checking, not for execution bandwidth.
In summary, you can keep most of the design the same as for a single-thread machine, and make relatively minor changes in a few places to implement SMT. This takes far less silicon than dual cores, and lets you use the functional units more efficiently and use a wide issue unit efficiently (by boosting parallelism in the instruction stream).
I'm not sure I buy this argument. It is far easier to duplicate an existing design to make another core than to modify the design to support SMT. SMT requires big fetch/decode/rename hardware, a big register file and probably big caches, too. All of this stuff is in the critical path and will be difficult to run at the high clock speeds expected from modern cores.
Clearly, this is not a good thing or a moral thing to do -- I can defend Bob and Joe trading MP3s, but if they do it via Sally's open share (and grab some of her files too), that's a totally different thing.
Why? Because in one case the theft victim is an identifiable individual and in the other it's a corporation?
The problem is, the corps are going to point to this and say: "See? These geeks are just a bunch of thieves and pirates!".
In this case, it seems fairly clear-cut that they are right :-).
If you accept that, you have to accept that Napster, etc. are just as guilty. Either the lawbreaking is performed by the individual software users or it is performed by the software creators. You can't have it one way in one situation and the other in another.
Unfortunately, the built-in modules don't work with my 3Com vortex card. AFAIK, it is still recommended to use the standalone modules package until the interfaces are worked out in the kernel.
cd /usr/src/linux
make xconfig
make-kpkg clean
make-kpkg --revision=some_unique_tag kernel_image
dpkg -i ../kernel-image-etc.
shutdown -r now
If you have PCMCIA modules, you'll have to get the latest PCMCIA modules package and use make-kpkg to create a .deb and install that as well. It is just as easy as making a kernel .deb.
The cores are actually able to execute in different contexts as well, not just within the same context as with SMT.
Why is this a limitation of SMT? SMTs have been simulated with multiprogrammed workloads for years. I'm honestly curious as I very likely might not be seeing something obvious.
Do you have a reference for the paper? I know several folks who'd be interested. Where did the 20% improvement of the MSC come from?
I can only assume the MSC is an abbreviation for the MultiScalar architecture (MSC == MultiScalar Computer?) that came out of Wisconsin. Is this correct?
IMHO it's not worth it. That kind of work often requires serious rethinking of data structures which generally affects every part of the program. Eventually you end up rewriting the whole program. Programmer time is too expensive for that. Better to properly design things for readability and maintainability than to try to get 5% more performance out of it.
This is why we have compilers and hardware. Compilers already do a fair amount of program transformation. Often the programmer, in a quest for "optimization," inevitably screws something up for the compiler, usually by doing "fast" pointer manipulation or using global variables.
Eliminating it absolutely is the goal of SMT! SMT works on the principle that when a thread blocks due to a cache miss, mispredicted branch, etc. it can execute from another thread. Don't think in terms of heavyweight threads, synchronization and SMP.
An SMT doesn't really "context switch" in the traditional sense of the word. The reason it is "simultaneous" is that all threads are executing at the same time, like an SMP. It only "context switches" in the sense of fetching "more" from whatever threads are not currently blocked (or predicted to be blocked).
An MTA (multi-threaded architecture, of which SMT is a variation) really does "context switch," but it is fast enough that it can be done every few instructions. This is a very fine grain of threading, more akin to instruction-level parallelism than SMP.
With an SMT/MTA, the job of the OS (as on an SMP) is to schedule runnable threads on the processor. Beyond that the hardware takes care of deciding what to fetch and from where.
For optimal performance, the compiler has to generate instructions to release registers when their values are not needed anymore.
Reference? I don't recall reading anything about "register freeing" instructions wrt. SMT. A compiler "releases" a register by redefining its value. They've done that for years. :)
It's true that a machine must hold onto a physical register until a redefinition of the corresponding logical register is committed, but this isn't a problem in "traditional" O-O-O architectures, where the number of physical registers is adjusted to eliminate any difficulties this might imply. Register-caching architectures need to worry about stuff like this, but it has nothing to do with SMT per se.
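For anyone who hasn't seen how that works, here is a toy Python model of the rule: a physical register goes back on the free list only when the instruction that redefines its logical register commits. The `Renamer` class, its method names, and the register counts are all made up for illustration; a real rename stage also handles source operands, branches and squashes, which this sketch leaves out.

```python
from collections import deque


class Renamer:
    """Toy rename map plus free list (destination registers only)."""

    def __init__(self, num_logical, num_physical):
        # Initially logical register i lives in physical register i.
        self.map = list(range(num_logical))
        self.free = deque(range(num_logical, num_physical))
        # Previous mappings of in-flight writes, in program order.
        self.pending = deque()

    def rename_dest(self, logical):
        """Allocate a fresh physical register for a write to `logical`."""
        if not self.free:
            raise RuntimeError("rename stall: no free physical registers")
        new = self.free.popleft()
        # The old physical register may still have readers, so it is only
        # remembered here and released when this write commits.
        self.pending.append(self.map[logical])
        self.map[logical] = new
        return new

    def commit_oldest(self):
        """Commit the oldest in-flight write and free the register it replaced."""
        old = self.pending.popleft()
        self.free.append(old)
        return old


r = Renamer(num_logical=2, num_physical=3)
r.rename_dest(0)      # takes the only spare physical register
# A second write would stall here until something commits:
r.commit_oldest()     # the redefinition of logical 0 commits, physical 0 is freed
r.rename_dest(1)      # now succeeds, reusing physical 0
```

With only one spare physical register the machine stalls almost immediately, which is why real designs size the physical file generously rather than asking the compiler for "release" instructions.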
Erm...huh? I've not heard of it. Can you provide a reference?
Perhaps he's referring to the G5 processor? IBM's POWER line (S390, etc.) has used multiple execution cores for a while, but not for throughput. They use them for verification and reliability. One core checks the other and if one fails, the processor shuts down and its work is transferred to another node in the SMP system.
I can easily see an engineering workstation making good use of SMT. In my environment, I'd love to fire off 4+ compilation threads to my SMT processor. Right now we have to use expensive quad Xeons to do it. SMT is cheap multiprocessing for the masses.
Heh...you've asked your way into a very complex environment.
There are many factors that affect the ILP available in "typical" programs. The two most important limiting factors are the memory subsystem and the branch predictor. Anywhere from 30%-60% (depending on architecture) of your dynamic instructions are memory operations. When these miss in the cache, it takes a long time to service them. This backs everything up in the instruction window. O-O-O tries to get around this by issuing independent instructions to the core. The problem is, either no such instructions are available or they also block on a memory operation.
On the I-fetch side, the branch predictor is responsible for feeding the "right" stuff into the instruction queue. If a prediction is incorrect, the processor generally has to blow away everything in progress and start over on the right path. With the deeper pipelines we're seeing, this is only going to get more expensive. Even a 90% correct predictor incurs a huge penalty because the processor sees a branch every 5-8 instructions or so. The multiplicative factors ensure that accuracy diminishes quickly. No one has yet come up with a good multiway branch predictor.
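The "multiplicative factors" are easy to see with a few lines of Python. This is a back-of-the-envelope sketch that assumes each prediction is independent and picks one branch every 6 instructions (inside the 5-8 range above); the window sizes are likewise just example values.

```python
def branches_in_flight(window_size, instrs_per_branch):
    """Roughly how many unresolved branches sit in a window of this size."""
    return window_size // instrs_per_branch


def p_all_correct(accuracy, branches):
    """Chance that every one of `branches` independent predictions is right."""
    return accuracy ** branches


ACCURACY = 0.90          # per-branch predictor accuracy from the text
INSTRS_PER_BRANCH = 6    # assumed; the text says every 5-8 instructions

for window in (32, 64, 128):
    n = branches_in_flight(window, INSTRS_PER_BRANCH)
    p = p_all_correct(ACCURACY, n)
    print(f"window={window:3d}  branches={n:2d}  P(whole window on right path)={p:.3f}")
```

With ten branches in flight, the chance that the whole window is on the correct path is only about one in three, so most of a large window is speculative work that may be thrown away.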
So on one level, the hardware is to blame because it doesn't work right. Not only do memory and branches choke off the available ILP, the machine can't look "far away" to discover distant parallelism. Instructions after a function call, for example, are often independent of those before the call, but there is no way the processor can fetch that far ahead.
Enter the compiler. It is the compiler's job to schedule instructions such that the processor can "see" the parallelism. Unfortunately, this is very hard to do. Mostly this is due to the static nature of the compiler -- while the compiler can look far ahead (theoretically at the whole program, in fact), it doesn't know what will happen at runtime. The hardware has the advantage of (eventually) knowing the "right" path of execution. A compiler generally cannot schedule instructions above a branch because it doesn't know whether it is valid to execute them. In fact, the validity changes with each dynamic pass through the code.
We're seeing some of these limitations being lifted with dynamic translation and recompilation. Unfortunately, you're now saddling the compiler with the limitations of the hardware: limited lookahead. It is too expensive to do a "really good" job of optimization at runtime. Still, there is some improvement to be had here.
To sum up, the blame lies solely with neither the hardware nor the software. There is a complex interplay here that is only now beginning to be understood.
Generally, the argument is one of utilization. The goal of an SMT processor is to be as efficient as possible. Think throughput rather than latency.
With an SMP, each thread has resources dedicated to it: caches, function units, etc. In an SMT system these are shared dynamically across threads. Theoretically, each thread uses just as many resources as it needs for its level of instruction-level parallelism. So instead of each processor using, say, 2 integer units out of an available four, you now have 8 integer units being used at 90% capacity by multiple threads.
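Here is the utilization argument as a toy Python calculation. The per-cycle demand figures are invented, and the model ignores everything except function unit count (no fetch limits, no cache effects), so treat it as an illustration of the idea rather than a performance estimate.

```python
def issued_dedicated(demands, units_per_core):
    """Each thread owns its own units; demand above that is simply lost."""
    return sum(min(d, units_per_core) for thread in demands for d in thread)


def issued_shared(demands, total_units):
    """All threads draw from one pool of units every cycle."""
    return sum(min(sum(cycle), total_units) for cycle in zip(*demands))


# Made-up integer-unit demand per cycle for two bursty threads.
demands = [
    [6, 1, 5, 2],   # thread A
    [1, 6, 2, 5],   # thread B
]
cycles = len(demands[0])
total_slots = 8 * cycles   # 8 integer units in both configurations

smp = issued_dedicated(demands, units_per_core=4)   # two cores, 4 units each
smt = issued_shared(demands, total_units=8)         # one core, 8 shared units

print(f"dedicated: {smp}/{total_slots} slots used")
print(f"shared:    {smt}/{total_slots} slots used")
```

Because the two threads peak on different cycles, the shared pool absorbs bursts that the dedicated units have to drop. If both threads peaked together the advantage would vanish, so the benefit depends on the workload mix.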
Note that these threads need not all be from the same program, either. SMT works great in a multiprogrammed environment.
Due to its ability for fast context switching, we're going to see some...interesting applications of threading. Check out the MICRO/ISCA/PACT, etc. papers on Dynamic Multithreading, Polypath architectures and Simultaneous Subordinate Microthreading (all of which, BTW, increase the performance of single-threaded applications). Wild stuff is on the horizon.
Of course RL communities won't go away. But participation in them will diminish for some, as it has with the spankenstein's mother. It's somewhat analogous to the decline in the idea of neighborhoods caused by planned suburban subdivisions and gated communities (what an absurd idea!). We no longer have the corner markets or the neighborhood parks to foster community interaction.
And as for pursuing romantic relationships, I question the validity of a truly loving relationship fostered solely on-line or with few RL encounters. It can't happen. Certainly people can meet on-line, maybe even become infatuated on-line. But love on-line? No way.
In all seriousness, Bush won the election because the courts said he did. This is not even all that new as the same thing happened to Hayes. In fact, I'd say the Hayes case was even more bizarre because he eventually won through Republican compromises in Congress. Not to say that's a bad thing, as it's the core of our system of government.
we won't know who the hell won this election, ever...
We know who won the election. What we don't know is how many under/overvotes would have gone to either candidate. There's a good argument to be made that none of them should go to either candidate.
The real tragedy out of all of this is that we still don't know what the voting standard for punchcards is!
its kindof sad, actually... the nation that is supposedly one of the "strongest democracies" has lost all credibility...
It is sad, but not for the reasons you think. It is sad that people have so little faith in this country that a close vote is enough to cast doubt on over 200 years of a successful Republic.
face it, "the people" dont have a voice in this country anymore...
No, I will not face it, because it isn't true. People have as much voice as they are willing to put forth. When half of the population doesn't vote, we've no right to complain about the "voice of the people" not being heard.
This was a closely contested election, nothing more. It's over. There's nothing to see here. Move along.
If my Mom lived more than a few blocks from my grandma I would think that it is even cooler.
And here is the problem with the .net. Why doesn't your mom just go over to see granny?
I understand the power to bring people physically far apart close together through communication, and this is a wonderful thing. But I also think there's a danger of destroying community IRL. I have similar problems with folks stuck in "Gaming culture," "Hacker culture" and "Literary culture," just to name a few.
if and no, "they" did not select GWB, he lost the popular vote, and probably lost florida, too...
He lost the popular vote? Prove it. There are millions of uncounted absentee ballots in California alone. The fact is, we will never know who won the popular vote because it does not matter.
However, I question how much of an _impact_ the bad side effects have in practice. It's non-negligible, but that leaves a lot of territory open.
IMHO, many people underestimate the problems associated with lots of instructions. The effective window size of the processor is reduced and more memory operations wreak havoc on dependency checking, for example.
However, most of your works focus on how the program use and physical performance of a register file of a given size may be improved.
Really, we focus on how a large register file can be used, but you are essentially correct.
The advantage of a relatively large register file (at least for the 64-vs-32 case) is found to be relatively modest (5%-20%).
Is 20% modest? It's a heck of a lot more than most hardware optimization papers get. :) The paper you cite didn't include our more recent work on speculative register promotion and some windowing stuff we're looking at.
You're right in that a 20% speed increase does not make a processor overwhelmingly more marketable. But remember that most of the speedup we see from generation to generation comes from the circuit technology.
It is interesting as you point out that x86 seems to be holding its own against machines with larger register files. I suspect this may have something to do with circuit technology, better cache utilization and other such things. I only know from our own experiments that if we turn our register set down to 8 registers, things are really hosed. Not only is there lots of spilling, but the compiler has to throttle itself to avoid even more spilling.
Granted, most of the spill code is going to be in the cache. One way to look at it is that this extra code forces the caches to be larger. Eliminating instructions (especially memory ops) reduces the size of the machine overall (caches, issue logic, etc.). This in turn will greatly reduce production costs.
Ah the interplay...this is why I love sitting on the fence! :)
However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.
This is false. Lack of registers increases the number of required memory operations. Not only do you increase data cache bandwidth and occupancy, you do the same on the instruction cache as well. Instructions are not free, regardless of whatever "throw hardware at it" myth is popular today. Memory instructions also create headaches for the instruction scheduler. It's much easier to do dependency checking on static indices.
There are more things to fetch, decode, schedule, execute and retire. What's good about that?
It hurts. A lot. Check out some of our papers on the subject, especially this one, which contains many references to other work.
Also, you talked about the SPEC benchmarks in your post. I'd like to stress that SPEC tests a platform, not just a CPU. It depends a lot on the memory subsystem and on the compiler. In the case of the P4, the good SPEC_FP scores are largely due to the large memory bandwidth and the use of some SSE2 code by the Intel compiler. I'd like to see scores using a more conventional compiler, it would show that these scores are not only due to Intel architects, but also to their compiler team!
You raise a very good point here. SPEC tests a system platform because that is what is relevant. No one cares about raw CPU speed, nor should they. The CPU is designed for the system in which it will function. Things like caches, instruction window size, branch prediction, etc. are all designed to tolerate the latencies of the expected system platforms.
Using a "more conventional compiler" doesn't make any sense. What's a conventional compiler? The compiler and CPU designs should go hand-in-hand.
The other negative of the x86 ISA, namely the paucity of compiler-visible registers, is indeed a problem, although one partially alleviated by rename registers and partially by evolutionary extensions to the x86 ISA, such as SSE2, which will eventually replace much of the god-awful stack-based x87 FPU ISA.
Partially correct. Hardware register renaming does nothing to alleviate the low compiler-visible register count. The compiler still has to spill much, much more than it should on the x86. It is true however that SSE2 is a big improvement. Getting rid of the FP stack will really boost performance in that arena.
Why not add more compiler-visible registers, one may ask? Well, the problem is the encoding size. The x86 ISA is very nice in the sense of code size. This matters when you consider ICache sizes. Intel can get away with smaller ICaches because more instructions fit into the cache for a given size than in a similar 32-register RISC machine. It's interesting that Intel essentially used the same trick with the trace cache. They compress the micro-ops so more of them will fit into the cache.
Needless to say, "The designers thought of it, implemented it (which they did in this case), and it was a good feature (i.e. improved performance/cost on a majority of code), but then made a boneheaded decision not to use it," is *not* on the list.
Amen! Intel engineers are not stupid. They removed these features for good reasons. Intel has a policy that a 1% area increase must result in at least 1% performance increase. This was obviously not the case with these structures.
Frankly, I am stunned at what Intel has been able to do with the x86 ISA.
Always play in the sandbox!
--
Yes, but I'm assuming equal (or as near equal as is practically possible) execution resources.
See, now that's where I disagree. In terms of raw transitors you are right. But you are forgetting the cost of design and slippage in time-to-market. It is less costly to design and fabricate an MP design from previously designed cores than it is to take such a core and modify it for SMT.
Realize that I am not saying this should never be done. Just that more evaluation is needed. Some of that evaluation will involve silicon, probably in the consumer market.
If you look at raw transistor count, I might agree with you. I depends greatly on the architecture and the expense of the duplicated vs. shared resources of the two designs (decode logic, etc.). SMT has a cycle time impact and you have to balance that against the extra transistors required for a CMP.
This is an argument I've never understood. The execution units are such a tiny, tiny part of the die that I don't see much benefit in sharing them. Sharing the decode/O-O-O logic seems more beneficial, but even an SMT requires more of that (in terms of bandwidth).
Sure about that? The POWER architecture is pretty complex. Even on a MIPS-like machine, the rename and dependency logic complexity rises rapidly with fetch/issue width. As does the wakeup logic with larger instruction windows.
My point is that raising ILP with SMT can decrease the clock speed (increase the cycle time). So you get more ILP but everything runs more slowly. A CMP with an equivalent number of contexts should get the same ILP without the extra cycle time penalty (ignoring messaging and coherence overhead). SMT does not make any one thread run faster. It increases throughput, which is exactly what a CMP does.
Pipelining is not a panacea, either. There is a limit to how deep you want to make your pipe. This is one of the reasons good branch prediction is so important -- it allows a longer pipe.
Remember also that cycle time is pretty much the only thing companies have to market, everything else (i.e. number of threads) being equal. Is this unfortunate? Clearly. But it is reality and engineers need to deal with it when making design decisions. Which reminds me to plug the excellent book The Soul of a New Machine by Tracy Kidder, a fascinating account of Data General's race to kill the VAX. There's a bit of discussion devoted to market times and perfect designs.
A CMP has two small(er) windows for two threads. An SMT has one big window for two threads. Two smaller windows should run faster. Whether they find more ILP is an open question. SMT does have the advantage that it can trade off window space, etc. between threads. I think this is more difficult to do than most people realize due to the challenges with fetch policy.
In any event, I am extremely curious to see what happens with the SMT chips coming out. Let's sit back and see if they make it! :)
--
I don't know exactly what you mean by "recycling external bus protocols." Certainly you'd want to design the core interface to be efficient in a single-die environment, but it seems to me that the fundamental protocols (cache coherence, etc.) are the same. But I'm not an MP expert, so you probably have more insight on this than I do.
Not true. The SMT has to worry about fetch policy. This is not a trivial problem to solve. Starvation is a real concern here. Two independent cores don't need to worry about the fetch stream of one interfering with that of the other.
Decode is not a trivial problem, either. IA32 has bug problems with this. My (and others') guess is that's why a trace cache was put on the P4. It's decode cache!
Then there are problems with a large, muti-ported L1 instruction cache. Or interleaved fetch, which gets back to the first problem.
Except for the additional routing logic and wiring.
First off, pet peeve of mine. This is not directed to you personally, but to the computer architecture community in general. It's function (or execution, etc.) unit, not functional unit. I would hope all our execution units are functional. :)
As for the physical file, distributing it implies a non-uniform register acces time a la the 21264. It's not an impossible problem to handle but there is a penalty with such a large file. Register caching can help with this and it may not be a large concern in the end. More study is needed in this area.
Certainly. This is why engineering is fun. :) The point of my post is that SMT is not a guaranteed win. It might be beneficial in some situations, but not all. I'm not sure it's justified for a POWER-class machine.
Ah, but there is no single-thread machine that has the bandwidth of the proposed SMT schemes. Building a 4-way machine is a challenge. 8-way should be much more challenging.
That's not true on server-class machines. Capacity is still a problem. This is why we see superlinear speedup on MP systems. The extra cache on each chip makes the machine as a whole run faster than you would expect, given the number of processors. SMT takes this distributed cache and puts it in one big array. This will slow it down in one way or another.
Eh? ILP has nothing to do with cycle time, save the impact on cycle time that complex O-O-O hardware can have. One cannot "get around" a slow clock through scheduling. Pipelining can be used, but that has its own costs.
Hmm...maybe. But you argue above that any extra latency in the system can be masked by the O-O-O engine. To me, this implies a larger window. Eventually one thread or another is going to get backed up waiting on memory. When that happens you have to have room available to fetch from the other threads. Has anyone done any studies of instruction queue utilization in SMT? I'd like to know how often the queue is full and how big it has to be to sustain execution. I seem to recall some of the work out of U. Wash. doing this, but I can't find the reference at the moment.
All this is not to say that SMT is worthless. Far from it. In fact, the fast thread context switching allows some super-cool techniques not previously possible. I'm trying to temper the enthusiasm for SMT a bit. Think of me as a Devil's Advocate. :)
--
It has two cores for redundancy and error-checking, not for execution bandwidth.
I'm not sure I buy this argument. It is far easier to duplicate an existing design to make another core than to modify the design to support SMT. SMT requires big fetch/decode/rename hardware, a big register file and probably big caches, too. All of this stuff is in the critical path and will be difficult to run at the high clock speeds expected from modern cores.
--
Why? Because in one case the theft victim is an identifiable individual and in the other it's a corporation?
If you accept that, you have to accept that Napster, etc. are just as guilty. Either the lawbreaking is performed by the individual software users or it is performed by the software creators. You can't have it one way in one situation and the other in another.
--
Or just use Debian and do:
cd /usr/src/linux ../kernel-image-etc.
make xconfig
make-kpkg clean
make-kpkg --revision=some_unique_tag kernel_image
dpkg -i
shutdown -r now
If you have PCMCIA modules, you'll have to get the latest PCMCIA modules package and use make-kpkg to create a .deb and install that as well. It is just as easy as making a kernel .deb.
Man, Debian rocks! :)
--
Do you have a reference for the paper? I know several folks who'd be interested. Where did the 20% improvement of the MSC come from?
I can only assume the MSC is an abbreviation for the MultiScalar architecture (MSC == MultiScalar Computer?) that came out of Wisconsin. Is this correct?
--
This is why we have compilers and hardware. Compilers already do a fair amount of program transformation. Often the programmer, in a quest for "optimization," screws something up for the compiler, usually by doing "fast" pointer manipulation or using global variables.
--
An SMT doesn't really "context switch" in the traditional sense of the word. The reason it is "simultaneous" is that all threads are executing at the same time, like an SMP. It only "context switches" in the sense of fetching "more" from whatever threads are not currently blocked (or predicted to be blocked).
An MTA (multi-threaded architecture, of which SMT is a variation) really does "context switch," but it is fast enough that it can be done every few instructions. This is a very fine grade of threading, more akin to instruction-level parallelism than SMP.
With an SMT/MTA, the job of the OS (like on an SMP) is to schedule runnable threads on the processor. Beyond that the hardware takes care of deciding what to fetch and from where.
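To make the "fetch from whatever threads are not currently blocked" idea concrete, here is a minimal sketch. It assumes a plain round-robin choice among unblocked hardware threads, which is my stand-in and not the policy of any particular machine; `pick_fetch_thread` is a made-up name.

```python
from typing import List, Optional


def pick_fetch_thread(blocked: List[bool], last: int) -> Optional[int]:
    """Pick the next hardware thread to fetch from.

    blocked[i] is True when thread i is stalled (or predicted to stall).
    Round-robin starting after `last` is an assumption made for illustration.
    Returns None when every thread is blocked and fetch must idle.
    """
    n = len(blocked)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if not blocked[candidate]:
            return candidate
    return None


# Thread 1 is waiting on memory, so fetch skips it and moves on to thread 2.
print(pick_fetch_thread([False, True, False], last=0))
```

Note there is no save/restore of state anywhere in that loop. The OS only decided which threads are resident; the hardware just keeps choosing where the next fetch comes from.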
--
Reference? I don't recall reading anything about "register freeing" instructions wrt. SMT. A compiler "releases" a register by redefining its value. They've done that for years. :)
It's true that a machine must hold onto a physical register until a redefinition of the corresponding logical register is committed, but this isn't a problem in "traditional" O-O-O architectures, where the number of physical registers is adjusted to eliminate any difficulties this might imply. Register-caching architectures need to worry about stuff like this, but it has nothing to do with SMT per se.
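For anyone who hasn't stared at a rename stage before, here is a toy model of that freeing rule. It is a sketch under my own simplifications (in-order commit, no misprediction recovery, hypothetical class and method names), not a description of any real core.

```python
from collections import deque


class RenameTable:
    """Toy rename map: a physical register is released only when the
    instruction that redefines its logical register commits."""

    def __init__(self, num_logical: int, num_physical: int):
        # Logical register i starts out mapped to physical register i.
        self.map = list(range(num_logical))
        self.free = deque(range(num_logical, num_physical))
        # Old mappings waiting for their redefinition to commit, in program order.
        self.pending_release = deque()

    def rename_dest(self, logical: int):
        """Allocate a physical register for a write; None means rename stalls."""
        if not self.free:
            return None
        new = self.free.popleft()
        self.pending_release.append(self.map[logical])
        self.map[logical] = new
        return new

    def commit_oldest(self) -> int:
        """Commit the oldest redefinition and free the register it replaced."""
        old = self.pending_release.popleft()
        self.free.append(old)
        return old


rt = RenameTable(num_logical=4, num_physical=6)
rt.rename_dest(1)             # takes one spare physical register
rt.rename_dest(1)             # takes the other spare
print(rt.rename_dest(2))      # no spares left, so rename stalls
rt.commit_oldest()            # first redefinition of r1 commits
print(rt.rename_dest(2))      # the register it replaced is usable again
```

With only two spare physical registers the stall shows up immediately. Size the file generously relative to the window and it never does, which is the point about "traditional" O-O-O designs.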
--
Perhaps he's referring to the G5 processor? IBM's POWER line (S390, etc.) has used multiple execution cores for a while, but not for throughput. They use them for verification and reliability. One core checks the other and if one fails, the processor shuts down and its work is transferred to another node in the SMP system.
--
There are many factors that affect the ILP available in "typical" programs. The two most important limiting factors are the memory subsystem and the branch predictor. Anywhere from 30%-60% (depending on architecture) of your dynamic instructions are memory operations. When these miss in the cache, it takes a long time to service them. This backs everything up in the instruction window. O-O-O tries to get around this by issuing independent instructions to the core. The problem is, either no such instructions are available or they also block on a memory operation.
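A back-of-the-envelope version of the memory argument, using the usual stall-cycle model. The base CPI, miss rate and miss penalty below are numbers I picked for illustration, and the model deliberately ignores any overlap the O-O-O engine manages to find.

```python
def effective_cpi(base_cpi: float, mem_fraction: float,
                  miss_rate: float, miss_penalty: float) -> float:
    """Cycles per instruction when every cache miss stalls for the full penalty.

    mem_fraction: share of dynamic instructions that are memory operations
    miss_rate:    misses per memory operation
    miss_penalty: stall cycles per miss (assumed fully exposed, no overlap)
    """
    return base_cpi + mem_fraction * miss_rate * miss_penalty


# Illustrative inputs only: base CPI 0.5, 2% miss rate, 100-cycle miss.
for mem_fraction in (0.3, 0.6):
    cpi = effective_cpi(0.5, mem_fraction, 0.02, 100)
    print(f"{mem_fraction:.0%} memory ops: CPI {cpi:.2f}")
```

Even with a 98% hit rate, the stall term ends up larger than the base CPI in both cases. That is the gap O-O-O issue is trying to hide.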
On the I-fetch side, the branch predictor is responsible for feeding the "right" stuff into the instruction queue. If a prediction is incorrect, the processor generally has to blow away everything in progress and start over on the right path. With the deeper pipelines we're seeing, this is only going to get more expensive. Even a 90% correct predictor incurs a huge penalty because the processor sees a branch every 5-8 instructions or so. The multiplicative factors ensure that accuracy diminishes quickly. No one has yet come up with a good multiway branch predictor.
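To put rough numbers on the multiplicative effect, here is a minimal sketch. It assumes each prediction is independent and that a branch shows up every 6 instructions; both are simplifications of mine, and `correct_path_probability` is just an illustrative name.

```python
def correct_path_probability(accuracy: float, window: int,
                             insns_per_branch: float) -> float:
    """Chance that every branch inside a window of `window` instructions
    was predicted correctly, treating predictions as independent."""
    branches_in_window = window / insns_per_branch
    return accuracy ** branches_in_window


for window in (16, 32, 64, 128):
    p = correct_path_probability(0.90, window, 6)
    print(f"{window:4d}-instruction window: {p:.1%} chance it is all on the right path")
```

With a 64-instruction window that works out to roughly a one-in-three chance that everything in flight is actually useful work, and it only gets worse as the window grows.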
So on one level, the hardware is to blame because it doesn't work right. Not only do memory and branches choke off the available ILP, the machine can't look "far away" to discover distant parallelism. Instructions after a function call, for example, are often independent of those before the call, but there is no way the processor can fetch that far ahead.
Enter the compiler. It is the compiler's job to schedule instructions such that the processor can "see" the parallelism. Unfortunately, this is very hard to do. Mostly this is due to the static nature of the compiler -- while the compiler can look far ahead (theoretically at the whole program, in fact), it doesn't know what will happen at runtime. The hardware has the advantage of (eventually) knowing the "right" path of execution. A compiler generally cannot schedule instructions above a branch because it doesn't know whether it is valid to execute them. In fact, the validity changes with each dynamic pass through the code.
We're seeing some of these limitations being lifted with dynamic translation and recompilation. Unfortunately, you're now saddling the compiler with the limitations of the hardware: limited lookahead. It is too expensive to do a "really good" job of optimization at runtime. Still, there is some improvement to be had here.
To sum up, the blame lies solely with neither the hardware nor the software. There is a complex interplay here that is only now beginning to be understood.
--
With an SMP, each thread has resources dedicated to it: caches, function units, etc. In an SMT system these are shared dynamically across threads. Theoretically, each thread uses just as many resources as it needs for its level of instruction-level parallelism. So instead of each processor using, say, 2 integer units out of an available four, you now have 8 integer units being used at 90% capacity by multiple threads.
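Here is a toy comparison of the two arrangements. The per-cycle demand table is invented for illustration (four threads, each wanting somewhere between 0 and 4 integer units per cycle), and the function names are hypothetical.

```python
from typing import List

# DEMAND[cycle][thread] = integer units that thread could use this cycle.
# Made-up numbers, chosen so the average is about 2 per thread.
DEMAND: List[List[int]] = [
    [2, 1, 3, 2],
    [1, 2, 2, 2],
    [4, 0, 2, 1],
    [2, 2, 1, 3],
]


def smp_utilization(demand: List[List[int]], units_per_cpu: int) -> float:
    """One thread per CPU, each CPU with its own private integer units."""
    used = sum(min(d, units_per_cpu) for cycle in demand for d in cycle)
    capacity = len(demand) * len(demand[0]) * units_per_cpu
    return used / capacity


def smt_utilization(demand: List[List[int]], shared_units: int) -> float:
    """All threads draw from one shared pool of integer units."""
    used = sum(min(sum(cycle), shared_units) for cycle in demand)
    capacity = len(demand) * shared_units
    return used / capacity


print(f"4 CPUs x 4 units: {smp_utilization(DEMAND, 4):.0%} of the units busy")
print(f"1 SMT  x 8 units: {smt_utilization(DEMAND, 8):.0%} of the units busy")
```

The same work gets done with half the integer units in this contrived case. Whether the shared front end and register file can actually feed them is the open question.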
Note that these threads need not all be from the same program, either. SMT works great in a multiprogrammed environment.
Due to its ability for fast context switching, we're going to see some...interesting applications of threading. Check out the MICRO/ISCA/PACT, etc. papers on Dynamic Multithreading, Polypath architectures and Simultaneous Subordinate Microthreading (all of which, BTW, increase the performance of single-threaded applications). Wild stuff is on the horizon.
--
And as for pursuing romantic relationships, I question the validity of a truly loving relationship fostered solely on-line or with few RL encounters. It can't happen. Certainly people can meet online, maybe even become infatuated on-line. But love on-line? No way.
--
GWB. Where have you been? :)
In all seriousness, Bush won the election because the courts said he did. This is not even all that new as the same thing happened to Hayes. In fact, I'd say the Hayes case was even more bizarre because he eventually won through Republican compromises in Congress. Not to say that's a bad thing, as it's the core of our system of government.
We know who won the election. What we don't know is how many under/overvotes would have gone to either candidate. There's a good argument to be made that none of them should go to either candidate.
The real tragedy out of all of this is that we still don't know what the voting standard for punchcards is!
It is sad, but not for the reasons you think. It is sad that people have so little faith in this country that a close vote is enough to cast doubt on over 200 years of a successful Republic.
No, I will not face it, because it isn't true. People have as much voice as they are willing to put forth. When half of the population doesn't vote, we've no right to complain about the "voice of the people" not being heard.
This was a closely contested election, nothing more. It's over. There's nothing to see here. Move along.
--
And here is the problem with the .net. Why doesn't your mom just go over to see granny?
I understand the power to bring people physically far apart close together through communication, and this is a wonderful thing. But I also think there's a danger of destroying community IRL. I have similar problems with folks stuck in "Gaming culture," "Hacker culture" and "Literary culture," just to name a few.
--
He lost the popular vote? Prove it. There are millions of uncounted absentee ballots in California alone. The fact is, we will never know who won the popular vote because it does not matter.
--
IMHO, many people underestimate the problems associated with lots of instructions. The effective window size of the processor is reduced and more memory operations wreak havoc on dependency checking, for example.
Really, we focus on how a large register file can be used, but you are essentially correct.
Is 20% modest? It's a heck of a lot more than most hardware optimization papers get. :) The paper you cite didn't include our more recent work on speculative register promotion and some windowing stuff we're looking at.
You're right in that a 20% speed increase does not make a processor overwhelmingly more marketable. But remember that most of the speedup we see from generation to generation comes from the circuit technology.
It is interesting as you point out that x86 seems to be holding its own against machines with larger register files. I suspect this may have something to do with circuit technology, better cache utilization and other such things. I only know from our own experiments that if we turn our register set down to 8 registers, things are really hosed. Not only is there lots of spilling, but the compiler has to throttle itself to avoid even more spilling.
Granted, most of the spill code is going to be in the cache. One way to look at it is that this extra code forces the caches to be larger. Eliminating instructions (especially memory ops) reduces the size of the machine overall (caches, issue logic, etc.). This in turn will greatly reduce production costs.
Ah the interplay...this is why I love sitting on the fence! :)
--
This is false. Lack of registers increases the number of required memory operations. Not only do you increase data cache bandwidth and occupancy, you do the same on the instruction cache as well. Instructions are not free, regardless of whatever "throw hardware at it" myth is popular today. Memory instructions also create headaches for the instruction scheduler. It's much easier to do dependency checking on static indices.
There are more things to fetch, decode, schedule, execute and retire. What's good about that?
It hurts. A lot. Check out some of our papers on the subject, especially this one, which contains many references to other work.
--
You raise a very good point here. SPEC tests a system platform because that is what is relevant. No one cares about raw CPU speed, nor should they. The CPU is designed for the system in which it will function. Things like caches, instruction window size, branch prediction, etc. are all designed to tolerate the latencies of the expected system platforms.
Using a "more conventional compiler" doesn't make any sense. What's a conventional compiler? The compiler and CPU designs should go hand-in-hand.
--
Partially correct. Hardware register renaming does nothing to alleviate the low compiler-visible register count. The compiler still has to spill much, much more than it should on the x86. It is true however that SSE2 is a big improvement. Getting rid of the FP stack will really boost performance in that arena.
Why not add more compiler-visible registers, one may ask? Well, the problem is the encoding size. The x86 ISA is very nice in the sense of code size. This matters when you consider ICache sizes. Intel can get away with smaller ICaches because more instructions fit into the cache for a given size than in a similar 32-register RISC machine. It's interesting that Intel essentially used the same trick with the trace cache. They compress the micro-ops so more of them will fit into the cache.
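A quick sketch of the encoding arithmetic. The register-field widths follow directly from the register counts; the 3-byte average x86 instruction length and the 16 KB cache are assumptions of mine for illustration, not measurements.

```python
def register_field_bits(num_registers: int, operands: int) -> int:
    """Bits spent naming registers in one instruction."""
    bits_per_operand = (num_registers - 1).bit_length()
    return operands * bits_per_operand


def instructions_per_cache(cache_bytes: int, avg_insn_bytes: float) -> int:
    """How many instructions fit, ignoring tags and line alignment."""
    return int(cache_bytes // avg_insn_bytes)


# 8 registers, two-operand form vs. 32 registers, three-operand form.
print("x86-style register bits: ", register_field_bits(8, 2))
print("RISC-style register bits:", register_field_bits(32, 3))

CACHE_BYTES = 16 * 1024                             # assumed ICache size
print(instructions_per_cache(CACHE_BYTES, 3.0))     # assumed average x86 length
print(instructions_per_cache(CACHE_BYTES, 4.0))     # fixed 4-byte RISC encoding
```

Going from 8 to 32 registers costs two extra bits per operand, and that has to come from somewhere in the encoding. The catch, of course, is that the denser encoding may need more instructions (spills) to do the same work, so instructions per cache is not the whole story.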
Amen! Intel engineers are not stupid. They removed these features for good reasons. Intel has a policy that a 1% area increase must result in at least 1% performance increase. This was obviously not the case with these structures.
Frankly, I am stunned at what Intel has been able to do with the x86 ISA.
--