Emergence of SMT
yellow writes "SMT, or Simultaneous Multithreading, is a concept that is rapidly gaining adherence in the microprocessor area. It essentially allows for a single processor with multi-processor capabilities in both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism). When comparing SMT vs. dual or multiprocessor performance data, it is important to compare apples to apples, and understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup. This is the discussion topic of a new feature on HWC."
Slashdot has been trolled again.
The processor world right now is 99% driven by embedded systems. You need to put multiple cores on the same die to boost embedded performance since there's not enough room for multiple processors. Now that PC's are dead we're seeing a spike in SMT press releases even though the technology has been floating around forever.
The fact that Intel's last processor release was a "mobile processor", Intel's future x86 vapormap includes pure SMT chips, and Compaq's future vapormap includes an SMT alpha shows how important that size reduction is.
--
I have read several articles about both Sun's and Intel's SMT tech, and they all say it only adds about 10% to the die size... Email me if you would like links.
> SMT? Blow it out your ass.
Rather unnecessary, don't you think? I don't think you would be quite so hostile to SMT if you had a better understanding of it.
Time to do some informing...
> Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?
This statement doesn't make much sense. I think you mean that a CPU which spent the die area on additional functional units as opposed to SMT support could achieve a greater MIPS value. This is true, but only for theoretical MIPS. Simply adding more functional units to modern chips would *not* improve actual performance. Explanation follows...
> The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.
So you are stating that implementing SMT comes at a cost in die area? Of course it does, but the important point is that using that die area instead to add more conventional execution units would *not* increase the performance of the processor. Why, you ask? There is a limited amount of instruction level parallelism available in a single thread of program execution. Current wide-issue superscalars get something on the order of 2.3 dispatches per clock, despite the fact that they have the *capability* to issue far more. The processor simply cannot find enough independent instructions to keep its functional units full. If memory serves, the Athlon is a 9-issue core. You could add functional units up to 12-issue or more, but your actual dispatch rate would still be around 2-3 per clock. While your theoretical performance would increase, your actual performance would remain stagnant.
So, current software does not exhibit enough parallelism to keep the functional units in even current processors busy. SMT proposes to increase available parallelism by issuing instructions from *multiple* threads at once. Instructions from different threads are guaranteed to be independent, so if you have n threads running at once, your number of available instructions for dispatch each clock is improved about n times. Of course, this method has a cost in complexity and area -- now the CPU has to have knowledge of threads, and keep a process table on die. However, provided many threads are run at once, this *greatly* increases the utilization of the processor's resources, and thus the performance of the part.
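To put rough numbers on that argument, here is a toy issue model in Python. It is only a sketch under loud assumptions: an unlimited instruction window, one-cycle latencies, no caches, and a made-up workload generator (`make_thread`) whose per-thread parallelism is fixed. None of it describes a real chip.

```python
from typing import List, Optional

# Toy model, not a real pipeline: a thread is a list where entry i holds the
# index of the earlier instruction that instruction i depends on (or None).
Thread = List[Optional[int]]


def make_thread(length: int, chains: int) -> Thread:
    """Hypothetical workload: `chains` interleaved dependency chains, so at
    most `chains` instructions of this thread are ready in any one cycle."""
    return [i - chains if i >= chains else None for i in range(length)]


def simulate(threads: List[Thread], width: int) -> float:
    """Issue up to `width` ready instructions per cycle from any thread and
    return the average number issued per cycle."""
    issued_at = [[None] * len(t) for t in threads]
    total = sum(len(t) for t in threads)
    remaining, cycle = total, 0
    while remaining:
        issued = 0
        for tid, thread in enumerate(threads):
            for i, dep in enumerate(thread):
                if issued == width:
                    break
                if issued_at[tid][i] is not None:
                    continue  # already issued
                producer = None if dep is None else issued_at[tid][dep]
                # Ready if there is no producer, or it issued in an earlier cycle.
                if dep is None or (producer is not None and producer < cycle):
                    issued_at[tid][i] = cycle
                    issued += 1
        remaining -= issued
        cycle += 1
    return total / cycle


if __name__ == "__main__":
    one = make_thread(12, 2)
    print(simulate([one], width=4))        # single thread, 4-wide
    print(simulate([one], width=8))        # same thread, wider core
    print(simulate([one, one], width=4))   # two threads, 4-wide
```

With a thread that only ever has two independent instructions ready, the 4-wide and 8-wide cores both average 2.0 per cycle, while feeding the 4-wide core two such threads gets it to 4.0. Widening the core buys nothing here; adding a second instruction stream does.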
> Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.
...this is a valid point. SMT particularly increases the burden on instruction fetch and cache, since it is pulling from several different streams at once. However, there are methods that can somewhat compensate for the contention of resources introduced. Now, you have multiple threads available at all times. So, when one thread stalls on a cache miss, the processor can dispatch a different thread to run while the cache miss on the first is being serviced. This effectively hides the latency of the cache miss since the processor is able to do useful work during the service. You see, it's all about keeping those functional units busy.
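A back-of-the-envelope version of the latency-hiding point, as a Python sketch. The assumptions are mine and generous: every thread alternates a fixed burst of work with a fixed-length miss, switching threads costs nothing, and outstanding misses overlap perfectly. The cycle counts in the example are invented for illustration.

```python
def utilization(threads: int, run_cycles: int, miss_cycles: int) -> float:
    """Fraction of cycles the core does useful work, assuming every thread
    alternates `run_cycles` of work with a `miss_cycles` stall, switches are
    free, and misses from different threads overlap perfectly."""
    if threads < 1 or run_cycles < 1 or miss_cycles < 0:
        raise ValueError("need threads >= 1, run_cycles >= 1, miss_cycles >= 0")
    return min(1.0, threads * run_cycles / (run_cycles + miss_cycles))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, utilization(n, run_cycles=20, miss_cycles=60))
```

With 20 cycles of work per 60-cycle miss, one thread keeps the core busy a quarter of the time, two threads half the time, and four threads saturate it. Past that point extra threads add nothing, and in a real part they would be fighting over the same cache.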
> So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..
If you believe this, then you should be pro-SMT. SMT doesn't address increasing potential instructions performed per second. Instead, it is an attempt to close the gap between *actual* performance and *theoretical* performance by keeping more of your processor busy.
>you have to question the thinking behind such a modification.
How about the TLA Lookup Archive?
Btw, a similar solution was implemented in Atari's Jaguar 64-bit console: the RISC processors (two of them) both had 64 registers, divided into two banks of 32 registers each, with a special instruction for switching banks. This was very useful in the Jaguar architecture, which depended heavily on fast interrupt handling (hardware like the blitter sent back an interrupt request as soon as it was finished with its task). You simply let your main thread get one bank and your interrupts share the other one. Never had to push/pop registers on the stack.
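For anyone who hasn't seen banked registers, the idea can be sketched in a few lines of Python. This is a toy model only: the class and function names are made up for illustration and do not correspond to actual Jaguar instructions.

```python
from typing import Callable, List


class BankedRegisterFile:
    """Toy model: several register banks, only one visible at a time."""

    def __init__(self, banks: int = 2, regs_per_bank: int = 32) -> None:
        self.banks: List[List[int]] = [[0] * regs_per_bank for _ in range(banks)]
        self.active = 0  # index of the bank the running code sees

    def switch_bank(self, bank: int) -> None:
        # Stands in for the single "switch banks" instruction.
        self.active = bank

    def read(self, reg: int) -> int:
        return self.banks[self.active][reg]

    def write(self, reg: int, value: int) -> None:
        self.banks[self.active][reg] = value


def run_interrupt(rf: BankedRegisterFile,
                  handler: Callable[[BankedRegisterFile], None]) -> None:
    """Run a handler in bank 1; the main thread's bank is never touched,
    so nothing has to be pushed to or popped from a stack."""
    previous = rf.active
    rf.switch_bank(1)
    try:
        handler(rf)
    finally:
        rf.switch_bank(previous)


if __name__ == "__main__":
    rf = BankedRegisterFile()
    rf.write(3, 42)                               # main thread uses r3
    run_interrupt(rf, lambda r: r.write(3, 99))   # handler also uses r3
    print(rf.read(3))                             # main thread's r3 is intact
```

The handler scribbles on its own copy of r3, and the main thread's value survives without any save/restore code.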
Optimization is all about turning compute-bound problems into I/O bound problems. Most processors are fast enough these days that they're I/O bound. Particularly if you have an OS that tends to keep L1 completely thrashed. (I recall seeing numbers which showed that Win9x tends to keep L1 completely hosed, whereas WinNT and Linux do not.)
Note that I/O bound doesn't necessarily mean bandwidth starved -- latency is also an issue with many typical tasks. Notice how RAMBUS machines tend to perform fairly slowly on many non-latency-tolerant tasks. It's not for lack of bandwidth.
What's worse is that we've passed the sweet spot in cache sizes such that L1 cache sizes are going to stop increasing and start decreasing again, sadly. (Transport delay in getting bits from the far side of L1 is the limiter, I understand.)
Now the statement that most machines spend their time waiting for disk, I don't think that's quite true. That's maybe true under Windows (it certainly seemed to be last time I used it regularly, even with gobs of RAM handy), but it's certainly not true under Linux or Solaris. I almost never hear my HD run under normal circumstances. Even when it does run, it's not chugging incessantly. It's called "having enough RAM."
Even with enough RAM, though, PC133 SDRAM is quite a bit slower than L2, which is noticeably slower than L1. Anything that doesn't stay in L1 most of the time is going to quickly bottleneck on the other levels of the memory hierarchy. Not much you can do about it, either.
--Joe--
Program Intellivision!
That's true only if you have sufficient hardware registers (not to be confused with architectural registers), and your tasks aren't bottlenecked on memory. If you have truly independent tasks, then each will effectively see half as much cache at all levels of the hierarchy. This can get especially painful in L1I -- if you can't keep the CPU fed from both streams, oops! And, large register files can be a speed limiter in the architecture. (On the plus side, though, the hardware register files can be distinct between the threads, so it's not too bad. As I understand it, it's the unified register files that are the real problem.)
One of the main attractive things I see in SMT is that you can effectively make the pipeline deeper on the architecture while completely hiding it. This is important on VLIW-style architectures that have an exposed pipeline. To make it deeper, you need to somehow hide the fact that it's deeper than the code thinks it is. One way is to add interlocks (gradually making stages of the pipeline protected, rather than exposed). Another is to interleave multiple threads, so that each thread sees the pipeline length it expects, but the actual pipeline is some factor larger.
Of course, in this VLIW world, things can get tricky outside the CPU. Because you lack superscalar issue (that's what VLIW's about), the issue of stalls becomes a problem. "One-stall-all-stall" is an oft-mentioned SMT VLIW technique, and with it, you really need to make sure you aren't bottlenecked on memory before you go down the SMT path. "One-stall-all-stall" means if one SMT thread stalls, all threads stall... As I understand it, it's the "cheap" way to maintain the VLIW state in an SMT VLIW machine, but it also amplifies any memory system bottlenecks you might have.
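To see how quickly "one-stall-all-stall" amplifies a memory bottleneck, here is a small Python sketch. It assumes each thread stalls independently with the same probability on any given cycle, which is a simplification of mine, and the 10% figure is invented.

```python
def all_running_fraction(p_stall: float, threads: int) -> float:
    """Fraction of cycles in which no thread is stalled, i.e. the only cycles
    a one-stall-all-stall machine makes progress. Assumes each thread stalls
    independently with probability `p_stall` per cycle."""
    if not 0.0 <= p_stall <= 1.0 or threads < 1:
        raise ValueError("need 0 <= p_stall <= 1 and threads >= 1")
    return (1.0 - p_stall) ** threads


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(all_running_fraction(0.10, n), 4))
```

Under those assumptions a thread that is stalled 10% of the time on its own leaves a 4-thread machine moving only about 66% of the time, and an 8-thread machine less than half the time.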
--Joe "Mr. VLIW"--
Actually, the whole SMT thing is so new that people aren't agreeing on what exactly it means.
Some think it means multiple cores on one chip, a la the new POWER4 from IBM. Apparently IBM doesn't think so, since they don't call that SMT themselves. If you go by that definition, then yes, you have just increased CPU power without increasing the overall bandwidth of the system, and that gives you sucky performance.
However, the other definition of SMT is a single core capable of keeping multiple contexts and switching between them without software help. To software they look (almost) like two regular CPU's, and so e.g. linux would assign two processes or threads to each core. The idea is that the core will execute the instructions from virtual CPU #1 until it hits a cache miss. Then it switches to executing instructions from virtual CPU #2, until it hits a cache miss... and so on.
In the second scenario you have a CPU with the same MIPS as before, but suddenly you are not wasting as much CPU power waiting. In the context of the PC industry that means you can get away with smaller caches and memory with higher latency (hello RAMBUS).
Finally! A year of moderation! Ready for 2019?
Some would argue that EPIC (the ISA for IA64 chips like Itanium) is a fundamental change in hardware design - it combined VLIW (an old idea) with explicit encoding of inter-opcode dependency, among other things. That kind of explicit "helper" data for internal ILP engines could end up proving valuable, if the compiler technology can keep up.
The BOTTLENECK is the memory bus. Refilling the cache from a hose that's 1/10th the speed of the processor. That's why CISC lasted so long in the first place, and is still with us today. CISC has variable length instructions. If you can express an instruction in 8 bits, you do so. 16 for the more complex ones, 32 bits for the really complex ones. So when you're sucking data into the cache 32 bits at a time, you can get 2 or 3 instructions in a 32 bit mouthful. (Or, in the case of pentiums, 64 bits to feed 2 cores, but the principle's the same.) You're optimizing for the real bottleneck with compressed instructions.
The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts. But Sparc, PowerPC, and even Alpha haven't displaced Intel because the real bottleneck is the memory bus, and bigger instructions aren't necessarily a win. (That and Intel translates Cisc to Risc inside the cache, and pipelines stuff.)
VLIW as Itanium picked it up sucks so badly because the real bottleneck is sucking data from main memory, and now they want 192 bits of it per clock! For only three instructions, and on average at least one will probably be a NOP. Crusoe has a MUCH better idea, sucking compressed CISC instructions in and converting them to VLIW in the cache (like Pentium and friends do for CISC to RISC).
This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs. They don't deal with the real problem, the memory bus de-optimization by reverting to full-sized instructions all the time.
Rob
The idea here is that if your program has many threads, you can run them all at the same time. Since each thread does not depend on the results of another thread's instructions (at least in the common case, if they're fighting over a lock, that's different), then there is no reason why they shouldn't be able to run at the same time. There are almost no data dependencies between threads, hence a whole lot of blindingly obvious ILP that the processor can take advantage of. This doesn't cause your programs to run any faster than a processor that only runs a single thread, but it allows more threads to run "slower" at the same time.
/ \
\ / ASCII ribbon campaign for peace
x
/ \
This is perhaps one of the most useful sites in today's world of technobabble: www.acronymfinder.com. It lists 19 different meanings for "SMT", none of which are Simultaneous MultiThreading! :-)
A nice (and quite technical) article on the SMT for the Alpha can be found here
It is a 3-part article, click on "Alpha EV8 (Part x): Simultaneous Multi-Threat".
One of the things I like about SMT is that, as it is quite "cheap", it has a chance to spread quite effectively.
And once there is a huge base of SMT-ready CPUs, we will see AT LAST more software which takes advantage of parallelism (be it SMT or SMP).
IMHO, SMT is a load. Modern microprocessors are mostly cache-starved. SMT puts two processors on the wrong side of the L1$, aggravating the cache bandwidth problem. Worse, the two processors in SMT degrade referential locality, further degrading the performance of the cache.
You overlook a couple of very important factors.
First of all, it would cost you almost no extra silicon or latency to have duplicate L1 caches, and to add a selection bit to the addresses sent out on memory operations.
Secondly, technologies like SMT help _save_ you when you have a cache miss, because you still have an instruction stream that can execute while one thread's waiting for data.
You might want to doublecheck the terms you're using:
Multiple pipes are a relatively old idea. Ditto instruction-level parallelism, which is one of the analytical quantities used to judge how well multiple pipes will work in a given situation. SMT is a relatively new idea that lets you easily boost the instruction-level parallelism, which in turn makes scheduling and issuing instructions *much* easier.
One of my favorite computer architectures is the CDC 6000 series. It had a Peripheral Processor (PP) that did all of the system I/O. The main CPU crunched numbers while the PP dealt with the outside world. The cool thing about the design of the PP was that it appeared to be 10 independent processors, even though it only had one ALU, instruction decoder etc. This was accomplished by a "barrel" of 10 sets of CPU registers and memory banks. The PP would rotate the barrel every time an instruction was fetched and executed, turning one physical CPU into 10 virtual CPUs. This meant that the PP could simultaneously execute 10 different programs without having 10 hardware CPUs. I've often wished there was a microprocessor that could do this. It would be great for embedded real-time systems and I/O controllers. Each I/O device and/or subsystem could have its own virtual CPU, that would never get swiped by other tasks or I/O interrupts.
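The barrel idea is easy to sketch in Python. This is a toy under my own assumptions: each "instruction" is just a function on an accumulator, the `Context` fields are invented, and nothing here models the real CDC hardware beyond the round-robin rotation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Context:
    """One slot of the barrel: private program, program counter, accumulator."""
    program: List[Callable[[int], int]]  # each step maps accumulator -> accumulator
    pc: int = 0
    acc: int = 0


def run_barrel(contexts: List[Context], rotations: int) -> List[int]:
    """Rotate through the contexts, executing one instruction per slot per
    rotation on the single shared execution unit. Returns the slot order."""
    trace: List[int] = []
    for _ in range(rotations):
        for slot, ctx in enumerate(contexts):
            if ctx.pc < len(ctx.program):
                ctx.acc = ctx.program[ctx.pc](ctx.acc)
                ctx.pc += 1
                trace.append(slot)
    return trace


if __name__ == "__main__":
    # Ten virtual CPUs; slot k runs a 3-instruction program that adds k each step.
    barrel = [Context(program=[lambda a, k=k: a + k] * 3) for k in range(10)]
    order = run_barrel(barrel, rotations=3)
    print(order[:10])                 # slots take turns in fixed order
    print([c.acc for c in barrel])    # every program ran to completion
```

Every slot gets exactly one instruction per rotation no matter what the others are doing, which is the property that makes the scheme attractive for real-time I/O work.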
Mea navis aericumbens anguillis abundat
Long live Big Blue!
(jfb)
To spur "enterprise Linux," Big Bang, the distributed two-phase commit.
True enough. But when is EV8 scheduled to launch? I was under the impression that POWER4 and EV7 were going to appear more or less at the same time, with EV8 much later.
And while I don't doubt that an eight-way POWER4 unit will be as terrifyingly expensive as it is fast, would an equivalent EV8 system be any cheaper? Either way, it'll be interesting at the high end again, now that Alpha finally has competition.
Peace,
(jfb)
That's more or less what I figured. It will be an interesting couple of years.
(jfb)
IIRC, it was even more limited than that. It was certain filters that were SMP enabled. There would be little value and great expense in SMP-enabling the basic UI.
IIRC further, there have long been available daughter boards for various Mac platforms to do filter acceleration in Photoshop. I think these have ranged from general purpose CPUs to DSPs. Earliest one I can remember was for JPEG compression, although I think more commonly they were used for stuff like gaussian blurs. I can remember working at prepress shop on an unaccelerated '030 and waiting like 5 minutes to do filters on a 30 MB file, so they had some value.
Yup, kinda reminds me of when people were getting all excited about those dual CPU boxes that Apple was selling to take attention away from the megahertz gap vs. x86. Yeah, never mind that hardly anything at all can utilize more than one processor under MacOS, it's got two CPUs! w00t!
Cheers,
You're not wrong, which is why I said "hardly anything at all" can use more than one CPU under MacOS, instead of saying that nothing can. The amount of software that can, though, is extremely limited.
Cheers,
While in many senses you are right, I think you are pointing to the wrong issue. It is not something inherent in the x86 arch that causes problems in scaling; it is mostly Intel's SMP bus design. Having all the CPUs share a single bus between each other and system memory is the bottleneck. I mean, look at the Athlon: it isn't riding on an Intel-designed bus, it rides on the DEC-designed EV6 bus, originally made for the Alpha.
While Beowulf is a nifty technology, it does not solve all the scaling problems as you might think. Beowulf clusters are only useful for a specific subset of available problems, stuff that can be easily split up and sent to many, semi-independent, processing nodes. Beowulf clusters are generally connected together with 1Gb or 100Mb Ethernet which does not have high bandwidth or low latency compared to the CPU-Memory bus in even the cheapest computers. I would take a single 128 CPU box over 64 dual proc boxes connected via 1Gb Ethernet (or even Myrinet) any day.
-- Remember: Wherever you go, there you are!
I independently invented what people are now calling SMT in 1994.
I am soooooo cooool.
I'm much more interested in enhanced cache ideas like IRAM that seek to enhance performance by putting a very large L2$ on chip by combining the discrete logic circuits of the CPU and static L1$ with the capacitor cell circuits of DRAM.
Crispin
----
Crispin Cowan, Ph.D.
Chief Research Scientist, WireX Communications, Inc.
Immunix: Security Hardened Linux Distribution
The world needs a central registry for TLAs, to ensure confusing multiple definitions can't be used in the same domain.
Maybe ICANN would take the job on ?
Stack overflow.
Actually, the tradeoff between SMT and single chip multiprocessors is not quite so simple. By adding SMT to a chip you increase the complexity of the design. This means you have to spend more time designing it, and more time testing it.
With a single chip multiprocessor (SCMP) you can design a smaller, simpler processor and spend more time performance tuning it. Once you have a single core working and tested, you can stamp it out multiple times. The complexity then gets put in the glue logic which is used to communicate between cores and share caches between them. This problem is well understood from the design of conventional multiprocessors.
Basically, since SMT is new, the design takes longer. SCMP relies on well-understood technologies, and potentially could be put into production faster.
Well, where do we start...
Multi-core CPUs like IBM's are SMP-on-a-chip, which is not the same as SMT by any stretch.
SMT, because more of the functional units on the chip are staying active at one time, increases heat and power consumption just about as much as SMP-on-a-chip, though it may be marginally better because the core-level overhead won't be present.
"SMP-aware" applications? Yeah, you need something like that with the Mac and its cooperative multitasking and wacky thread model. However, with any normal preemptive multitasking, thread supporting OS, introducing threading into a program makes it "SMP-aware" by default (though you may find new/different bugs on an SMT or SMP system).
The only thing I can think of that would be an "SMT optimization" at the application programming level would be threading any floating point calculations separately from integer calculations, thus allowing the FP units to be running independently from the rest of the application.
Methinks Vince needs to bone up a little...
Some more links:
University of Washington SMT info; this is also linked to from the UMass link previously posted
Look at some more Alpha specifics from the source
I believe these Real World guys were quoted in the last Slashdot SMT reference (and look, Hemos posted that one too... you'd think they'd read the links...)
Disclaimer: I worked for Compaq (though not the DEC side) on porting an OS to Alpha a couple years ago and having to be aware of SMT in EV8 coming down the pike.
> The BOTTLENECK is the memory bus.
OK.
> That's why CISC lasted so long in the first place, and is still with us today.
No way. CISC (i.e. x86) lasted so long because of duopoly action and backward compatibility. In fact, like you said, CISC is dead because ever since Pentiums, x86 chips have been RISC on the inside and CISC to the outside world (to varying degrees).
> The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts.
Or n at a time. Any OOO RISC processor these days worth its snot decodes 4 ops/clock, some are at 6 or 8. (If it can't retire that fast, it doesn't really matter...)
> Alpha haven't displaced Intel because the real bottleneck is the memory bus
Really? For scientific computing, which is where you have really big datasets and memory bandwidth is key? I don't think you see x86 there very much. You see DEC, IBM, Sun and HP. Who are all, surprise, surprise, RISC-based hardware vendors. Many RISC chips (Alpha, POWER, PowerPC) have long since passed x86 in sheer performance, especially on FP. Intel has definitely won in price/performance, but I would argue that's more due to volume than anything else.
> This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs.
Umm, Alpha EV8 uses SMT. Not VLIW. Itanium is VLIW-like. Doesn't use SMT. Example no worky.
Not saying that the concept is wrong, that SMT as a concept might alleviate some of the performance issues with superfluous instructions in a VLIW instruction stream. But that's sort of the point of VLIW, to let the compiler, rather than OOO hardware, figure out how to best use the available functional units as much as possible. It puts NOPs in to keep the instruction stream balanced so the decoder can work in a predictable way just like in RISC.
Say what? Most applications can't fill a deep pipe, even out-of-order and with aggressive prefetching. The ways this stuff wins include having two (or maybe more!) instruction streams to crunch, and switching away from the one that's now blocking on a memory access. Prefetch on the other surely completed already ...
The P4 is a good example of a pipeline that's too long.
And by the way, why has this taken so long to arrive? It's still not something I can purchase yet, and I first heard of it back in 1992. There's something fishy.
well, the best(*) a 100% parallel system could do in that configuration would be to equal the faster chip. 2 x 500 = 1 x 1000 !
Now if the article had said that a 2 x 800 was just barely as fast as a 1 x 1000, under an OS with the capability to use both CPUs, that would have been noteworthy. But this must have been one of the least informative comparisons I've heard in a long time.
(*) disregarding memory bandwidth, which depends on busses and what not.
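The usual way to put numbers on this is Amdahl's law, which I'm bringing in here as my own framing (the comment above doesn't name it). A minimal Python sketch, ignoring memory bandwidth and SMP overhead entirely, with `parallel_fraction` being whatever share of the work can actually be split across CPUs:

```python
def amdahl_speedup(parallel_fraction: float, cpus: int) -> float:
    """Amdahl's law: speedup over one CPU when only `parallel_fraction`
    of the work can be spread across `cpus` processors."""
    if not 0.0 <= parallel_fraction <= 1.0 or cpus < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and cpus >= 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cpus)


def effective_mhz(clock_mhz: float, cpus: int, parallel_fraction: float) -> float:
    """Clock of the single CPU that would match the SMP box on this workload
    (idealized: same architecture, no bus contention, no OS overhead)."""
    return clock_mhz * amdahl_speedup(parallel_fraction, cpus)


if __name__ == "__main__":
    for p in (0.0, 0.4, 0.5, 1.0):
        print(p, round(effective_mhz(500, 2, p), 1), round(effective_mhz(800, 2, p), 1))
```

In this idealized model a dual 500 only reaches 1000 MHz-equivalent when the workload is 100% parallel, and a dual 800 needs about 40% of the work to be parallel just to tie a single 1000.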
Don't forget that under an SMP-aware OS running SMP-optimised processes, 2 x 933MHz can be faster and is much cheaper than a single 1.5GHz chip...
Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?
The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.
Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.
So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..
It's like putting a 700 cubic inch supercharged W16 engine constructed from 3 straight-8 blocks into a VW Kombi van.
Sure, it'll theoretically go pretty fast, but when it's parked by the side of the road 340 days out of the year and only ever driven by a bunch of hippies who are too stoned to see the road properly at 20 kmph, you have to question the thinking behind such a modification.
I gots ta ding a ding dang my dang a long ling long
- It includes 2 tightly coupled processor units
In SMT, you have multiple threads of execution (e.g. multiple program counters) feeding one CPU. Now with a superscalar/OO system things are different - you can keep partially done stuff sitting in reservation stations (but you need twice as many of them, which may cost you in cycle time - which is what marketing is interested in). There's still probably a lot of serializing things around (like L2 miss). On the other hand, one place you can win is with the main memory subsystem - these days you can build systems with multiple outstanding memory transactions (Rambus - no matter how people dislike it - is particularly good at potentially having many concurrent senses running in parallel) - but to be useful you need to either move the main memory interface onto the CPU or get rid of tacky serializing buses like slot-1 (it's much better to put the memory interface on the CPU in this case because you can run the internal memory interface at a higher clock rate and be exposed to more parallelism).
Seriously though, threading like that is kind of at odds with today's very long pipelines (basically the cost of a thread switch can be very high if you have to fill a deep pipe). With heavily out-of-order systems this can be less of a problem... but you're still stuck with the problem that if you're using a larger percentage of the CPU's real clocks then you're going to put more pressure on shared resources like caches and TLBs - larger L1s/TLBs are going to potentially hit CPU cycle time and of course these days L2 can take a large percentage of your die size (after all, the goal here is to get more useful clocks/area).
it does however list seaside music theatre. ummm hmmm. besides everybody already knows it means surface mount anyway, not this spurious muffin threbbing you guys are talking about.
> understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup.
This is the expected result under ANY operating system. Multiprocessing only helps for problems that lend themselves to highly parallel processing (exploring independent address spaces of cryptography problems, for example), which many don't, and in any case incurs overhead.
--
"that's not encryption - it's a new perl script that I'm working on..." - from some Matrix parody
For your bedtime reading, y'all.
--Seen
"I used to be a dilettante. Then I thought I'd try something else for a while."
x+x is not greater than x*2. but it does equal 2x :)
Free Techno/Jazz/DNB/MI Music by guys obsessed with monkeys!
Read up on MTA, it's cool. Supports 128 active threads per processor.
For those with a technical bent who were disappointed by the lack of information on SMT in the linked article, here are some better resources:
Introduction to Simultaneous Multi-threading from UMass.
Quick Quiz on SMT.
Caches for Simultaneous Multithreaded Processors: An Introduction
My feelings shall be vindicated when SMP Athlon machines become readily available. Their comparatively minor bandwidth advantage will let them blow similarly-clocked Intel boxes out of the water.
Personally, I feel that the best way to scale x86 to supercomputing levels is through clustering, such as is offered by the venerable Beowulf for GNU/Linux. GNU/Linux, for better or worse, is continuing to grow in popularity, and I would like to see commercial software vendors try releasing Beowulf-enabled software for Linux. Imagine being able to buy Oracle for Beowulf! Okay, poor example; Oracle is memory-intensive rather than CPU-intensive, and an RDBMS is one application which is so dependent on a fast disk and good caching that the advantages pale in comparison to the potential problems. What would really be cool are Beowulf ports of statistical analysis and 3D-rendering software. Oooh, yeah... after all, The Matrix and Titanic have both proven the effectiveness of free x86 Unix-workalikes in render farms... I believe that those two movies respectively used FreeBSD and GNU/Linux.
--
--
I like to watch.
I do. I've been working on it for several years now. The project is alive and well, and we fully intend to deliver an SMT processor. Of course delivery is still a couple years off (designing processors takes a really long time), but we'll get there eventually.
BTW, for those of you interested in SMT, we are hiring.
P.S. Great response to the nay-sayer. Wish I had written it. :-)
CVS is teh suck. Use Vesta instead.
Yes it is. Since cost is not directly proportional to speed, you can buy two processors at greater than half the speed of one. If we assume adding a processor speeds up your application by 50% (a bad case), we can buy two 666MHz PIIIs at $115 each (from pricewatch), instead of one 1GHz PIII at $240, and get equivalent performance (666+333=1000). Then you also have some money left for a more expensive motherboard. 50% is low though, depending on application you should be able to get it to 75-80%. If you also add overclocking into the picture, you can save a lot of money (I have two Celeron 300A running at 450MHz each, this would be equivalent to something like a PIII at 800MHz, which didn't even exist when I bought them).
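The arithmetic above is easy to replay in Python. The prices and clock speeds are the ones quoted in this comment; the "second CPU adds X%" scaling model is the comment's own simplification, and motherboard cost is left out, so treat this as a sketch rather than a buying guide.

```python
def smp_equivalent_mhz(clock_mhz: float, second_cpu_gain: float) -> float:
    """Single-CPU-equivalent clock of a dual box, assuming the second CPU
    adds `second_cpu_gain` (0.5 = 50%) of one CPU's throughput."""
    return clock_mhz * (1.0 + second_cpu_gain)


def mhz_per_dollar(equivalent_mhz: float, total_cost: float) -> float:
    """Crude value metric: equivalent MHz bought per dollar of CPU."""
    return equivalent_mhz / total_cost


if __name__ == "__main__":
    single = mhz_per_dollar(1000, 240)               # one 1GHz PIII at $240
    for gain in (0.50, 0.75, 0.80):
        dual_mhz = smp_equivalent_mhz(666, gain)     # two 666MHz PIIIs at $115 each
        dual = mhz_per_dollar(dual_mhz, 2 * 115)
        print(gain, dual_mhz, round(dual, 2), round(single, 2))
```

Even at the pessimistic 50% figure the pair comes out at 999 MHz-equivalent for $230 against 1000 MHz for $240, and the gap widens if the application scales better than that.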
1GHz Thunderbirds seem to be around $165, while 650MHz Thunderbirds are ~$65, so that would make SMP even more cost-effective. It's a bit difficult to compare to combos, since there aren't any combos for SMP Thunderbirds (or any motherboards at all yet). I look forward to buying an SMP Athlon system, but for now they don't exist.
IBM's Power 4 Architecture was designed to exploit SMT. They're looking to leapfrog Sun and Dec in server performance.
If they're anything like me, they keep a Win box around for IE & games and use VNC or some such to browse, but a proper computer around for doing proper work.
.oO0Oo.
There are places where the networks are not touching, and there are places where they are - Boeing's Lori Gunter
So, you're just wrong. Try reading something about what you're commenting on first.
The point of SMT is that if one thread gets a cache stall because it has to hit main memory, then another execution thread has its instructions loaded into the CPU. SMT is actually one way to help reduce the CPU<->memory bottleneck.
I'm a little confused. Don't some processors use register windows to speed up context switching? How is this any different?
PS I'm not trolling, I really don't understand how this is new.
pornking
Let's leave 98/Me out of the equation for now.
Just how is a dual 500MHz P3 ever going to beat a 1GHz P3? You still only have 1000000000 clock cycles per second. Your processor bus isn't any faster; in fact, it's got an extra processor on it, so there's increased contention. You have a little bit more L2 cache to play with, but your memory bandwidth is the same, the overhead from SMP guarantees that you'll never efficiently use more than about 900000000 of those clock cycles in a second, and all your drivers have to play it SMP-safe, which is another 5-10% speed hit.
x+x is not greater than x*2.
SMP helps partition applications from each other so even if one app is hogging a cpu, other stuff will still give decent response times. But that's about it - unless you need to push the bleeding edge (and spend >5k on a box), SMP is not cost effective.
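The cycle budget above can be written out as a quick sketch. The 90% SMP efficiency and the 5-10% driver hit are the figures from this comment; treating them as multiplicative is an assumption made here, and the function name is made up.

```python
def usable_cycles(clock_hz, n_cpus, smp_efficiency, driver_penalty):
    """Cycles per second left for real work after SMP overhead and
    the SMP-safe driver penalty (assumed to stack multiplicatively)."""
    raw = clock_hz * n_cpus
    return raw * smp_efficiency * (1.0 - driver_penalty)


single = usable_cycles(1_000_000_000, 1, 1.0, 0.0)      # one 1GHz CPU, no SMP cost
dual_best = usable_cycles(500_000_000, 2, 0.9, 0.05)    # dual 500MHz, 5% driver hit
dual_worst = usable_cycles(500_000_000, 2, 0.9, 0.10)   # dual 500MHz, 10% driver hit

print(f"single 1GHz:   {single / 1e6:.0f}M cycles/s")
print(f"dual 500MHz:   {dual_worst / 1e6:.0f}M to {dual_best / 1e6:.0f}M cycles/s")
```

Note this only compares equal total clock. It says nothing about two CPUs that each run at more than half the speed of the single one, which is where the price argument comes in.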
Their MAJC5200 processor already does SMT, although they call it something like spatial computing. Check it out here
IBM demonstrated a multi-threaded POWER CPU destined for their AS/400 series of workstations and servers at the ISSCC conference back in 1998. The synopsis of their presentation is available here, as paper 15.3.
To my knowledge, this chip is either now in use, or very close to being put in an AS/400 or iSeries box.
Hardware Central: Nice site layout. Pleasant writing. Very lightweight.
Simultaneous Marijuana Trafficking?
Back in the early-mid 70's people were taking two 6800 CPUs, wiring them out-of-phase, and essentially building tightly-coupled SMP systems. We didn't really have threading in those days, or else the correct OS could have made such a system SMT, instead.
But someone else was designing another CPU, called the COPS. They looked at this well-known out-of-phase 6800 technique, and realized that their design basically used clock-up for fetch/decode, and clock-down for execute. During each half-cycle, half the CPU was sitting idle.
So they doubled the registers, using the fetch/decode unit with one register-set during clock-down and the other register-set during clock-up. The execute unit worked in the converse fashion, alternating register sets. A dual CPU on a single chip for the cost of a second register set and a little control/arbitration logic. They didn't attempt any sophisticated contention-prevention, leaving that up to the software. This was mid-late 70's.
With more modern software, COPS might have been the first SMT. I don't know the timeframe of the CDC6000, whether it beats mid-late 70's or not.
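To make the alternating-register-set idea concrete, here is a toy simulation of the scheme as described above. It is only a sketch of this description, not of the actual COPS hardware: two contexts, one fetch/decode unit, one execute unit, and each half-cycle the two units serve opposite contexts. All names are invented for illustration.

```python
def run_interleaved(prog_a, prog_b, full_cycles):
    """Run two instruction streams on one fetch/decode unit and one
    execute unit that swap register sets every half-cycle."""
    contexts = [
        {"prog": prog_a, "pc": 0, "decoded": None, "done": []},
        {"prog": prog_b, "pc": 0, "decoded": None, "done": []},
    ]
    for half in range(2 * full_cycles):
        fetch_ctx = contexts[half % 2]       # clock-up: context 0, clock-down: context 1
        exec_ctx = contexts[(half + 1) % 2]  # execute unit serves the other context

        # Execute whatever that context decoded on the previous half-cycle.
        if exec_ctx["decoded"] is not None:
            exec_ctx["done"].append(exec_ctx["decoded"])
            exec_ctx["decoded"] = None

        # Fetch/decode the next instruction for the other context.
        if fetch_ctx["pc"] < len(fetch_ctx["prog"]):
            fetch_ctx["decoded"] = fetch_ctx["prog"][fetch_ctx["pc"]]
            fetch_ctx["pc"] += 1
    return contexts[0]["done"], contexts[1]["done"]


done_a, done_b = run_interleaved(["a1", "a2", "a3"], ["b1", "b2", "b3"], 4)
print(done_a)
print(done_b)
```

Neither unit sits idle once both streams are running, and each stream still retires one instruction per full clock cycle, which is the "dual CPU for the cost of a second register set" point.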
The living have better things to do than to continue hating the dead.
Cheers...
--
$HOME is where the .*rc is
-- silver_p
The Blue Gene supercomputer uses SMT. Guess they don't just put multiple cores on a single chip :)
I wonder how many copies are actually used in a production environment.
I'd be willing to bet it's vaporware.
[Connection closed by foreign host]
Well, maybe not by the time I get this up. Actually, Win98 and ME don't support dual processors at all, so your second one will just be sitting on the motherboard turned off.
As far as SMT goes, I think it's a good idea (well, obviously, why wouldn't it be). You really can only get so much out of instruction-level parallelism, and I've always thought that splitting CPU time up by thread rather than by instruction parallelism would be a lot more effective.
Rate me on Picture-rate.com
"and dear god does this website suck now." -- CmdrTaco
Actually, every pic on there is 'default'. Right now, my pic is at http://picture-rate.com:8080/hello/viewpic.jsp?system=happy&owner=default&picture=i170, although that could change. I would link directly to it, but chad hasn't put in the ability to select which picture to rate; otherwise I would put that link in my sig.
Rate me on Picture-rate.com
"and dear god does this website suck now." -- CmdrTaco
Neither W2k nor Linux is particularly impressive with 2+ CPUs. Use BeOS if you want to see what a dual processor machine is really capable of speed-wise.
Actually, while the first bit is true... let's face it, for most business applications we do not need faster machines until we have to deal with the bloat of the next round of software from the major vendors.
Businesses could probably do very well on a single standardized set of software for a decade or more for most common functions. Many have done so, as a matter of fact. There are some businesses out there still running Win 3.1 apps.
"It is a greater offense to steal men's labor, than their clothes"
I didn't know that slashdot had been "acquired" by Microshit's PR dept...
-- javaDragon is an instance of JavaDragon.
"I got the impression that SMP was still better".
A processor can be 4-way SMT for only 10% extra silicon cost. SMT makes processors more EFFICIENT.
C//
The POWER4 chip is a very impressive design and a very nice chip. EV8 will be far more efficient, however. Look at the transistor/silicon cost of the chips. POWER4 is a 2-way MPU designed for 4x2 transputer-like grid assemblies. Such a configuration will be insanely fast, but surely quite bleedingly expensive.
C//
An equivalent EV8 system should be less expensive to manufacture; however, IBM could possibly manage better economies of scale and will quite possibly out-market Alpha. Alpha and POWER4 will be competing in the same markets (HPC).
Ah, at long last circuitry catches up with functionality that women have been laying claim to for aeons.
People who say it cannot be done should not interrupt those who are doing it.
- SMT = Society for Music Theory
- SMT = Surface Mount Technology
- SMT = Simultaneous MultiThreading
And (drumroll) of course (http://smt.ucsb.edu/smt-list/smt-main.html)
the Finland Travel Bureau (http://www.smt.fi/)
When will the madness end?
People who say it cannot be done should not interrupt those who are doing it.
This is because Win9x does not support SMP, so even a dual 933MHz will be outperformed by a single 1GHz. A better comparison would be with Win2k or Linux, which will both actually use both CPUs.
Please, this is not a new idea.
Here is an article from DEC (now Compaq) that describes how the Alpha chip does it:
http://www.compaq.com/hpc/ref/ref_alpha_ia64.doc
for word.
and
http://www.compaq.com/hpc/ref/ref_alpha_ia64.pdf
for pdf.
Digital has always tried to do things "the Right Way" and it shows in their products.
Never answer an anonymous letter. - Yogi Berra
Will this help our dental friends using a COBOL program for office management, developed on Xenix and only recently ported to SCO OpenServer? We're using curses to display on terminals over serial lines.
Or how about the supply ordering system that runs on DOS? We actually bought a Celeron-based system just to run this one app and then found that the program wouldn't run because it had a stupid win-modem installed! Forget about using the ink-jet printer too.
The Windows-based version wouldn't load because the installer told us we had to upgrade from Windows ME to Windows 95 or better. We thought seriously about Linux/Wine; that's better, isn't it?
SMT is not going to help anything we use run faster, so why should we spend the money? Why would anyone develop software for this technology? We don't even get Linux distros optimised for our Athlons!
This is just blue smoke and mirrors to try to get the techno-junkies to shell out a bunch of money for a new system that might actually run their apps ten percent faster for four times the cost.
Apocalypse Cancelled, Sorry, No Ticket Refunds
From the article: If the hardware and software portions of SMT come together, then the question of "why do I need a 1.5 GHz processor?" may be answered in very short order
Those guys got it totally wrong. The question of "why do I need a 1.5 GHz processor?" stems from the fact that current processors are more than fast enough for many applications.
This question isn't answered by introducing a system with higher performance. IMHO it might be answered by new, attractive applications that need the better performance (speaker-independent speech recognition?).
C - the footgun of programming languages
> and understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup.
Wow... isn't that amazing? A 1GHz P3 will outperform dual 500MHz P3s in Win98/ME! WOW! One reason may be that Win98/ME doesn't support dual processors? Holy balls, Batman!
Casual Games/Downloads
I thought that the major processor companies had been working with multiple execution pipelines for years now. Doesn't that fall under the category of ILP? - Scott
The article gave the impression that SMT would do well to improve performance in a single-processor environment, but I wasn't clear on how SMT stacked up against SMP. I got the impression that SMP was still better. Say a 1GHz system with SMT vs. a dual-500MHz SMP system: which is going to be more effective? I went from using my dual-433 Celeron system with Win2000 (yes I know, MS is evil...) to testing out someone else's P4 1.5GHz box and I wasn't impressed at all. The same CPU-intensive tasks seemed to run at about the same speed, if not slower. In all honesty, most computer users (I know, not us, but TYPICAL users) would be fine with a moderate (read as 300 to 400 MHz) processor provided they had sufficient memory. I do believe that whoever mentioned making software that is less processor-intensive hit the nail on the head.
If any of this appears incoherent, assume that the writer was drunk.