Ars Dissects POWER5, UltraSparc IV, and Efficeon
Burton Max writes "There's an interesting article here at Ars about the POWER5, UltraSparc IV, and Efficeon CPUs. It's a self-styled "overview of three specific upcoming processors: IBM's POWER5, Sun's UltraSparc IV, and Transmeta's Efficeon. " I found the insights as to Efficeon (successor to Crusoe) to be particularly good (although it paints a sad picture of Transmeta, methinks)."
Too bad they focused too much on Power and Transmeta while paying little time on UltraSparc IV and V and ignored Itanium. Needs a little more balance and it would have been a great read.
I don't drink because I have to, I drink to stop the voices in my head!
Since day 1 they have skirted the benchmark issue, always trying to deflect the question.
Just like that article yesterday on their new chip. Did they ever cite a single benchmark? NO.
The basic performance of your CPU product, as measured by industry standard benchmarks, is essential knowledge.
I was under NDA on the previous gen Transmeta stuff. It was amusing how the other OEMs reacted - it was crap, but nobody could say anything in public.
Why the heck did Sun's offering get thrown in there? For variety? The Efficeons look awful nice to people who want less power-hunger from their computing devices. If all you do is word processing and such, why the heck even use an Intel/AMD chip? Less heat, less power, what is not to love? Now the IBM chips have really piqued my interest, I am a huge fan of IBM's chips, especially in Apple computers (I am a proud owner of a 12" Powerbook).
I hate sigs.
Will show up as _4_ processors to the OS! (2 cores both doing SMT.)
:o)
This means that in a (say) 512 processor box the OS will have to handle 2048 processors efficiently. That's placing a lot of control in the hands of the software designers, and a lot of money in the hands of the companies that license per processor.
On the other hand, UNIX is getting pretty efficnelt at scaling to large systems, perhaps it (and by extension Linux thanks to SGI and IBM) will be able to handle it with no problems. One thread per processor on a desktop system might prove to be quite efficient
Beep beep.
The history of Wintel suggests that top-rated raw CPU performance is not the best predictor of adoption. Compatibility with market-dominating software platforms is a greater determinant of CPU sales. We might hope that advances in compiler design adn flexible cores can help any CPU run x86 code, but there are always the little nts that prevent true compatibility and drive computer buyers toward the dominant platform.
Two wrongs don't make a right, but three lefts do.
It's amusing seeing this. It reflects mostly that Microsoft has finally managed to ship in volume OSs that can do more than one thing at a time. (Bear in mind that most of Microsoft's installed base is still Windows 95/98/ME. Transitioning the customer base to NT/Win2K/XP has gone much more slowly than planned.)
But Microsoft takes the position that if have multiple CPUs, you have to pay more to run their software. So these strange beasts with multiple decoders sharing ALU resources emerge.
Wasn't low power consumption the number 1 benefit that transmeta was looking to provide, so that you could get twice the battery life (or soemthing like that) without sacrificing too much performance. Did Transmeta shoot itself in the foot by letting people think that it was going to provide higher performance chips than the competition.
The main selling point of transmeta was always power consumption, so have they lost their edge in that area? If so, then that would be serious for them, but the article doesn't answer that question.
the author suggests that it's not worth "pissing off Intel" to go with Transmeta. Give me a break. Transmeta is the only thing pushing Intel to make Centrino and other lower-wattage chips. They recognize that anybody in the mobile computing/devices world will seriously consider anything that gives their customers increased battery life and less toasty pockets.
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom
Multiple times while reviewing the Efficion architecture the article's author suggests that the tradeoff of additional storage required for Transmeta's code-morphing approach will easily balance out the power savings from making a simpler CPU. This belies a deep misunderstanding of power consumption in digital systems, as readily evidences by the fact that modern non-Transmeta processers dissipate multiple tens of Watts of power (often nearly 100W) and a full complement of memory (4G, in modern machines) dissipates a few Watts at most.
Also in the article, the author suggests that processors spend most of their time wating on loads, and then argues that since the code-morphing approach means more instruction fetches, the Efficion processor will be spending disproportionatly more time on loads. Then, after this assertion, he admits that he does not know *where* the translated Efficion code is held. Might it be in one-cycle-accessible L1 cache? That point is conveniently sidestepped. He does not understand under what circumstances the profiling takes place, although he regurgitates the sales pitch nicely. He argues that transistors hold the translated code (trying to argue against the transistors-for-software tradeoff) but then does not realize that transistors in memory do not equate transistors in logic (neither in power, as they are not cycled as frequently, nor in speed characteristics).
In all, I find the author's treatment of the Transmeta architecture sophomoric, and, after finding that section lacking, I left the rest of the article unread. Your mileage may vary.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
first, you don't just automatically get a linear increase with the width of the multiple-threading capabilities. it's not like it's free to increase the RF size and/or FUs, etc.
you're also confusing contexts with active threads. the Tera^WCray MTA had 128 contexts available -- so that thread switching is more light-weight, more or less -- but only one could be active at one time.
SMT in the various forms have more than one active thread, which introduces the problem(s) of competing for resources in the issue and retire stages, etc et al.
I actually read the article!!!!!
All my questions were answered so I have nothing to say.
Ok, so you are worried that your parts are no longer accessable.
:-) The next step is to somehow connect multiple (4/8/16/whatever) CPUs along with large (multi-meg) caches together using these specialized interconnect technologies.
One of the first computers I built had individual TTL parts (74xx type things) to make the CPU. If I fried on of those, I would just replace that single part and be going again. No need to replace the whole CPU.
I, for one, would never go back to that. Not just the size but the performance and the cost.
It used to be that I would buy 4K-bit RAM chips. Buy 8 of those to make a 8x4K RAM array (4K bytes) and then add a simple address decoder and put it on an S100 bus and you have more RAM for you system. Now I buy 512Meg DIMM modules where you can't (and don't want to) replace the individual chips. (Ok, you could if you have the fancy tools and you could get the chips in question but the cost factors just don't make that worth while)
Systems are getting faster because of the higher integration levels. Taking the off-chip caches (like systems build with 386 and 486 CPUs) and putting that onto the CPU (first in the P-Pro as multi-chip modules and then later, onto the CPU itself) has significantly improved performance. Yes, it has removed ability to replace the cache separate from the CPU but then who really wants to or needs to do that. And at what cost (go back to 100MHz cache memory interfaces? I like the 3GHz clock in my on-chip cache, thank you very much.)
The same is true of multi-cpu systems. As you increase the performance, the communications performance becomes a major bottle neck. First IBM put two CPU cores on one chip. Then Intel did a thing called Hyper-Threading (after they said that dual cores have no value
Imagine the performance gain of having 1GHz+ clocked cache of, say, 256Meg connected to 8 really fast CPU cores. It would be as much of a step forward as going from my TTL based 8-bit CPU to the 6502 single-chip CPU.
I know I would not want to go back... So lets investigate how to move forward.
In other words, you're laying out the basic problems of:
1) Being able to FIND parallelism
2) Being able to take advantage of it:
a) Issuing multiple instructions (limited fetch bandwidth)
b) Executing them in parallel (limited FUs)
c) Committing them to memory / retiring
20% is generous, but that's a limitation of the simplicity of HT with respect to the EV8 / UltraSparc-V scale of SMT implementation, which leans towards a more full-issue design.
Ehh. In my opinion, people overestimate how big a deal x86 architecture complexity is, in part because it flatters their preconception that Intel is evil. ("If only dastardly Intel hadn't been holding the world back with this demon architecture from hell, think how fast CPUs could be now!") While working at VMware, I've gotten to know the x86 architecture on a first name basis. He now lets me call him "Archie."
While Archie is undoubtedly an ugly, drunk screw-up, he's really a droplet in the ocean of effort that goes into a competitive CPU implementation. Yeah, we've got lots of code to deal with him, and he's an ongoing source of work, but not all that much code, nor that much work. If Archie were really such a terrible guy, it wouldn't be possible for Intel and AMD to be eating so many RISC vendors' lunches.
Mike Johnson, the lead x86 designer at AMD, probably put it most succinctly when he said, "The x86 isn't all that complex -- it just doesn't make a lot of sense." It's peculiar all right, but not so peculiar that it can explain Transmeta's failure to be performance competitive. From speaking with Transmetans, I get the strong impression that they got bogged down because making a high performance dynamic translation system is ridiculously hard, rather than, say, because they just couldn't get the growdown segment descriptors right.
This is a false economy. Just because you have 32 threads to run doesn't mean you would benefit from 32-way SMT. Remember that you don't just need 32 contexts in your CPU, you need enough cache to be able to feed 32 unrelated threads. The reason SMT sometimes slows down a CPU is that the 2 or more threads running concurrently compete for cache space. If you just run a single thread at a time, it has a whole quantum to fill up the cache and use it.
The way this worked on the afforementioned MTA machine is that the processor had 128 contexts and NO cache. Memory latency was 32 cycles, so as long as you had at least 32 compute-bound threads, there were no cycles lost to latency. However, this means that each thread did take longer to run.
aQazaQa