Ars Dissects POWER5, UltraSparc IV, and Efficeon

Good article by The_Ronin · 2003-11-20 08:02 · Score: 5, Interesting

Too bad they focused so much on Power and Transmeta while spending little time on UltraSparc IV and V, and ignored Itanium. With a little more balance it would have been a great read.

--

I don't drink because I have to, I drink to stop the voices in my head!

Re:Good article by AKAImBatman · 2003-11-20 08:16 · Score: 5, Interesting

I think it would have been best to have an article devoted to the TransMeta chip, and split the Power5/UltraSparc discussion out into its own article. That way he could have given a great deal more attention to the powerhouse chips and how they're going to change the future. TransMeta's chips are on the level of ARM, not UltraSparc.

--
Javascript + Nintendo DSi = DSiCade
Re:Good article by Anonymous Coward · 2003-11-20 08:30 · Score: 3, Informative

There's a reason they ignored Itanium: the article is about upcoming processor technologies. Last I checked there wasn't a new, soon-to-be-released Itanium that Intel was pushing.

In fact, the current Intel processor roadmap shows the same Itanium 2 processor for the first half of 2004 as it did for the second half of 2003.
Re:Good article by jaberwaki · 2003-11-20 08:39 · Score: 3, Informative

I believe he didn't spend much time on the UltraSparc IV because, quote:

"To get the "hyperthreading" effect of two processors on one chip, Sun stuck two full-blown UltraSparc III cores on a single chip, which is chip-pin compatible with the UltraSparc III."

He assumes the interested reader will already know something about the UltraSparc III. Sun didn't fundamentally change the chip architecture. Also, the Itanium architecture is already discussed ad nauseam in other articles. It wasn't meant to be a balanced overview of all new CPU architectures.
Re:Good article by pmz · 2003-11-20 12:22 · Score: 3, Informative

Sun didn't fundamentally change the chip architecture.

Probably the most significant outcome of the USIV will be 212-CPU Sun Fire 15K servers. That seems to imply something like 5 or 6 CPUs per rack-unit (although it appears the 15K is somewhat bigger than a standard rack).

--
Healthcare article at Kuro5hin

Transmeta is a joke by Anonymous Coward · 2003-11-20 08:04 · Score: 3, Insightful

Since day 1 they have skirted the benchmark issue, always trying to deflect the question.

Just like that article yesterday on their new chip. Did they ever cite a single benchmark? NO.

The basic performance of your CPU product, as measured by industry standard benchmarks, is essential knowledge.

I was under NDA on the previous gen Transmeta stuff. It was amusing how the other OEMs reacted - it was crap, but nobody could say anything in public.

Sun? by Raven42rac · 2003-11-20 08:08 · Score: 3, Interesting

Why the heck did Sun's offering get thrown in there? For variety? The Efficeons look awful nice to people who want less power-hunger from their computing devices. If all you do is word processing and such, why the heck even use an Intel/AMD chip? Less heat, less power, what is not to love? Now the IBM chips have really piqued my interest, I am a huge fan of IBM's chips, especially in Apple computers (I am a proud owner of a 12" Powerbook).

--
I hate sigs.

Re:Sun? by illumin8 · 2003-11-20 09:50 · Score: 4, Insightful

I am a huge fan of IBM's chips, especially in Apple computers (I am a proud owner of a 12" Powerbook).

I don't mean to burst your bubble, but your 12" PowerBook uses a Motorola processor, not an IBM one. I own a 15" PowerBook though and I love it.

That having been said, the IBM PPC 970 or G5 is breathing new life into the PowerMac line and Apple is doing really well because of it. I can't wait until they get it stuffed into a PowerBook.

--
"When the president does it, that means it's not illegal." - Richard M. Nixon

One Power 5... by Realistic_Dragon · 2003-11-20 08:09 · Score: 4, Interesting

Will show up as _4_ processors to the OS! (2 cores both doing SMT.)

This means that in a (say) 512 processor box the OS will have to handle 2048 processors efficiently. That's placing a lot of control in the hands of the software designers, and a lot of money in the hands of the companies that license per processor.
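The arithmetic above can be sketched in a few lines. This is only an illustration of the count described in this comment (two cores per chip, two SMT threads per core); the function name and its defaults are made up for the example.

```python
def logical_cpus(chips, cores_per_chip=2, threads_per_core=2):
    """Number of processors the OS would see, assuming every core
    exposes each of its SMT threads as a separate logical CPU."""
    return chips * cores_per_chip * threads_per_core


# One chip shows up as four processors; a 512-chip box as 2048.
print(logical_cpus(1))
print(logical_cpus(512))
```

The same count is what a per-processor license would be multiplied by if the vendor charges per logical CPU rather than per chip.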

On the other hand, UNIX is getting pretty efficient at scaling to large systems; perhaps it (and by extension Linux, thanks to SGI and IBM) will be able to handle it with no problems. One thread per processor on a desktop system might prove to be quite efficient :o)

--
Beep beep.

Re:One Power 5... by stevesliva · 2003-11-20 08:17 · Score: 3, Interesting

I'm getting a lot of karma mileage from this Power5 MCM review these days. They visited the same Microprocessor Forum that Ars did.

--
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
Re:One Power 5... by isaac · 2003-11-20 09:57 · Score: 3, Informative

So, IBM is taking away the ability to hot swap individual chips in exchange for... what? That's the big question. If there's some major improvement in the design, say so! Inquiring minds want to know! :-)
Damn, dude, RTFA if you're that curious!
What is gained is full-speed interconnect between processors within the same module. No "multipliers" - the bus between the cores within the module runs at chip speeds. The timings are so tight at 2+ GHz that this is simply impossible to do with individual chips.
-Isaac

--
I am not a lawyer, and this is not legal advice. For Entertainment Purposes Only.

Performance != marketshare by G4from128k · 2003-11-20 08:11 · Score: 4, Insightful

The history of Wintel suggests that top-rated raw CPU performance is not the best predictor of adoption. Compatibility with market-dominating software platforms is a greater determinant of CPU sales. We might hope that advances in compiler design and flexible cores can help any CPU run x86 code, but there are always the little nits that prevent true compatibility and drive computer buyers toward the dominant platform.

--
Two wrongs don't make a right, but three lefts do.

The "hyperthreading" thing. by Animats · 2003-11-20 08:18 · Score: 3, Interesting

First "Hyperthreading", now "prioritized hyperthreading".

It's amusing seeing this. It reflects mostly that Microsoft has finally managed to ship in volume OSs that can do more than one thing at a time. (Bear in mind that most of Microsoft's installed base is still Windows 95/98/ME. Transitioning the customer base to NT/Win2K/XP has gone much more slowly than planned.)

But Microsoft takes the position that if you have multiple CPUs, you have to pay more to run their software. So these strange beasts with multiple decoders sharing ALU resources emerge.

power consumption by bigpat · 2003-11-20 08:22 · Score: 4, Interesting

Wasn't low power consumption the number 1 benefit that Transmeta was looking to provide, so that you could get twice the battery life (or something like that) without sacrificing too much performance? Did Transmeta shoot itself in the foot by letting people think that it was going to provide higher performance chips than the competition?

The main selling point of Transmeta was always power consumption, so have they lost their edge in that area? If so, then that would be serious for them, but the article doesn't answer that question.

So, despite being lower voltage/MIPS... by csoto · 2003-11-20 08:51 · Score: 5, Interesting

the author suggests that it's not worth "pissing off Intel" to go with Transmeta. Give me a break. Transmeta is the only thing pushing Intel to make Centrino and other lower-wattage chips. They recognize that anybody in the mobile computing/devices world will seriously consider anything that gives their customers increased battery life and less toasty pockets.

--
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom

Re:So, despite being lower voltage/MIPS... by curtlewis · 2003-11-20 08:59 · Score: 3, Insightful

Centrino is not a chip!

it's a package of intel wireless, intel cpu and some other stuff.

memory and processor watts not the same by pz · 2003-11-20 08:53 · Score: 5, Interesting

Multiple times while reviewing the Efficeon architecture the article's author suggests that the tradeoff of additional storage required for Transmeta's code-morphing approach will easily balance out the power savings from making a simpler CPU. This belies a deep misunderstanding of power consumption in digital systems, as readily evidenced by the fact that modern non-Transmeta processors dissipate multiple tens of Watts of power (often nearly 100W) and a full complement of memory (4G, in modern machines) dissipates a few Watts at most.

Also in the article, the author suggests that processors spend most of their time waiting on loads, and then argues that since the code-morphing approach means more instruction fetches, the Efficeon processor will be spending disproportionately more time on loads. Then, after this assertion, he admits that he does not know *where* the translated Efficeon code is held. Might it be in one-cycle-accessible L1 cache? That point is conveniently sidestepped. He does not understand under what circumstances the profiling takes place, although he regurgitates the sales pitch nicely. He argues that transistors hold the translated code (trying to argue against the transistors-for-software tradeoff) but then does not realize that transistors in memory do not equate to transistors in logic (neither in power, as they are not cycled as frequently, nor in speed characteristics).

In all, I find the author's treatment of the Transmeta architecture sophomoric, and, after finding that section lacking, I left the rest of the article unread. Your mileage may vary.

--

Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.

Re:memory and processor watts not the same by Hannibal_Ars · 2003-11-20 09:53 · Score: 5, Informative

"Multiple times while reviewing the Efficeon architecture the article's author suggests that the tradeoff of additional storage required for Transmeta's code-morphing approach will easily balance out the power savings from making a simpler CPU."

I neither suggest nor imply anything this simplistic. In fact, I go to great pains to show how complicated the whole power picture is for Efficeon.

"This belies a deep misunderstanding of power consumption in digital systems, as readily evidenced by the fact that modern non-Transmeta processors dissipate multiple tens of Watts of power (often nearly 100W) and a full complement of memory (4G, in modern machines) dissipates a few Watts at most."

Er... you do realize, don't you, that comparing Efficeon to a 100W processor is not only unfair, but it's stupid and I didn't do it anywhere in the article. A more appropriate comparison is Centrino, which approaches Efficeon in MIPS/Watt without any help at all from any kind of CMS software. I think that you might be the one who needs to learn a bit more about digital systems.

"Also in the article, the author suggests that processors spend most of their time waiting on loads, and then argues that since the code-morphing approach means more instruction fetches, the Efficeon processor will be spending disproportionately more time on loads. Then, after this assertion, he admits that he does not know *where* the translated Efficeon code is held. Might it be in one-cycle-accessible L1 cache? "

No, it is most certainly not all stored in L1. TM claimed that the original CMS software that came with Crusoe took up about 16MB of RAM, and that this was paged in from a flash module on boot. What I'm not 100% certain of are the exact specs for Efficeon, but I've assumed in this article that they're similar. This is a reasonable assumption, especially given the fact that the new version of CMS contains significant enhancements and is unlikely to be smaller. In fact, it's much more likely to be larger than the original 16MB CMS footprint, especially given that DRAM modules have increased in speed and decreased in cost/MB, which gives TM more headroom and flexibility to increase the code size a bit.

"That point is conveniently sidestepped. He does not understand under what circumstances the profiling takes place, although he regurgitates the sales pitch nicely. He argues that transistors hold the translated code (trying to argue against the transistors-for-software tradeoff) but then does not realize that transistors in memory do not equate transistors in logic (neither in power, as they are not cycled as frequently, nor in speed characteristics)."

Of course I know that transistors in memory are not the same as transistors on the CPU. My point though is that they're still not "free" in terms of power draw, and that it also costs power to both page CMS into RAM and to move it from RAM to the L1. And even having pointed that out, I still don't claim that this cancels out all the power saving advantages of TM's approach.

As far as relying on the sales pitch for info on CMS's profiling, well, TM doesn't exactly release the source for CMS, nor do they make a detailed user manual for it available to the public. As their core technology, details about CMS are highly guarded and the only information that either you or I will likely ever have access to about it is whatever they put in the sales pitch. So I, like everyone else, must draw inferences from their presentations and do the best I can.

Anyway, if you don't like the article, that's fine. But being a hater about it just makes you look lame.

--
Senior CPU Editor | Ars Technica | http://arstechnica.com/
Re:memory and processor watts not the same by PastaAnta · 2003-11-20 12:43 · Score: 3, Insightful

First of all I thank you for a great article. You have some interesting views on the Transmeta approach. But like the parent poster I feel you may jump to some conclusions based on assumptions.

It is true that the CMS has a cost in terms of RAM usage, but this does not necessarily translate into extra load latency. As I understand it, the key is to exploit the fact that in common code you execute only a very small portion of the code most of the time (like 90%/10% or whatever). It should be expected that much can be gained by heavily optimizing these "inner loops", which should translate into reduced load latency as fewer instructions will be executed in total. The cost of the four optimisation runs or JIT compilation should drown in the millions of times these inner loops are executed.
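To make the amortisation argument concrete, here is a rough sketch of a break-even calculation. The cost model and every number in it are hypothetical (nothing here comes from Transmeta); it only shows how a one-time translation cost is paid back by a cheaper loop body.

```python
def total_cost(runs, one_time_cost, cost_per_run):
    """Total cycles spent: a one-time cost plus a per-run cost."""
    return one_time_cost + runs * cost_per_run


def break_even_runs(translation_cost, slow_cost_per_run, fast_cost_per_run):
    """Smallest number of loop runs at which translating once and then
    running the optimised code is no slower than always running the
    unoptimised code. Returns None if optimising never pays off."""
    saving = slow_cost_per_run - fast_cost_per_run
    if saving <= 0:
        return None
    return -(-translation_cost // saving)  # integer ceiling division


# Hypothetical figures: 10,000 cycles to optimise a loop that then costs
# 10 cycles per run instead of 50.
runs = break_even_runs(10_000, 50, 10)
print(runs)
print(total_cost(1_000_000, 10_000, 10), "vs", total_cost(1_000_000, 0, 50))
```

Under these made-up numbers the optimisation pays for itself after a few hundred runs, so for a loop executed millions of times the translation cost is a rounding error. Whether real code behaves this way is exactly the open question.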

You could say that it is a complete waste of transistors and power to have many transistors performing the same optimisation over and over again in conventional processors. These hardware-based optimisations will also never be as efficient, as their scope is limited.

There are some interesting perspectives with the Transmeta approach as well. You state that POWER5, UltraSparcIV and Prescott tackle the problem with load latency by using SMT to fill pipeline bubbles from data stalls and thereby increase utilisation of the execution units. This should be possible for Transmeta as well, by upgrading their CMS to emulate two logical processors instead of one.

But you are right! A complete theoretical comparison is impossible - only real world experience will show...

contexts != threads by kcm · 2003-11-20 09:03 · Score: 5, Informative

first, you don't just automatically get a linear increase with the width of the multiple-threading capabilities. it's not like it's free to increase the RF size and/or FUs, etc.

you're also confusing contexts with active threads. the Tera^WCray MTA had 128 contexts available -- so that thread switching is more light-weight, more or less -- but only one could be active at one time.

SMT in its various forms has more than one active thread, which introduces the problem(s) of competing for resources in the issue and retire stages, etc et al.

Guess what??? by crgrace · 2003-11-20 09:43 · Score: 5, Funny

I actually read the article!!!!!

All my questions were answered so I have nothing to say.

Re:One Power 5... (Just a matter of scale...) by Anonymous Coward · 2003-11-20 09:46 · Score: 3, Insightful

Ok, so you are worried that your parts are no longer accessible.

One of the first computers I built had individual TTL parts (74xx type things) to make the CPU. If I fried one of those, I would just replace that single part and be going again. No need to replace the whole CPU.

I, for one, would never go back to that. Not just the size but the performance and the cost.

It used to be that I would buy 4K-bit RAM chips. Buy 8 of those to make an 8x4K RAM array (4K bytes) and then add a simple address decoder and put it on an S100 bus and you have more RAM for your system. Now I buy 512Meg DIMM modules where you can't (and don't want to) replace the individual chips. (Ok, you could if you have the fancy tools and you could get the chips in question, but the cost factors just don't make that worthwhile.)

Systems are getting faster because of the higher integration levels. Taking the off-chip caches (like systems built with 386 and 486 CPUs) and putting that onto the CPU (first in the P-Pro as multi-chip modules and then later, onto the CPU itself) has significantly improved performance. Yes, it has removed the ability to replace the cache separately from the CPU, but then who really wants to or needs to do that? And at what cost (go back to 100MHz cache memory interfaces? I like the 3GHz clock in my on-chip cache, thank you very much.)

The same is true of multi-CPU systems. As you increase the performance, the communications performance becomes a major bottleneck. First IBM put two CPU cores on one chip. Then Intel did a thing called Hyper-Threading (after they said that dual cores have no value :-) The next step is to somehow connect multiple (4/8/16/whatever) CPUs along with large (multi-meg) caches together using these specialized interconnect technologies.

Imagine the performance gain of having 1GHz+ clocked cache of, say, 256Meg connected to 8 really fast CPU cores. It would be as much of a step forward as going from my TTL based 8-bit CPU to the 6502 single-chip CPU.

I know I would not want to go back... So let's investigate how to move forward.

Re:Why only two threads per core? by kcm · 2003-11-20 09:59 · Score: 3, Interesting

In other words, you're laying out the basic problems of:

1) Being able to FIND parallelism
2) Being able to take advantage of it:
a) Issuing multiple instructions (limited fetch bandwidth)
b) Executing them in parallel (limited FUs)
c) Committing them to memory / retiring

20% is generous, but that's a limitation of the simplicity of HT with respect to the EV8 / UltraSparc-V scale of SMT implementation, which leans towards a more full-issue design.

Re:Code has to be loaded anyway by kma · 2003-11-20 15:59 · Score: 3, Informative

Ehh. In my opinion, people overestimate how big a deal x86 architecture complexity is, in part because it flatters their preconception that Intel is evil. ("If only dastardly Intel hadn't been holding the world back with this demon architecture from hell, think how fast CPUs could be now!") While working at VMware, I've gotten to know the x86 architecture on a first name basis. He now lets me call him "Archie."

While Archie is undoubtedly an ugly, drunk screw-up, he's really a droplet in the ocean of effort that goes into a competitive CPU implementation. Yeah, we've got lots of code to deal with him, and he's an ongoing source of work, but not all that much code, nor that much work. If Archie were really such a terrible guy, it wouldn't be possible for Intel and AMD to be eating so many RISC vendors' lunches.

Mike Johnson, the lead x86 designer at AMD, probably put it most succinctly when he said, "The x86 isn't all that complex -- it just doesn't make a lot of sense." It's peculiar all right, but not so peculiar that it can explain Transmeta's failure to be performance competitive. From speaking with Transmetans, I get the strong impression that they got bogged down because making a high performance dynamic translation system is ridiculously hard, rather than, say, because they just couldn't get the growdown segment descriptors right.

Re:Why only two threads per core? by Anonymous Coward · 2003-11-20 16:27 · Score: 3, Insightful

This is a false economy. Just because you have 32 threads to run doesn't mean you would benefit from 32-way SMT. Remember that you don't just need 32 contexts in your CPU, you need enough cache to be able to feed 32 unrelated threads. The reason SMT sometimes slows down a CPU is that the 2 or more threads running concurrently compete for cache space. If you just run a single thread at a time, it has a whole quantum to fill up the cache and use it.

The way this worked on the aforementioned MTA machine is that the processor had 128 contexts and NO cache. Memory latency was 32 cycles, so as long as you had at least 32 compute-bound threads, there were no cycles lost to latency. However, this means that each thread did take longer to run.
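The latency-hiding effect described above can be illustrated with a toy simulation. This is a deliberately simplified model and not a description of the real MTA: it assumes each thread issues one instruction and then cannot issue again for `latency` cycles, and that the processor issues at most one instruction per cycle, round-robin. The function name and parameters are invented for the example.

```python
def utilization(num_threads, latency, cycles=3200):
    """Fraction of cycles in which some thread was able to issue.

    Toy model: after issuing, a thread must wait `latency` cycles
    (its memory access) before it can issue again. One issue per cycle.
    """
    ready_at = [0] * num_threads  # first cycle at which each thread may issue
    issued = 0
    start = 0                     # round-robin starting point
    for cycle in range(cycles):
        for i in range(num_threads):
            t = (start + i) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                issued += 1
                start = (t + 1) % num_threads
                break             # only one instruction issues per cycle
    return issued / cycles


for n in (8, 16, 32, 64):
    print(n, "threads:", utilization(n, latency=32))
```

In this model the issue slot is only kept full once there are at least as many threads as cycles of latency, and adding threads beyond that does not help. Each individual thread still issues at most once every 32 cycles, which is the "each thread did take longer to run" part.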
