Ars Dissects POWER5, UltraSparc IV, and Efficeon
Burton Max writes "There's an interesting article here at Ars about the POWER5, UltraSparc IV, and Efficeon CPUs. It's a self-styled "overview of three specific upcoming processors: IBM's POWER5, Sun's UltraSparc IV, and Transmeta's Efficeon. " I found the insights as to Efficeon (successor to Crusoe) to be particularly good (although it paints a sad picture of Transmeta, methinks)."
The POWER 4 based systems from IBM are only available up to 32-way. I'd expect them to try to double it to 64-way. So OS would see 128 processors with the POWER 5.
There's a reason they ignored Itanium, it's about upcoming processor technologies. Last I checked there wasn't a new, soon to be released, Itanium that Intel was pushing.
In fact, the current Intel processor roadmap shows the same Itanium 2 processor for the first half of 2004 as it did for the second half of 2003.
It's four dual-core SMT processor chips, and four L3 cache chips per MCM, actually. I think cache is sexy, but I'm biased.
Who do you get to be an expert to tell you something's not obvious? The least insightful person you can find? -J Roberts
I believe he didn't spend much time on the UltraSparc IV because, quote:
"To get the "hyperthreading" effect of two processors on one chip, Sun stuck two full-blown UltraSparc III cores on a single chip, which is chip-pin compatible with the UltraSparc III."
He assumes the interested reader will already know something about the UltraSparc III. Sun didn't fundamentally change the chip architecture. Also the Itanium architecture is already discussed ad-nauseum in other articles. It wasn't meant to be a balanced overview of all new CPU architectures.
first, you don't just automatically get a linear increase with the width of the multiple-threading capabilities. it's not like it's free to increase the RF size and/or FUs, etc.
you're also confusing contexts with active threads. the Tera^WCray MTA had 128 contexts available -- so that thread switching is more light-weight, more or less -- but only one could be active at one time.
SMT in the various forms have more than one active thread, which introduces the problem(s) of competing for resources in the issue and retire stages, etc et al.
Subsequently, I don't know how much you've played with Pentium IVs but they don't buy much in most circumstances. We're not talking about doubling performance or anything like that. If one SMT unit get's you a 20% improvment on a p4 in the best cases then what does the 4th unit buy? IBM is just hedging because the technology hasn't shown that it delivers serious punch yet.
"Multiple times while reviewing the Efficion architecture the article's author suggests that the tradeoff of additional storage required for Transmeta's code-morphing approach will easily balance out the power savings from making a simpler CPU."
I neither suggest nor imply anything this simplistic. In fact, I go to great pains to show how complicated the whole power picture is for Efficeon.
"This belies a deep misunderstanding of power consumption in digital systems, as readily evidences by the fact that modern non-Transmeta processers dissipate multiple tens of Watts of power (often nearly 100W) and a full complement of memory (4G, in modern machines) dissipates a few Watts at most."
Er... you do realize, don't you, that comparing Efficeon to a 100W processor is not only unfair, but it's stupid and I didn't do it anywhere in the article. A more appropriate comparison is Centrino, which approaches Efficeon in MIPS/Watt without any help at all from any kind of CMS software. I think that you might be the one who needs to learn a bit more about digital systems.
"Also in the article, the author suggests that processors spend most of their time wating on loads, and then argues that since the code-morphing approach means more instruction fetches, the Efficion processor will be spending disproportionatly more time on loads. Then, after this assertion, he admits that he does not know *where* the translated Efficion code is held. Might it be in one-cycle-accessible L1 cache? "
No, it is most certainly all not stored in L1. TM claimed that the original CMS software that came with Crusoe took up about 16MB of RAM, and that this was paged in from a flash module on boot. What I'm not 100% certain of are the exact specs for Efficeon, but I've assumed in this article that they're similar. This is a reasonable assumption, especially given the fact that the new version of CMS contains significant enhancements and is unlikely to be smaller. In fact, it's much more likely to be larger than the original 16MB CMS footprint, especially given that DRAM modules have increased in speed and decreased in cost/MB, which gives TM more headroom and flexibility to increase the code size a bit.
"That point is conveniently sidestepped. He does not understand under what circumstances the profiling takes place, although he regurgitates the sales pitch nicely. He argues that transistors hold the translated code (trying to argue against the transistors-for-software tradeoff) but then does not realize that transistors in memory do not equate transistors in logic (neither in power, as they are not cycled as frequently, nor in speed characteristics)."
Of course I know that transistors in memory are not the same as transistors on the CPU. My point though is that they're still not "free" in terms of power draw, and that it also costs power to both page CMS into RAM and to move it from RAM to the L1. And even having pointed that out, I still don't claim that this cancells out all the power saving advantages of TM's approach.
As far as relying on the sales pitch for info on CMS's profiling, well, TM doesn't exactly release the source for CMS, nor do they provide a detailed user manual for it avialable to the public. As their core technology, details about CMS are highly guarded and the only information that either you or I will likely ever have access to about it is whatever they put in the sales pitch. So I, like everyone else, must draw inferences from their presentations and do the best I can.
Anyway, if you don't like the article, that's fine. But being a hater about it just makes you look lame.
Senior CPU Editor | Ars Technica | http://arstechnica.com/
What is gained is full-speed interconnect between processors within the same module. No "multipliers" - the bus between the cores within the module run at chip speeds. The timings are so tight at 2+ GHz that this is simply impossible to do with individual chips.
-Isaac
I am not a lawyer, and this is not legal advice. For Entertainment Purposes Only.
MS has shipped preemptive multitasking and multithreading for a long time. You are confusing that with multiprocessing (which is different).
Win95/98/ME are not multiprocessor but are preemptive multitasking and multithreading. They can certainly do "more than one thing at a time". Unlike Apple who first shipped this capability only recently, MS first shipped this in Windows 386 back in the late 80's.
Sun didn't fundamentally change the chip architecture.
Probably the most significant outcome of the USIV will be 212-CPU Sun Fire 15K servers. That seems to imply something like 5 or 6 CPUs per rack-unit (although it appears the 15K is somewhat bigger than a standard rack).
Healthcare article at Kuro5hin
Ehh. In my opinion, people overestimate how big a deal x86 architecture complexity is, in part because it flatters their preconception that Intel is evil. ("If only dastardly Intel hadn't been holding the world back with this demon architecture from hell, think how fast CPUs could be now!") While working at VMware, I've gotten to know the x86 architecture on a first name basis. He now lets me call him "Archie."
While Archie is undoubtedly an ugly, drunk screw-up, he's really a droplet in the ocean of effort that goes into a competitive CPU implementation. Yeah, we've got lots of code to deal with him, and he's an ongoing source of work, but not all that much code, nor that much work. If Archie were really such a terrible guy, it wouldn't be possible for Intel and AMD to be eating so many RISC vendors' lunches.
Mike Johnson, the lead x86 designer at AMD, probably put it most succinctly when he said, "The x86 isn't all that complex -- it just doesn't make a lot of sense." It's peculiar all right, but not so peculiar that it can explain Transmeta's failure to be performance competitive. From speaking with Transmetans, I get the strong impression that they got bogged down because making a high performance dynamic translation system is ridiculously hard, rather than, say, because they just couldn't get the growdown segment descriptors right.