Ars Dissects POWER5, UltraSparc IV, and Efficeon
Burton Max writes "There's an interesting article here at Ars about the POWER5, UltraSparc IV, and Efficeon CPUs. It's a self-styled "overview of three specific upcoming processors: IBM's POWER5, Sun's UltraSparc IV, and Transmeta's Efficeon." I found the insights into Efficeon (successor to Crusoe) to be particularly good (although it paints a sad picture of Transmeta, methinks)."
Since day 1 they have skirted the benchmark issue, always trying to deflect the question.
Just like that article yesterday on their new chip. Did they ever cite a single benchmark? NO.
The basic performance of your CPU product, as measured by industry standard benchmarks, is essential knowledge.
I was under NDA on the previous gen Transmeta stuff. It was amusing how the other OEMs reacted - it was crap, but nobody could say anything in public.
The history of Wintel suggests that top-rated raw CPU performance is not the best predictor of adoption. Compatibility with market-dominating software platforms is a greater determinant of CPU sales. We might hope that advances in compiler design and flexible cores can help any CPU run x86 code, but there are always the little nits that prevent true compatibility and drive computer buyers toward the dominant platform.
Two wrongs don't make a right, but three lefts do.
This means that in a (say) 512 processor box the OS will have to handle 2048 processors efficiently. That's placing a lot of control in the hands of the software designers, and a lot of money in the hands of the companies that license per processor.
Fortunately for IBM, they are both the hardware designers and, frequently, the software designers. They can ensure that their big iron will be supported by software.
There are no trails. There are no trees out here.
Alright, 4 two-way chips. But does it actually improve anything over individual processors? If I have to yank a board on an UltraSparc, I'm not going to throw away the entire board and all its processors! I'm simply going to replace the bad one and slap the board right back in the system. With IBM's design, I have to throw the whole thing away and get a new block of cement^W^W^W processor chip for my machine.
Javascript + Nintendo DSi = DSiCade
Centrino is not a chip!
it's a package of intel wireless, intel cpu and some other stuff.
Ok, so you are worried that your parts are no longer accessible.
:-) The next step is to somehow connect multiple (4/8/16/whatever) CPUs along with large (multi-meg) caches together using these specialized interconnect technologies.
One of the first computers I built had individual TTL parts (74xx type things) to make the CPU. If I fried one of those, I would just replace that single part and be going again. No need to replace the whole CPU.
I, for one, would never go back to that. Not just the size but the performance and the cost.
It used to be that I would buy 4K-bit RAM chips. Buy 8 of those to make an 8x4K RAM array (4K bytes) and then add a simple address decoder and put it on an S100 bus and you have more RAM for your system. Now I buy 512Meg DIMM modules where you can't (and don't want to) replace the individual chips. (Ok, you could if you have the fancy tools and you could get the chips in question, but the cost factors just don't make that worthwhile.)
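The arithmetic behind that 8-chip board can be sketched roughly as follows. This is only an illustration under stated assumptions: each chip is taken to be 4096 x 1 bit, the address bus is taken to be 16 bits wide with the top four bits selecting the board, and `BOARD_BASE` and the function names are made up for the example.

```python
CHIP_BITS = 4096       # one 4K-bit chip, assumed organised as 4096 x 1 bit
CHIPS_PER_BOARD = 8    # one chip per data bit, so 8 chips hold whole bytes
BOARD_BASE = 0x2000    # hypothetical 4K-aligned base address of this board


def board_capacity_bytes():
    """Total capacity of the board: 8 chips x 4096 bits = 4096 bytes."""
    return CHIP_BITS * CHIPS_PER_BOARD // 8


def decode(address, board_base=BOARD_BASE):
    """Simple address decoder for an assumed 16-bit address bus.

    The top 4 bits select the board, the low 12 bits select one of the
    4096 cells inside every chip. Returns the offset within the board,
    or None when the address belongs to some other board.
    """
    if address & 0xF000 == board_base:
        return address & 0x0FFF
    return None
```

With these assumptions, `decode(0x2ABC)` gives offset `0xABC` on this board, while `decode(0x3000)` gives `None` because the address falls in the next 4K block.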
Systems are getting faster because of the higher integration levels. Taking the off-chip caches (like systems built with 386 and 486 CPUs) and putting that onto the CPU (first in the P-Pro as multi-chip modules and then later, onto the CPU itself) has significantly improved performance. Yes, it has removed the ability to replace the cache separately from the CPU, but then who really wants to or needs to do that? And at what cost (go back to 100MHz cache memory interfaces? I like the 3GHz clock in my on-chip cache, thank you very much.)
The same is true of multi-CPU systems. As you increase the performance, the communications performance becomes a major bottleneck. First IBM put two CPU cores on one chip. Then Intel did a thing called Hyper-Threading (after they said that dual cores have no value
Imagine the performance gain of having 1GHz+ clocked cache of, say, 256Meg connected to 8 really fast CPU cores. It would be as much of a step forward as going from my TTL based 8-bit CPU to the 6502 single-chip CPU.
I know I would not want to go back... So let's investigate how to move forward.
I am a huge fan of IBM's chips, especially in Apple computers (I am a proud owner of a 12" Powerbook).
I don't mean to burst your bubble, but your 12" PowerBook uses a Motorola processor, not an IBM one. I own a 15" PowerBook though and I love it.
That having been said, the IBM PPC 970 or G5 is breathing new life into the PowerMac line and Apple is doing really well because of it. I can't wait until they get it stuffed into a PowerBook.
"When the president does it, that means it's not illegal." - Richard M. Nixon
I don't understand why Transmeta still comes up in conversation. Besides the fact that they hired Linus, what exactly have they done to merit this inclusion alongside IBM, Sun, and Intel? There are plenty of other CPU manufacturers that sell x86 clones now... I think Cyrix was bought by some Taiwanese fab plant company, weren't they?
Until Transmeta becomes a real contender, let's just keep out of the Linux biases and concentrate on the real contenders.
My prediction is that if they don't produce a real hit soon, they will be out of business in 2 years.
Since the author of this article is lurking here, I thought I'd ask:
You make a rather big deal about Transmeta needing to run all x86 code through a "code morpher" (dynamic recompiler, actually), and come up with a decently large set of conclusions based on it.
What's the big deal? No processor executes raw x86 anymore. Everything translates into an internal microcode that bears little resemblance to the original asm. Of course, normal chips have hardware accelerated microcode translators, whereas Transmeta must recode in software -- but Transmeta's entire architecture was designed from day one to do that, and conceivably they have more context available to do recoding by involving main memory in the process.
And what is it with you neglecting the equivalence of main memory? Yes, transistors are necessary to store the translated program. They're also necessary to store the original one -- the Mozilla client I'm presently tapping away inside sure as hell doesn't fit in L1 on my C3! Outside of a small static penalty on load, and a smaller dynamic penalty from ongoing profiling, you can't blame performance on the fact that software needs to be in RAM. Software always needs to be in RAM.
Don't get me wrong -- Transmeta's a performance dog, and everyone's known that since day one. But I think it's reasonable to say the cause is mostly one of attention -- every man hour they threw into allowing the system to emulate x86 took away from adding pipelines, increasing clock rates, tweaking caches, etc. In other words, yes it's a feat that they got the code to work, but you don't need to blame the feat for the quality of work -- they simply did a lot of work nobody else had to waste time on, and fell behind because of it.
Much easier explanation. Might even be true.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
"Anyway, if you don't like the article, that's fine. But being a hater about it just makes you look lame."
And once more, Hannibal demonstrates that his writings @ Ars are more about being "kick@$$ kewl dude" than having the facts in check.
It didn't occur to you, Hannibal, that you just went ad hominem on pz, when he delivered sound and substantial criticism of your article?
"I don't like or use it so no one else does"
Real smart.
Any idea how many Sun systems are out there? People who use Sun hardware and software, and *gasp*, like it?! Should we only evaluate chips that currently do ok in the slashdot market?
Based on upvotes, Ageism is the only "-ism" Slashdotters care about and think isn't SJW
unlike the other x86 knockoff manufacturers they have actually attempted something somewhat new and different in their designs. They may not have met with a roaring success marketwise but they certainly did try to attack things from a different angle. The point of the article seems to be comparing the somewhat different approaches the various CPU makers took in their designs, not how many millions of chips they have sold or billions of dollars they have in the bank.
Oh my god! Please stop comparing apples and banjos and try to make sense of it!
DDR SDRAM does not "run" at around 400MHz - the frequency of the databus is 400MHz. As you state yourself, the power usage is very dependent on the usage pattern, and only very few memory cells actually change state during each write (up to 8 for an 8 bit RAM). I would guess that leakage and discharge of the capacitor cells is a significant factor, which you totally ignore.
In a processor on the other hand, a lot of transistors change state every clock cycle - even during execution of NOPs. Some signals will even change state several times during a clock cycle due to asynchronous races in the logic paths.
First of all I thank you for a great article. You have some interesting views on the Transmeta approach. But like the parent poster I feel you may jump to some conclusions based on assumptions.
It is true that the CMS has a cost in terms of RAM usage, but this does not necessarily translate into extra load latency. As I understand it, the key is to exploit the fact that in common code you spend most of the time executing only a very small portion of the code (like 90%/10% or whatever). It should be expected that much can be gained by heavily optimizing these "inner loops", which should translate into reduced load latency as fewer instructions will be executed in total. The cost of the four optimisation runs or JIT compilation should drown in the millions of times these inner loops are executed.
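A back-of-the-envelope sketch of that amortisation argument, assuming a one-time translation cost per loop and a fixed per-iteration cost before and after translation. Every number and function name here is hypothetical and chosen only to show the shape of the trade-off; none of it is measured Transmeta data.

```python
def break_even(translate_cost, cycles_translated, cycles_interpreted):
    """Smallest number of loop executions at which paying the one-time
    translation cost is strictly cheaper than interpreting every time.

    Returns None if translated code is not faster per execution.
    """
    saving = cycles_interpreted - cycles_translated
    if saving <= 0:
        return None
    # Need executions * saving > translate_cost.
    return translate_cost // saving + 1


def overhead_share(executions, translate_cost, cycles_translated):
    """Fraction of total cycles spent on translation rather than on
    running the translated loop."""
    total = translate_cost + executions * cycles_translated
    return translate_cost / total


# Hypothetical figures: 10,000 cycles to optimise a loop body that then
# runs in 10 cycles per iteration instead of 50 when interpreted.
HOT_LOOP_BREAK_EVEN = break_even(10_000, 10, 50)
HOT_LOOP_OVERHEAD = overhead_share(1_000_000, 10_000, 10)
```

With these made-up figures the translation pays for itself after 251 iterations, and after a million iterations it accounts for roughly 0.1% of the cycles spent on that loop. Code that runs only a handful of times never reaches the break-even point, which is why the 90%/10% split matters.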
You could say that it is a complete waste of transistors and power to have many transistors performing the same optimisation over and over again in the conventional processors. These hardware-based optimisations will also never be as efficient, as their scope is limited.
There are some interesting perspectives with the Transmeta approach as well. You state that POWER5, UltraSparcIV and Prescott tackle the problem with load latency by using SMT to fill pipeline bubbles from data stalls and thereby increase utilisation of the execution units. This should be possible for Transmeta as well, by upgrading their CMS to emulate two logical processors instead of one.
But you are right! A complete theoretical comparison is impossible - only real world experience will show...
Using street slang makes you sound juvenile. Black doesn't mean uneducated but blacks that say "hater" sound just as stupid as other races.
This is a false economy. Just because you have 32 threads to run doesn't mean you would benefit from 32-way SMT. Remember that you don't just need 32 contexts in your CPU, you need enough cache to be able to feed 32 unrelated threads. The reason SMT sometimes slows down a CPU is that the 2 or more threads running concurrently compete for cache space. If you just run a single thread at a time, it has a whole quantum to fill up the cache and use it.
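The cache-competition effect can be shown with a toy model. The sketch below assumes a fully associative LRU cache of 8 lines and two threads that each loop over a 6-line working set; the sizes, the access pattern and all names are invented for illustration, and the pattern is deliberately a worst case for LRU.

```python
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Hit rate of a fully associative LRU cache over an access trace."""
    cache = OrderedDict()
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)


def working_set(thread_id, lines, passes):
    """A thread that walks its own small working set over and over."""
    return [(thread_id, i) for _ in range(passes) for i in range(lines)]


thread_a = working_set("A", 6, 100)
thread_b = working_set("B", 6, 100)

# One thread per quantum: each thread has the whole cache to itself.
one_at_a_time = thread_a + thread_b
# SMT-style: the two threads alternate access by access.
interleaved = [line for pair in zip(thread_a, thread_b) for line in pair]

RATE_ALONE = hit_rate(one_at_a_time, capacity=8)
RATE_SHARED = hit_rate(interleaved, capacity=8)
```

Run back to back, each thread misses only on its first pass, so the hit rate is about 99%. Interleaved, the combined 12-line working set no longer fits in 8 lines and this particular cyclic pattern misses on every access. Real workloads fall somewhere in between, but the direction of the effect is the point.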
The way this worked on the aforementioned MTA machine is that the processor had 128 contexts and NO cache. Memory latency was 32 cycles, so as long as you had at least 32 compute-bound threads, there were no cycles lost to latency. However, this means that each thread did take longer to run.
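A minimal simulation of that latency-hiding scheme, assuming a simplified model in which every instruction waits the full 32-cycle memory latency and the processor issues from the ready threads in round-robin order. The function name and the model itself are assumptions for illustration, not a description of the real MTA pipeline.

```python
def utilization(num_threads, latency=32, cycles=3200):
    """Fraction of cycles in which some thread was ready to issue.

    Simplified model: a thread that issues at cycle c cannot issue
    again until cycle c + latency, and there is no cache to shorten
    that wait.
    """
    ready_at = [0] * num_threads   # earliest cycle each thread may issue
    issued = 0
    next_thread = 0
    for cycle in range(cycles):
        # Round-robin scan, starting after the thread that issued last.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break
    return issued / cycles
```

In this model 16 threads keep the issue slot busy half the time, while 32 or more keep it busy every cycle. The price is the one the comment names: with 128 runnable threads each one issues only once every 128 cycles, so any single thread takes longer to finish.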
aQazaQa
Actually, the multithreaded design is an answer to the lack of parallelism, as most designs are able to deal at the thread or process level; hence the parallelism is implicit and does not need to be "found". That is the whole point. You are citing the limitations of superscalar, not SMT, designs.