AMD vs Intel: CPU Design Philosophy
Johan writes "We have published an in-depth comparison between the CPU design decisions that AMD's engineers (Athlon and upcoming Mustang) made and those of Intel's engineers (Pentium 4).
Some of the questions answered:
Are double pumped, hyperpipelined, low latency designs the only future for x86? Will future designs from AMD and other competitors be similar to Intel's innovative seventh generation core? "
Intel, however, is truly making an innovative processor design with the P4. Signal propagation delay is possibly becoming a bottleneck: Ace's pointed out that the 2 "drive" stages in the pipe may exist just to let the signal reach the other side of the chip. The only problem with this is that AMD has caught up extremely quickly in the past year, and while imperfect, the Athlon design scales in clock speed extremely well. With the 1.2GHz Thunderbird here, on a .18 micron process no less, AMD will stay in the high-end market as long as it can keep up with the process technology.
The P4 uses about twice as many transistors as a Thunderbird or Coppermine in order to achieve its massive hyper-pipelined design. AMD, on the other hand, will integrate 2 CPU cores onto one die with a shared L2 cache in Sledgehammer. I would imagine that the design for Sledgehammer is similar to the Athlon, but with 64-bit extensions. Why not use what technology they have and refine it instead of reinventing the wheel?
IBM, with Blue Gene, is taking this parallelism to an extreme (Quote at bottom of my post is directly off of IBM's site), and AMD is taking a similar route on a much smaller scale. Now think of the potential performance difference between a P4 1.5GHz and a 1.2GHz Thunderbird given that the P4 is slower at the same clock speed. Almost no perceptible performance difference, in all likelihood. Now imagine a dual 1.2GHz Thunderbird, and imagine how that would perform in comparison: yes, all of the systems are extremely fast, but the "dual" system would stand out as the fastest. Take into consideration that the die size for Sledgehammer won't be much more than what it is for the P4. So, you will be able to get a dual-cored CPU for around the same price as a single-cored CPU that gets lower IPC and runs at a higher clock speed.
As you can see, there will be no comparison for Sledgehammer on the desktop as long as there is enough memory bandwidth to satisfy its needs.
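A back-of-the-envelope way to frame the argument above is throughput ≈ IPC × clock × effective cores. The sketch below is only an illustration: the IPC values and the dual-core scaling factor are made-up placeholders, not measurements of any real chip, and `relative_throughput` is a hypothetical helper name.

```python
def relative_throughput(ipc, clock_ghz, cores=1, scaling=1.0):
    """Rough throughput estimate: IPC x clock (GHz) x effective cores.

    `scaling` is the fraction of each extra core's capacity that the
    workload actually uses (1.0 would be perfect scaling).
    """
    effective_cores = 1 + (cores - 1) * scaling
    return ipc * clock_ghz * effective_cores


# Hypothetical IPC figures, chosen only so the lower-IPC chip at 1.5GHz
# lands level with the higher-IPC chip at 1.2GHz. NOT benchmark results.
p4 = relative_throughput(ipc=0.8, clock_ghz=1.5)
tbird = relative_throughput(ipc=1.0, clock_ghz=1.2)

# Assumed 70% use of the second core; real workloads vary widely.
dual_tbird = relative_throughput(ipc=1.0, clock_ghz=1.2, cores=2, scaling=0.7)

print(f"P4 1.5GHz:        {p4:.2f}")
print(f"Tbird 1.2GHz:     {tbird:.2f}")
print(f"Dual Tbird 1.2GHz: {dual_tbird:.2f}")
```

Under these assumptions the two single-core chips come out even and the dual configuration pulls ahead, but only for software that can actually keep two cores busy and only if memory bandwidth holds up.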
Eight of these boards will be placed in 6-foot-high racks (16 teraflops), and the final machine (less than 2000 sq. ft.) will consist of 64 racks linked together to achieve the one petaflop performance..."
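The figures in the IBM quote can be sanity-checked with a few lines of arithmetic. This is just a sketch using the numbers as quoted; the decimal prefix (1 petaflop = 1000 teraflops) is an assumption.

```python
# Figures taken from the quoted IBM description.
TERAFLOPS_PER_RACK = 16
BOARDS_PER_RACK = 8
RACKS = 64

teraflops_per_board = TERAFLOPS_PER_RACK / BOARDS_PER_RACK
total_teraflops = RACKS * TERAFLOPS_PER_RACK
total_petaflops = total_teraflops / 1000  # decimal prefix assumed

print(teraflops_per_board, total_teraflops, total_petaflops)
```

So the quoted numbers are consistent with each other: 64 racks at 16 teraflops each is just over one petaflop.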
Only those who dream can grasp reality.
Rather, it is a piece of self-promotion by Ace's Hardware, who sent this story in themselves.
Many websites send notices of their original content to each other, especially when they know that it is excellent content, like this article. ArsTechnica sends notices both to Ace's and to
The article itself doesn't say anything the knowledgeable don't already know.
This is false. I am a hell of a lot more knowledgeable in matters of MPU architecture than you, and I learned quite a bit. But I suppose you were already an expert on the intricacies of load-store reordering on the P6 vs. the K7, on the precise weaknesses of the K7's branch prediction algorithm (i.e. that it throws an exception and flushes its BTB when presented with more than two branches in a 16-byte aligned code window), on the dependency scheduling problems of very large instruction reorder buffers and what they imply about the P4's clock-speed ramp. I suppose you'd already seen benchmarks which measured the effects of L2 latency and branch prediction on IPC. (You wouldn't mind posting a link, would you troll?)
In fact, it reads like a high-school report, and not even a very well-written one. E.g., "First we will try to analyze the most important shortcomings, next we will search for possible solutions." Sounds just like the simplistic expositions of a high school term paper.
Way to go, asshole. The author's name is Johan De Gelas. He lives in the Netherlands. ENGLISH IS NOT HIS NATIVE LANGUAGE. I'd like to see you post a single sentence in Dutch, much less an incredibly insightful article on competing philosophies in next-generation 1.5 GHz+ MPU design.
Look, I know that there is a lot of mumbo-jumbo laden "technical" architecture discussion going around the web, often quite nonsensical and written by good-old fashioned Americans who just haven't had the benefit of 8th grade grammar (or a solid education in MPU design). The point is, you were horribly wrong to lump this article in with that schlock, and you apparently did so only because it contained terms and explanations which you didn't understand. Furthermore, you made your point, with quite authoritative tone, in a public forum. Of course you have every right to be loud and wrong in
I repeat: the article is not a technical piece at all. Hannibal at ArsTechnica writes technical pieces about CPU design. This article at Ace's Hardware says nothing insightful.
Completely backwards. Now, let me first say that I not only respect Hannibal tremendously, but that his articles (particularly the excellent RISC vs. CISC in the Post-RISC era) were what inspired me, a bit over a year ago, to begin to learn much more about MPU architecture and design. They are written very vividly, with strong prose and excellent, clear analogies. They do a fabulous job of explaining complicated concepts and new trends in MPU design to a lay reader.
ArsTechnica, like
So by all means, people--if you're reading this and want to learn about the fascinating world of MPU design, start with Hannibal. But just know that his articles, while very good, are *not* technical; when you want technical, a great place to start is Ace's.
Now that we're through with that bit of unpleasantness, let's clean up your misstatements, shall we?
In fact, it misses the point. It dares to call the P4 "innovative" and wonder whether future designs in the x86 world will copy it. Well, of course not! How many times must it be said that the P4 barely keeps up with the Athlon and performs less well than a P!!!? Because, that is a fact. Numerous production samples have leaked, with the test results uniformly and without exception pointing to the fact that even if the platform's performance is improved by release time--which it should, since these are samples not a retail product--it won't outperform a P!!! with equal clockspeed. That's why the P4 is being released at 1.4 and 1.5GHz initially, because if they were released at 1.2GHz they'd be outperformed by the 1GHz P!!! and that wouldn't be good.
Oh really. Just like preproduction benchmarks of the K7 proved it to be "closer to that of a Celeron 366 than any Pentium III." Just like preproduction benchmarks of the PII led to the following insightful comments from Tom's Hardware (a leader in the "P4 is overhyped, clock-speed isn't everything, blah blah blah" ignorance these days...):
Guess what: preproduction benchmarks are always wrong. Again, preproduction benchmarks are always wrong. And in particular, the benchmarks we've seen on those preproduction P4's are--just like the benchmarks included in the articles above (i.e. the K7 scoring only 60% of a clock-normalized PIII on FPUMark; the PII doing worse on 32-bit code than a P5-MMX)--utter nonsense given what we know about the P4's design. Thus the logical conclusion is that, just like the preproduction MPU's "benchmarked" above (and let me remind you that those were at least close enough to final silicon to be clocked at release-ready clock speeds), the P4's we have seen "benchmarked" on the web so far have been sandbagged.
Now, the common reaction to these charges goes something like this: "Sandbagged? Impossible! After all, these P4's are at most one stepping from final silicon, maybe even final silicon! Thus they can't be sandbagged!" Which is utterly false. Obviously the sandbagging isn't done in the chip design--that would be idiotic. Rather, it is done in microcode. Every feature of the chip can be turned on and off, tuned and detuned, in microcode. Thus it is trivial to ship a preproduction MPU off for validation with, for example, part of the L2 cache disabled, or the BTB or instruction reorder buffers set to flush when they don't need to, or the way prediction on the two-cycle L1 cache turned off, or tuned wrong, or with certain x86 instructions mapped to unnecessarily slow circuit paths, or any of dozens and dozens of different things set wrong. Indeed, this is the common state of internal preproduction MPUs, because the only way to test corner cases and pathological cases is by disabling one part of the chip and thus placing unrealistic stress on another. In other words, preproduction chips are sort of like beta software--full of DEBUG code which slows everything down, but isn't worth taking out until you're sure everything works.
"But," you may say, "why would Intel sandbag their preproduction P4's when they know benchmarks will leak out?? Why not build up the hype and all that??" The answer, again, is simple. If you take a look at Intel's history of dealing with prerelease cores, you find that they only hype the projects which are likely to underperform horribly--the i860, the iAPX432, Itanium--and they significantly underplay the ones which are going to kick major booty--eg. the P6 core and now the P4. "But why???" Easy. If Intel has a project which sucks, the best they can hope for is to scare off their potential competitors from the market space until they can get another crack at it. (Remember, there's a 3-or-more year lag-time between the decision to start--or not start--a project and the finished product.) That's exactly what they've done with Itanium, scaring MIPS out of the high-end RISC business, and putting Compaq and HP years behind on their high-end RISC designs, with nothing but a bunch of IA-64 FUD. Meanwhile, if their upcoming core is going to perform incredibly, why waste time hyping and giving your competitors the tip-off?? All that would do is cannibalize the sales of your current MPUs as people wait to get the amazing new chip due out in 6 months. Worse, if Intel hyped the great performance of the upcoming P4, they would need to admit that the average PC user can actually use 1 GHz+ performance...which, of course, would play right into the hands of AMD which is the only player with decent 1GHz+ volume until well into next year. This way, you get to surprise the industry, get great press, and sell off way more of your old, now obsolete chips. Simple, really.
Now, the P4 barely keeps up with the current-generation Athlon Thunderbirds. This is important to note because people always *blamed* AMD for a processor which still, with the advantages of the P!!! SIMD instruction optimizations used in much software, didn't quite keep pace with Intel's offering in the most common benchmarks. Now, the technically knowledgeable know that the Athlon whomps the P!!! in anything that isn't SIMDified, and that its floating point unit is head-and-shoulders above. But people still moaned about the performance gap in certain common SIMDified benchmarks.
Wrong, wrong, wrong. The only cases in which the Athlon clearly bests a Coppermine P3 are scientific (i.e. double-precision) FPU-heavy simulations, ray tracing, etc. On almost every other benchmark, they are within +/-5% at identical clock speeds, with a few standouts at around +/-8% for each architecture. In particular, 3D games tend to show an affinity for the Coppermine. Blaming this on some "SIMD bogeyman" is ridiculous--every 3D game, and especially a standout game like Quake 3, is optimized for 3DNow just as it is for SSE. Now, you can either deny the facts, or you can try to understand them.
The main culprit, of course, is the difference in L2 latencies. Tbird has a 64-bit bus to L2 at a latency of 11 clock cycles, with 384KB total cache; Coppermine has a 256-bit bus to L2 at a latency of 7 clock cycles with 256KB total cache. The Tbird has the bigger cache because the cache design is exclusive; however, it also has much longer latencies for this and other reasons. In the end, there is no comparison as to which is the better design--the Coppermine's cache hierarchy is simply better than the TBird's, no argument about it. And Johan's benchmarks illustrate this rather nicely.
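To get a feel for how much a 4-cycle difference in L2 latency can matter, here is a rough average-memory-access-time sketch. Only the 11- and 7-cycle L2 latencies come from the post above; the L1 hit time, the miss rates and the main-memory latency are placeholder assumptions, and both chips are given identical values for them, which ignores real differences such as L1 size and bus width.

```python
def avg_access_cycles(l1_hit, l1_miss_rate, l2_latency, l2_miss_rate, mem_latency):
    """Textbook average memory access time for a two-level cache, in cycles."""
    return l1_hit + l1_miss_rate * (l2_latency + l2_miss_rate * mem_latency)


# Assumed values, identical for both chips purely for illustration.
L1_HIT = 3           # cycles (assumption)
L1_MISS_RATE = 0.05  # assumption
L2_MISS_RATE = 0.2   # assumption
MEM_LATENCY = 100    # cycles (assumption)

tbird = avg_access_cycles(L1_HIT, L1_MISS_RATE, 11, L2_MISS_RATE, MEM_LATENCY)
coppermine = avg_access_cycles(L1_HIT, L1_MISS_RATE, 7, L2_MISS_RATE, MEM_LATENCY)

print(f"Tbird:      {tbird:.2f} cycles per access")
print(f"Coppermine: {coppermine:.2f} cycles per access")
```

With everything else held equal, the gap is simply the L1 miss rate times the latency difference, so how much it shows up in a benchmark depends heavily on how often the workload misses in L1.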
Well, here's what they didn't realize: the Athlon is a truly seventh-generation core--which beat Intel to the punch by, what, almost a year and a half? As such, it has made trade-offs to be able to scale to higher clockspeeds better--one reason why Intel had to recall, and still hasn't re-issued, the 1.13GHz P!!! yet AMD are easily churning out 1.2GHz Athlon Thunderbirds.
"The Athlon is truly a seventh-generation core." What does that mean??? If you think it means the K7 core has one single architectural innovation which does not exist on an MPU available before it, then I challenge you to list it now. (Indeed, I can't think of a single innovation in the K7 which isn't in the P6 core--except for the exclusive cache architecture, which is an overall weakness compared to the Coppermine cache--but there may be some.) If you think it means the K7 is a better core than the P6, well, you're right. The K7 is indeed a better core, in that its pipeline stages are more evenly balanced, and thus it can scale to higher clockspeeds on similar process. On the other hand, the K7 is less well balanced from an execution resources standpoint, including such oafish features as a fully 3-wide FPU (as opposed to the P6's 1.5-wide FPU), which offers at best 40% better performance, but generally no better performance than the P6 on FP intensive apps. Yes, the reason for the discrepancy is partly due to code which is compiled with the P6's execution resources in mind--but of course, that will continue to be most things so long as Intel has the majority of market share (AMD currently sells out all the MPUs it can make and thus has no theoretical way of getting majority market share for at least the next 4 years or so), and most apps are precompiled binary. But it's partly due to the fact that there's just not enough need for 3 full FPUs to justify the die space they take. This is just one example, but the end result is that the K7 is a well-balanced core pipeline-wise which is larger and consumes more power than it can justify based on its ability to get instructions from cache and memory. It is still the fastest thing out there, but it uses brute force to make it there. Time-to-market issues are behind some of these design issues, and some of those will be solved with the upcoming Mustang/Palomino/Morgan core tweak. 
But that still won't make the K7 anything more than a rebalanced tweaked-out brute-force of a P6. And hey--that ain't bad. But it ain't innovation.
The P4, on the other hand, includes many features never before seen on a commercial MPU. They include: double-pumped ALU, integer decoder and scheduler, and integer retiring (running at up to 4 GHz on a
It is, all-in-all, a very impressive looking chip, more than worthy of the title "seventh generation", whether it turns out to perform well or poorly. However, meaningless sandbagged benchmarks aside, all indications are that it will perform magnificently. Taken as a whole, the P4 contains not only the sorts of design changes necessary to *double* clock speed on a given process over the P6 (note:WOW), but also *increase* IPC. But we'll see how this beautiful looking design translates to reality when the first actual P4's are released and benchmarked.
Blah blah blah, biased statements towards Ace's.
Ace's is in general a slightly AMD-biased site. "Unfortunately", Johan, Brian, and the rest of the crew there "have to" read the thoughts of actual MPU experts day in and day out in their technical forum, and thus know that the case for the K7--and against the P4--is not what the average hardware site has made it out to be. This is not to take anything away from AMD, which has at the moment by far and away the fastest performing MPUs on the planet, the best binsplits on the planet, and about 1.4x the performance/price of Intel all the way up and down their price lists. However, all appearances are that, once the P4 moves into heavy volume production (note: not until Q3 next year at the earliest, after a process shrink to
I am more interested in the differences between the PowerPC and x86 manufacturers than the intra-x86 manufacturer fighting. I think Motorola could learn a lot from the recent design trends of AMD and Intel, about ways of pushing the megahertz envelope, while Intel and AMD could learn a lot about energy efficiency and overheating from the PowerPC camp. Transmeta notwithstanding.
When the P6 was released, it was the fastest processor available in industry standard benchmarks (SPEC, including Alpha). Its design was highly original, and manages to keep the CISC nastiness contained to the first few stages of the pipe. Claiming that the P6 was not a world-class design when released is only a testament to your own ignorance.
Exactly correct. If I had moderator points, they'd be yours.
And indeed, the 1 GHz P3--on that same, 5 year old P6 core--is still tied with the moderately-vaunted brand-new mucho-expensive (not available until Q1) 900 MHz UltraSparcIII in SPECint2000. The 1.2 GHz Athlon would presumably perform even better (once they release SPEC scores from the new Compaq Fortran compilers), making it second only to the fastest (and also none-too-available) Alphas in terms of pure performance. The x86 ISA may be suboptimal, but Intel and now AMD have been able to keep up with the best--and most expensive--of the RISC world due to superior engineering (except when compared to the excellent Alpha team) and superior process technology. Sure they may not have the i/o bandwidth, RAS, or operating systems to compete in the big leagues, but anyone dissing today's x86 chips on account of their designs or engineering qualities is, as the poster said, demonstrating their ignorance.
And if Compaq doesn't hurry the EV68 (die-shrunk Alpha) to market, the P4 and perhaps Mustang as well will blow by even the mighty Alpha, in SPECint and possibly even SPECfp. (The last real knock against the x86 ISA is that it is saddled with the horrendous x87 fp architecture, which is why x86 SPECfp scores trail everyone else by so much. With the P4's upcoming SSE2 instructions, however, that problem may be in the past.) Aesthetics aside, there is no doubt that x86 processors, taken as a whole, are easily the best designed, highest performing MPU's around.
Actually, when the 760MP chipset comes out from AMD, you'll be able to use 2 different speed processors on the same board.
It's point-to-point multiprocessing, instead of symmetrical. You can, for example, buy a 760MP with a 1GHz CPU now, and put on a 1.2GHz as the second processor later. And each chip has its own Northbridge and path to RAM, as opposed to the shared GTL bus on an Intel.
They FINALLY demonstrated the prototypes, so the real boards should be out Real Soon Now.
--- "So THAT's what an invisible barrier looks like!" - Time Bandits
Rather, it is a piece of self-promotion by Ace's Hardware, who sent this story in themselves. The article itself doesn't say anything the knowledgeable don't already know. In fact, it reads like a high-school report, and not even a very well-written one. E.g., "First we will try to analyze the most important shortcomings, next we will search for possible solutions." Sounds just like the simplistic expositions of a high school term paper.
I repeat: the article is not a technical piece at all. Hannibal at ArsTechnica writes technical pieces about CPU design. This article at Ace's Hardware says nothing insightful.
In fact, it misses the point. It dares to call the P4 "innovative" and wonder whether future designs in the x86 world will copy it. Well, of course not! How many times must it be said that the P4 barely keeps up with the Athlon and performs less well than a P!!!? Because, that is a fact. Numerous production samples have leaked, with the test results uniformly and without exception pointing to the fact that even if the platform's performance is improved by release time--which it should, since these are samples not a retail product--it won't outperform a P!!! with equal clockspeed. That's why the P4 is being released at 1.4 and 1.5GHz initially, because if they were released at 1.2GHz they'd be outperformed by the 1GHz P!!! and that wouldn't be good.
Now, the P4 barely keeps up with the current-generation Athlon Thunderbirds. This is important to note because people always *blamed* AMD for a processor which still, with the advantages of the P!!! SIMD instruction optimizations used in much software, didn't quite keep pace with Intel's offering in the most common benchmarks. Now, the technically knowledgeable know that the Athlon whomps the P!!! in anything that isn't SIMDified, and that its floating point unit is head-and-shoulders above. But people still moaned about the performance gap in certain common SIMDified benchmarks.
Well, here's what they didn't realize: the Athlon is a truly seventh-generation core--which beat Intel to the punch by, what, almost a year and a half? As such, it has made trade-offs to be able to scale to higher clockspeeds better--one reason why Intel had to recall, and still hasn't re-issued, the 1.13GHz P!!! yet AMD are easily churning out 1.2GHz Athlon Thunderbirds. The P!!! only scales well up to 1GHz--even then, it needed a microcode update to be stable--while the Athlon Thunderbird has hit 1.2GHz with no problems. Heck, Duron 600's usually overclock to at least 900MHz.
In other words, you can't reasonably compare a core optimized to scale to low clockspeeds and take advantage of them, to a core designed to scale up to extreme speeds. You have to compare the Athlon Thunderbird core to Intel's own belated seventh-generation x86, the P4. And, the Athlon Thunderbird compares very favorably. It hasn't been released at 1.4GHz, and probably won't be since AMD will undoubtedly release the newer core before then, but an extrapolated 1.4GHz Athlon Thunderbird, in line with how performance scales for that core, beats the 1.4GHz P4 samples that have been tested. THE ATHLON BEATS IT. So, how can you call such a low-performing core innovative? It isn't. I'd wager that the next core AMD have up their sleeves will be the real innovator here. Plus, to get the performance it does, Intel's P4 even has to use a 400MHz-effective FSB and double-pumped ALU. This makes the P4 core itself look rather weak in comparison with the Athlon, which gets by with similar performance with merely a 200MHz (soon, 233) FSB and a non-double-pumped ALU. So, the core of the Athlon is clearly, in itself, much stronger than that of the P4. AMD will doubtless be using similar tricks in its future revisions, but it cannot be doubted that the P4 is not the "innovation" that this BS article claims it is. The article even belittles Athlon's branch prediction--which is weak, because the core was rushed--not noting the fact that even with such a poor branch prediction mechanism the Athlon core outperforms the P4 on a theoretical clock-for-clock basis.
I note the "theoretical" because I'd like to again point out that the Athlon core is soon to be released in a new revision which will scale to higher clockspeeds, have larger cache, and have improvements to the core itself which AMD has not yet specified. I think that this article at Ace's Hardware is so utterly biased against AMD and for Intel that it makes me sick. He talks of everything negative about the Athlon as being a "compromise" or a decision made in a rush, yet he plays down the negative aspects of the P4 core--for example, he plays down the 19-cycle branch misprediction penalty in the P4 by hyping the P4's excellent branch prediction algorithms, but doesn't give the Athlon slack about its lackluster branch prediction mechanism based on the fact that it has a reduced misprediction penalty. Ace's Hardware has always been biased for Intel and against AMD, and it shows here. The P4 core is hyped as a big "innovation," but not once is that word used in reference to the Athlon, which performs at least as well (probably better clock-for-clock, as I pointed out) and got there to the seventh generation almost A YEAR AND A HALF before Intel. The one place where he FINALLY gives AMD credit is in the conclusion, and even then it's marred by renewed complaints. This is funny, since this article was allegedly a follow-up to Ace's earlier look at the P4 core by looking at the Athlon core in that light. For all the nice things finally said about the Athlon in the last paragraph, he never once used "innovative" regarding it, despite giving the moniker to the P4 at least twice.
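The trade-off being argued over here (a long misprediction penalty with a good predictor versus a shorter penalty with a weaker predictor) can be put in rough numbers. In the sketch below only the 19-cycle P4 penalty comes from the post; the branch frequency, both misprediction rates and the 10-cycle Athlon penalty are assumptions picked for illustration.

```python
def mispredict_cycles_per_instruction(branch_freq, mispredict_rate, penalty_cycles):
    """Average cycles lost to branch mispredictions per instruction executed."""
    return branch_freq * mispredict_rate * penalty_cycles


BRANCH_FREQ = 0.2  # assumed: roughly one instruction in five is a branch

# 19 cycles is the P4 penalty cited above; the 5% rate is an assumption.
p4_loss = mispredict_cycles_per_instruction(BRANCH_FREQ, 0.05, 19)

# Both the 10-cycle penalty and the 10% rate are assumptions for the Athlon.
athlon_loss = mispredict_cycles_per_instruction(BRANCH_FREQ, 0.10, 10)

print(f"P4 loss:     {p4_loss:.3f} cycles/instruction")
print(f"Athlon loss: {athlon_loss:.3f} cycles/instruction")
```

With these particular placeholder rates the two designs lose almost the same number of cycles per instruction, which is the point both sides of the argument are circling: the penalty and the predictor quality only mean something when multiplied together.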
And, as a final note, what I've just said doesn't really matter all that much, because the above poster was RIGHT: all that matters is who can deliver the most PERFORMANCE at the least PRICE. And that is, clearly, AMD. That comment *is* insightful, as far as it goes, because that's all that really matters. Why don't we all use Alphas or PowerPCs, which are much more beautiful architecturally? Because they can't give us the price/performance of an Athlon or dual P!!! system. In the final analysis, that's all that's important.
"The more corrupt the state, the more numerous the laws."--Tacitus, *The Annals*
Yep, it's pretty clear to me, too: marketing. Intel has clearly decided that MHz sells, not real-world performance. They clearly believe that the average buyer doesn't know enough to look at overall performance, particularly when there's a single, easy-to-follow number that supposedly measures speed. The sad part is that they're almost certainly correct. There are a lot of people who believe that MHz is the ultimate measure of a processor's goodness, so the hypothetical 2 GHz PIV will be obviously better than a 1.4 GHz AMD, even if the actual performance of the AMD chip is higher.
There's no point in questioning authority if you aren't going to listen to the answers.
I like how this article addresses the perception that the company not leading the market is only following. With huge high tech firms like AMD and Intel, hundreds of incredibly intelligent people are put to work to solve a complex problem while following a carefully outlined strategy. In reality, corporate warfare is much like a chess game between grandmasters. IMO, each company's strategy is a strong one, and the winner will be decided by a variety of market forces including which strategy works best for tomorrow's software (and who can tell now?). Both companies are planning masterful strategies to the problem of x86 design, and I think that as a so-called "learned layman" in the processor business, it is quite a bit of fun to sit back and watch.
Transmeta's Code Morphing technology is designed to emulate CISC architectures efficiently. Think about it: could you do an emulator for one Crusoe chip's internal architecture on another model of Crusoe chip? Goedel says it'd be quite tough.
Will I retire or break 10K?
Tom's Hardware dissects these terms a good bit, and compares the various processors that use these platforms. Be warned that Tom is a little biased as an anti-Intel kind of guy.
Calling anything from the x86 world a "masterpiece" seems, to me, like putting a gold star on the best-looking fingerpainting in the special-needs Kindergarten class.
-A.P.
--
* CmdrTaco is an idiot.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"