Intel to Increase Stages in Prescott
Alizarin Erythrosin writes "Further contributing to the MHz Myth, The Register and ZDNet are reporting that the new P4 core, codenamed Prescott, will have a longer pipeline than Northwood. No official numbers have been released, but The Reg is saying an Intel spokesman said that 30 stages seems to be a reasonable estimate. As most of us know, a longer pipeline can lead to slowdowns in the form of branch mispredictions and pipeline stalls. 'And just as the PIII proved faster than the early P4s in some applications, it's likely that Northwood will similarly prove faster than Prescott, which has clearly been designed for speeds of the order of 4GHz.'"
It'll most likely be slower per clock cycle.
What this means is that it will take a higher clock speed (4GHz, for instance) to do the same amount of processing as the Northwood core. However, lengthening the pipeline should allow Intel engineers to achieve higher clock speeds, since each stage does less work and the longest signal path within any one stage will likely be shorter.
In essence, Intel is attempting to increase the speed of their CPUs by focusing on increasing the clock speed (P4), while AMD is focusing on increasing the number of calculations per clock cycle (Hammer).
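The tradeoff between work per clock and clock speed can be put into numbers with a rough sketch. The IPC figures and the 3.2GHz baseline below are made-up values for illustration only, not measurements of Northwood or Prescott, and the function names are hypothetical.

```python
def throughput(ipc, clock_hz):
    """Instructions per second = instructions per cycle * cycles per second."""
    return ipc * clock_hz


def clock_needed(target_throughput, ipc):
    """Clock speed a core with the given IPC needs to hit a target throughput."""
    return target_throughput / ipc


# Hypothetical baseline: a shorter-pipeline core averaging 1.0 IPC at 3.2GHz.
baseline = throughput(ipc=1.0, clock_hz=3.2e9)

# Hypothetical deeper-pipeline core that only averages 0.8 IPC on the same code.
deeper_ipc = 0.8
required = clock_needed(baseline, deeper_ipc)

print(f"Clock needed to break even: {required / 1e9:.1f} GHz")
```

With these assumed numbers the deeper core has to run at about 4GHz just to tie the baseline; anything above that is the actual gain.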
Of course, there are a lot of more complex tradeoffs that factor in (e.g. branch prediction). I highly recommend reading a computer architecture book if you're at all interested. It's really fascinating stuff.
-=Lothsahn=-
Re-read The Register article. It's not the Intel guy who said 30 stages, it's The Register who is guessing. They're assuming that since it went from 10 to 20 before it'll go from 20 to 30 now. It's not likely to end up being more than a few extra stages.
I suppose that this makes having a good compiler a little more important. Compiling the same program for a G4 on a compiler other than GCC gave me a 100% speed boost. I don't know if branch misprediction came into play, but it had a conditional in its inner loop (it displayed the Mandelbrot set).
I had found an interesting article exposing the innards of the 775 pin Prescott -- see it here
(Credit: Got it off The Register from this article)
Intel P4 and Xeon beat 4 of the 5 you name on SPEC.
There may be no oil in Afghanistan, that's why I didn't say oil wells. Also, you must not have heard about the huge plans for a giant ass pipeline that will pass right through Afghanistan.
--
WHO ATE MY BREAKFAST PANTS?
In case anyone wants some hard facts:
A. Hartstein and Thomas R. Puzak (IBM): The Optimum Pipeline Depth for a Microprocessor, ISCA 2002.
M.S. Hrishikesh, Norman P. Jouppi, Keith I. Farkas, Doug Burger, Stephen W. Keckler, Premkishore Shivakumar (UT Austin, Compaq): The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays, ISCA 2002.
Eric Sprangle, Doug Carmean (Intel): Increasing Processor Performance by Implementing Deeper Pipelines, ISCA 2002.
A. Hartstein and Thomas R. Puzak (IBM): Optimum Power/Performance Pipeline Depth, MICRO 2003.
What all these papers have in common is that they find that increasing the pipeline depth past 20 stages increases performance.
Read "Understanding Pipelining and Superscalar Execution" http://slashdot.org/articles/02/12/19/1810214.shtml?tid=137 . Extended pipelines _can_ improve performance. However, the compiler _needs_ to understand how to take advantage of it. Otherwise, you could end up with slower code.
I would, but I could just get the PCWorld Athlon FX-51@2.2GHz (almost identical to an Opteron 148) vs. 2xOpteron 246 (2.0GHz) vs. Athlon 64 3200+ vs. P4 3.2 vs. 1.8 G5 vs 2x2.0 G5 benchmarks, and see that in all benchmarks except Photoshop (on the dual G5), Quake III on the A64 and O246 (probably the SMP), and Word on the O246, the x86 CPUs *MURDERED* the Macs. Yes, even the P4. BTW, the AMD CPUs did well against the P4, except in the Quake III and Word benchmarks (Intel optimized code, maybe - Q3 is definitely Intel-optimized, but WORD?)
Generally one of the best processor architecture books out there is Computer Organization and Design. It does assume some background in digital logic design (flip-flops, clocks, multiplexers and other basics), though it does have an appendix which briefly glosses over those. Honestly, to really "get" it you need an education in it.
-
No processor, barring a complete architecture change (in which case it's a different processor entirely), will double its performance simply by doubling the clock speed.
It really depends on how you define performance too and what your software is doing. Doing heavy I/O? Processor has little to nothing to do with I/O - it just hands it off to the bus and I/O controllers to take care of and then does something else while waiting for the interrupt.
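A back-of-the-envelope way to see why doubling the clock does not double performance is an Amdahl's-law style estimate: only the part of the run time that is actually bound by the core shrinks with the clock, while time spent waiting on memory or I/O stays put. This is a simplified sketch, and the fractions used are assumptions rather than measurements of any real workload.

```python
def speedup_from_clock(cpu_fraction, clock_ratio):
    """Overall speedup when only the CPU-bound fraction of run time
    scales with the clock (Amdahl's-law style estimate).

    cpu_fraction: share of the original run time limited by the core (0..1).
    clock_ratio:  new clock divided by old clock (2.0 = doubled).
    """
    waiting = 1.0 - cpu_fraction          # memory / I/O time, unchanged
    computing = cpu_fraction / clock_ratio  # core-bound time, shrinks
    return 1.0 / (waiting + computing)


# Assumed workloads: fully core-bound, mostly core-bound, heavy I/O.
for fraction in (1.0, 0.8, 0.5):
    print(f"{fraction:.0%} CPU-bound -> {speedup_from_clock(fraction, 2.0):.2f}x")
```

Under this model, a job that spends half its time waiting on I/O gets only about 1.33x from a doubled clock.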
-
This was already implemented on the PowerPC 601 and 603 (and possibly others, my book is getting rather old). Additionally, the Alpha 21064 and 21064a processors could optionally guess a branch as taken if it went back (loops), and not taken if it went forward (ifs).
Most processors nowadays use dynamic prediction, basing current predictions upon whether earlier branches were taken or not taken. The branch unit on the P4 predicts with an accuracy of about 95%.
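For anyone who wants to play with the idea, here is a minimal sketch of one classic dynamic scheme, a single 2-bit saturating counter. It is a textbook toy, not a model of the P4's actual branch unit, and the branch patterns fed to it are invented for illustration.

```python
def simulate_two_bit(outcomes, state=2):
    """Fraction of branches predicted correctly by one 2-bit saturating counter.

    outcomes: sequence of booleans, True meaning the branch was taken.
    state:    counter value 0..3; 0 and 1 predict not taken, 2 and 3 predict taken.
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move the counter toward the real outcome, saturating at 0 and 3.
        if taken:
            state = min(3, state + 1)
        else:
            state = max(0, state - 1)
    return correct / len(outcomes)


# Loop-closing branch: taken 9 times, then falls through once, repeated.
loop_branch = ([True] * 9 + [False]) * 100
# Worst case for this counter: a branch that flips direction every time.
alternating = [True, False] * 500

print(simulate_two_bit(loop_branch))   # 0.9
print(simulate_two_bit(alternating))   # 0.5
```

The loop branch is only mispredicted on each loop exit, while the alternating branch defeats this simple counter half the time. Real predictors keep per-branch history to handle patterns like that.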
One more interesting way of doing it is to try executing both paths at the same time, and throwing out the one that is incorrect. This requires a lot more logic (although Pentium 4s already include "hyperthreading", and this is somewhat similar), and with such high accuracies it would probably be much worse than the current way of executing.
Ewige Blumenkraft.
You can also go back and "fix" instructions to an extent (and not in all cases) while in the pipeline in case of incorrect branching. x86 sort of sucks for this though because of the variable length instructions.
A lot of computer science is based on those kinds of statistics. You see it in memory management as well. Most data structures are created and quickly destroyed. But those that aren't tend to stay around for a very long time and not point to quickly created and destroyed ones.
-
They rate their processors based on how many times faster than a Duron 1 GHz runs. Thus, an AthlonXP3000+ runs three times as fast.
Where the hell do you get that from? The "quanti" speed rating is supposed to compare directly to the P4.
No, surely AMD will simply change their metric to match whatever Intel is putting out. IMHO there's no way AMD will label something 4000 when it's faster than a PV 4400. That defeats the *whole point* of not using the real clock speed in the first place.
Mobile Pentium 4: Cooled-down P4
Pentium 4-M: Redesigned cooler yet P4
Pentium M (Centrino): Redesigned Pentium III to take advantage of modern technology (400MHz bus, SSE2, etc.) and be cooler yet.
Celeron M: Pentium M failure/economic bin. Half the cache.
There wasn't much difference on IPC, but AMD did make a 386DX/40, whereas Intel only made a 386DX/33. 8088 was identical IPC and clock (4.77MHz, Intel design, Intel and AMD build), but 80286 wasn't on clock (was on IPC) (6 to 25MHz, Intel design, Intel (6-12MHz), AMD (6-20MHz), Harris (6-25MHz) build).
For those into the technical side of this type of stuff and a heck of a lot higher S/N ratio, check out the Ace's Hardware forum. There's a large thread going on over there talking about the rumors and what it would actually mean.
SSE does standard IEEE 754 single or double precision math. The Pentium 4's SSE2 unit (actually its FPU, but that's a detail) can handle 4 single-precision or 2 double-precision operations per cycle.
A deep unwavering belief is a sure sign you're missing something...
MMX doesn't do FP [it's int only].
Both SSE and 3DNOW use formats the normal FPU can read so I'd say it's standard [hint: you can assign an array of two well aligned floats to a 3dnow 64-bit word and use it].
SSE supports both double/float precision [as another poster pointed out]. Heck even the Athlon supports SSE [though I wouldn't use it. Hint: SSE reg == 128-bits and the Athlon CPU can only perform up to 64-bits of read per cycle...]
Tom
Someday, I'll have a real sig.
I thought that SSE and MMX both had significantly lower precision than standard IEEE floating point ops. If I'm wrong, please correct me, but if it is lower precision, it makes it useless for Real Work(tm).
It performs precise math by default. You can only use 32 or 64 bit floats; the "long double" 80 bit floats are not supported. But this often isn't a problem. You can also turn off denormals, along with interrupts on bad math (divide-by-zero type stuff). Turning those off hasn't given me any performance boost, but I still consider these things features not bugs. There are some low precision operations available, but no compiler I know of uses them unless you ask for 'em. I do in some cases but then I know what I'm getting.
A math person may give you a better answer than me. I'm a graphics person, a field where SSE2 is a godsend compared to the stack based floating point units that came before.
AMD's higher end is a bit pricey, but that's to be expected. Intel can't compete in the mid range, and is getting totally killed on the low end. An Athlon XP 2200 is around $60. That's less expensive than the slowest P4 - the 1.4GHz. It's even cheaper than the lowly Celeron 2.2GHz. In that sense, Intel is way overpriced. By the way, Intel's latest chips run just as hot as their AMD counterparts. The days of the cool running PIII are over.
Your EE or ME or ChemE full professor as a grad student could have written a FORTRAN program to compute some stuff and write output to a numeric text file or perhaps draw some plots using a subroutine library. You are probably thinking that anyone who can't sling together C programs using VI to draw graphics straight to X is a luser, but I am talking about pretty technically savvy people who don't have time to spend on this stuff and who employ armies of Engineering majors from foreign lands who are not up on this stuff either.
My own take is that if a particular numerical calculation can be easily programmed by some package, it must not be on the cutting edge of research because someone has already done it. Besides, if your software package is really deep, most of the effort goes into the architecture and the data flows and into graphics, and the RAD bit is only simplifying a tiny part of what you are spending your time on. A high-power scientific data visualization is really a video game, and how many video games are implemented in Matlab?
But what Perl is to text processing, Python is to collections, and VB is to slinging together a GUI, Matlab is to numerics (what used to be FORTRAN libraries) -- it may not have the best algorithms, but it has a lot of algorithms -- it has a semi-decent scripting language, and it has some facility with producing plots from your computations and other data.
Now that's the thing -- if you are doing matrix operations or using some canned function (most likely C under the hood), Matlab is as fast as fast can be. The minute you start looping in Matlab, it is interpreted and the speeds are in the Python range.
Before you knock it completely, it has very good integration with Java modules -- more seamless than with C modules. While Java may be pokey for its GUI, for tight numeric loops the JIT is almost as fast as C -- no joke, a person should consider writing numeric extensions to Matlab in Java of all things, especially on Windows where they tweaked up Java 1.4.2_03. And how many scripting languages (OK, Jython) have this level of Java integration?
But as a scripting language, Matlab has its shortcomings. It started out as a matrix calculator and has had features grafted on in a hodge-podge Visual Basic 6.0 kind of way. In terms of its data type restrictions and fubar scoping rules and brain-dead object extensions, I don't think, as they say, it scales very well.
My other peeve is that it is proprietary, and while MathWorks is not Microsoft, I worry if engineering schools, emphasizing use of "commercial packages students will use in the real world when they graduate" (as opposed to professors dinking around with their homebrew software for use in instruction), are becoming trade schools shilling for the big software houses. I don't have a lot of experience with it, but in place of Matlab we should be using stuff like Python and the Python NumPy extension -- Open Source alternative, comparable performance, C extensions for speed, but much more Turing complete, consistent, and scalable.
And where is Matlab 6.5 using Java internally? Try doing a Files Open to start editing a Matlab script (M-file) with the Matlab editor window. One potato, two potato, three potato, and the window comes up. Now what language has that kind of GUI lag, I wonder what it could be?
Before you run off blaming the evil Marketing demons, let me ask you this.....what readily quantifiable measure would you use instead to compare systems for the broad range of users and applications - all other things being the same? (memory, disk, etc.)
... other things might be good. As with any benchmark, there are always caveats and special conditions involved. If one simply averages the scores of many benchmarks, things happen such as one candidate doing ridiculously well on one (possibly unimportant) part of the benchmark, thus throwing the average way out of kilter.
Imperfect a measure as it may be, it's a hell of a lot easier to relate to and compare than "how many FPS of Quake3 can I get?" or "how quickly can it compile the 2.6 kernel?"
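The point above about one outlier throwing a simple average out of kilter is easy to demonstrate. The scores below are invented (each is a speed relative to some baseline machine, 1.0 meaning equal), and the geometric mean is brought in here only as one common way to damp a single outlier. Nobody in the thread proposed it.

```python
import math


def arithmetic_mean(scores):
    """Plain average of the scores."""
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """n-th root of the product of the scores, computed via logarithms."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical relative scores on four sub-benchmarks.
machine_a = [0.9, 0.9, 0.9, 6.0]   # slightly slow everywhere, huge win on one test
machine_b = [1.5, 1.5, 1.5, 1.5]   # consistently faster

for name, scores in (("A", machine_a), ("B", machine_b)):
    print(name,
          f"arithmetic={arithmetic_mean(scores):.2f}",
          f"geometric={geometric_mean(scores):.2f}")
```

With these made-up numbers the plain average ranks machine A first on the strength of a single test, while the geometric mean ranks the consistently faster machine B first.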
That very question has long been a topic of heated debate. Years ago, AMD launched an initiative to create a nonbiased (so they say), general purpose universal benchmark. It never went anywhere as far as I know.
Overall, Winbench 'XX is a good benchmark because it shows actual performance in real-world applications (albeit somewhat old ones). For games, the only reliable means of benchmarking is to test those individual games, or at least assume similar performance across many games that use the same game engine. The game industry is converging because of the extreme difficulty of developing truly sophisticated 3D graphics engines. I predict that within 5 years, there will be at most 3-5 major game engines used by 90% of high-budget games. A general benchmark of these 3-5 engines (or however many there turn out to be) could be used, either taking their average and giving an overall "gaming score", or predicting the performance of the many games based on each engine based on extensive benchmarking of a few titles using each.
Server benchmarking is not an issue, because those involved in the tests often know what they are doing.
As far as unix benchmarking, well, that is a major pain in the ass. That certainly does not mean that we should rely on clockspeed, or god forbid on BogoMIPS. A standard benchmark based on the compilation time of a certain version of BASh was proposed not too long ago. Because many Unix geeks are developers, this would not be a bad start. As for pure CPU tests, perhaps a mix of BZip2, large-scale encryption, and
Benchmarking is a science, an art, and a rather large pain in the ass.
Your point is well taken though.
Computer Science is no more about computers than astronomy is about telescopes. --E. W. Dijkstra
The decode bandwidth is a single x86 instruction per clock, but that's not a huge problem because of the trace cache. The issue bandwidth is three u-Ops per cycle, but this isn't a huge limitation because the P4 is a relatively narrow architecture. It's only got two ALUs and two FPUs compared to an Athlon's three ALUs and three FPUs.
A deep unwavering belief is a sure sign you're missing something...
You have to remember that a garden variety PC is a very unpredictable environment. You have network packets coming in, mouse events, keyboard presses, USB chatter, DMA access. Every event generates an interrupt that requires the processor to stop what it's doing and start the pipeline over again.
It's nothing personal, but articles like this one, as well as posts like this, drive me absolutely batty with the amount of incorrect ideas propagated. It's not that one particular person is misinformed -- it's just that the amount of generally bogus information is silly.
First off, at some point, as far as I can tell, a bunch of people read Maximum PC or somesuch consumer "PC enthusiast" magazines, and read about "The Megahertz Myth". Maybe Ars Technica ran the story that started all this. Heck if I know. All that the original author was trying to do was point out that people shouldn't judge processors strictly by clock speed.
Boy, did they ever create a monster. Somehow, a bunch of folks managed to get the idea that Intel was pulling this as some sort of PR job to deliberately trick people into buying their processors. For Chrissake, this is such an incredibly stupid idea. The OEMs have purchasers that know what they're buying. Not only are they not going to just sit down and look at benchmarks, they're going to have a bunch of test machines built when deciding what to go with. That and business considerations outweigh any "MHz rating". The OEM market just plain doesn't care. The only people getting excited about the "MHz Myth" are the "PC enthusiasts", a tiny, tiny sliver of a group when it comes to dollar value. If the sort of "PC enthusiast" riffraff really think that they constitute any kind of a significant market to Intel -- enough for Intel to *redesign their entire processor*, using a longer pipeline and higher clock rate, around getting them to purchase a computer, they are vastly overestimating their own importance in the universe.
When Intel makes the decision about a new processor, it's a pretty safe bet that they don't run out and say "Gee, how would Joe Assmunch in Marketing like us to structure this thing?" They have many, many PhDs in chip and circuit design who have many competing ideas about what the best designs would be. They run many, many simulations before even thinking about deciding on major design decisions.
The "PC enthusiast" folks who think that Intel has taken this path to trick those people that buy from Dell, and that, ho ho ho, *they* are smart enough to see through the trick are ridiculous. If Intel wanted a high clock rate to put on stickers, they could jack the thing through the sky, run at 10GHz, then demux data and only accept data at a lower rate into the various units. Some of the units would move to even more instructions per cycle.
The *current* poster is talking about *keyboard* and *mouse* events? "USB chatter"? Those don't even show up on the *radar*. You roll that mouse, send your 200 Hz interrupts, and you worry about 200 measly mispredictions per second? Just blowing away the page table cache during process switches (which runs at 100 Hz on Linux 2.4 x86 by default) already dwarfs any misprediction performance hit from the said devices, and folks frequently bump it up by an order of magnitude or so and don't see any measurable performance hit -- on Pentium IIs.
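To put rough numbers on that: the 200 interrupts per second comes from the paragraph above, but the 3GHz clock, the 30-cycle refill (the rumored pipeline depth) and the deliberately pessimistic 1000-cycle figure are all assumptions picked for illustration.

```python
def overhead_fraction(events_per_sec, penalty_cycles, clock_hz):
    """Share of all clock cycles lost to a per-event penalty."""
    return events_per_sec * penalty_cycles / clock_hz


CLOCK_HZ = 3.0e9        # assumed 3GHz part
MOUSE_IRQS = 200        # interrupts per second, from the post

# One full refill of an assumed 30-stage pipeline per interrupt.
refill_only = overhead_fraction(MOUSE_IRQS, 30, CLOCK_HZ)
# Pessimistic guess: 1000 cycles lost per interrupt, handler and all.
pessimistic = overhead_fraction(MOUSE_IRQS, 1000, CLOCK_HZ)

print(f"pipeline refill only: {refill_only:.1e} of all cycles")
print(f"pessimistic estimate: {pessimistic:.1e} of all cycles")
```

Even the pessimistic guess works out to well under a hundredth of a percent of the machine's cycles, which is the "doesn't even show up on the radar" point.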
As for DMA, the entire point of DMA is so that the processor *isn't* running code from the host. It can continue on in its own happy little world while a co-processor pokes at the memory bus.
You might see significant branch misprediction issues with an inner loop with a branch statement that flicks back and forth just about every loop or so to screw over the branch caching. And "significant" is still pretty minor. The compilers hint to the CPU whether a branch is likely to be taken...it's not as if there's this massive, awful mistake that all the chip designers in the world are making that Joe I-Built-My-Own-Computer-
May we never see th
Note to mods: parent is clearly wrong. How did this get +5? As others have stated, the AMD rating is an estimation of how fast their processor is compared to an Intel Pentium 4 running at the PR speed in megahertz.
No. You are clearly wrong. The PR rating is relative to an AMD Thunderbird Core. If you don't know what you are talking about, you should just shut up. Here is a link and here is another.
Intel are shouting about megahertz because it's all they have. For most real world applications (i.e. not encoding video) the Pentium 4 cores are abysmally inefficient. Anything that is branch heavy (such as a compiler, for example) is a complete nightmare for a P4.
For that matter, I'm writing a video encoder in my spare time, and the AMD chips are still a better match for the sort of stuff I am doing.
---
Present-day x86 chips aren't limited by their FP processing speed
The problem with x87 is not speed. It uses an antiquated programming model built around a stack, so you have to shuffle things around in the stack to make it work, and this takes a lot of time.
SSE2, OTOH, is very easy and fast. 2 calculations at the same time, and in the format A+B=C
how long until