Intel Demos New P4 'Extreme Edition'
typobox43 writes "Louis Burns of Intel displayed a "high-definition video stream running on a 'mystery' desktop processor." This processor turned out to be the new Intel Pentium 4 Extreme Edition 3.20 GHz, with an extra 2 Megabytes of cache."
"Only in their dreams can men truly be free 'twas always thus, and always thus will be."
--Tom Schulman
Intel Developer Forum Cache for questions
By Nebojsa Novakovic: Tuesday 16 September 2003, 18:14
WHEN, AT today's IDF opening, Louis Burns demonstrated a high-definition video stream running on a "mystery" desktop processor, everyone must have thought it was the upcoming Prescott part. Wrong! It was the (also upcoming), previously unheard of, even at The Inq, Intel(R) Pentium(R) 4 processor Extreme Edition 3.20 GHz, with an extra 2 Megabytes of pron. In Intel's own words, "this new processor will be targeted at high-end gamers and computing power users."
As a matter of fact, 2MB cache will help a lot for those users whose apps (including games and such) have a lot of big cache-friendly *wink* pieces of code and data, but probably not the data-streaming intensive stuff. I do expect to see speedups anywhere from 2% to 20% depending on the application, maybe some more if using multithreading/multitasking (a large cache can keep in code / data pieces from more threads).
However, this doesn't seem to be a new CPU in reality - after all, Intel is doing very well with its XeonMP 2.8 GHz 2 MB cache CPU, and how much effort does it really take to repackage it for the 3.2 GHz / 800 FSB desktop with less stringent thermal and reliability requirements than the big iron, anyway?
Intel would gain a lot with this move. If, touch wood, there are problems with Prescott, a large-cache Pentium4 part will provide some buffer against large-cache Athlon64 (i.e. rebadged Opteron) parts. At the same time, enormous extra benefits from the economies of scale would further reduce the identical die XeonMP manufacturing cost, helping Intel compete better on the quad-CPU server front as well. Interesting move? I think so. Let's see how the beast performs in real!
Different pinout, so no.
$740 in 1,000 unit quantities. I think I'll pass.
The difference is that all the existing apps would need to be recompiled to fully use 64-bit. Even lowly DOS can benefit from a larger cache. And with Hyperthreading the number of clocks per instruction is very small, which lends itself to using a larger cache more often.
See also:
Ars Technica on Caching
Some interesting quotes:
"The performance boost is awesome," Burns said Tuesday during a speech at the Intel Developer Forum here.
"It is a Xeon with a different pin-out, or at least that's what it looks like to me," said Nathan Brookwood, an analyst at Insight 64.
Intel did not disclose the price of the Pentium 4 Extreme Edition. It likely will be as expensive as its counterpart, the 2.8GHz Xeon with 2MB cache. That chip sells for $3,692 in quantities of 1,000.
"It absolutely will be kind of pricey," Brookwood said.
I'm looking for a HEPA media filter for my TV. I'm allergic to reality shows.
Nope, the only 'desktop' 64 bit processors come from IBM and AMD;
AMD Opteron
AMD Athlon64
IBM PPC970
Intel's 64 bit solution is the Itanium! Anything with the Pentium moniker is 32 bit. The Itanium is the one which suffers 32 bit emulation lag.
So if you want 64 bit, you're stuck with, realistically, a Mac or some brand of Athlon CPU.
GPL Deconstructed
You're thinking of the Itanium. It used to run 32bit X86 under a hardware emulator, but that was about as fast as the Pentium MMX. Intel has since switched to using a software emulator, something like Transmeta does with the Crusoe, and it's actually faster than the hardware emulator, about the same speed as a Pentium III now.
The Xeon is a Pentium4 in different packaging and with SMP enabled. Actually, SMP is probably enabled with the Pentium4 too, but since there are no such motherboards and you can't plug them into Xeon DP mobos, nobody can test that. Xeons already come in versions that have up to 8MB of L3 cache; the new Pentium 4 is probably just a rebadged Xeon certified to run on an 800MHz bus.
The processor-intensive software is already here. It is called HSpice, Verilog, fluid-dynamics simulation, etc. The Pentium 4 has done nicely in the engineering workstation market, and the "Extreme Edition" should do even better.
Please check the SPEC web site for a performance evaluation of the Pentium 4's floating-point (FP) performance. In particular, it outperforms the UltraSPARC III even though the latter has a 2-to-1 advantage in the width of its databus -- 64 bits versus 32 bits.
What changed the x86 chips from also-ran losers in FP performance to the kings of the hill? SSE.
The SSE extension to the x86 instruction set architecture (ISA) opened up a whole new world of applications for the Pentium III and successors. Older Pentiums were saddled with a FP stack that hurt their performance. The SSE extension established a directly addressable bank of 8 128-bit registers, each holding four 32-bit floats (32 values in all), for FP operations. As a result, the Pentium 4 outperforms the UltraSPARC III on video applications.
At 3.2 GHz, the "Extreme Edition" of the Pentium 4 should help the Pentium 4 to capture even more of the engineering workstation market. Nowadays, the first-choice workstation among engineers in Silicon Valley and Boston's Route 128 is Linux running on a fast Pentium/Athlon, not Solaris lumbering on a slow UltraSPARC III.
The second part of their article is here.
Thank you. Drive through.
Computer architecture 101.
Average memory access latency per memory access =
(L1_hit_rate * L1_hit_cycle_time) +
(L1_miss_L2_hit_rate * L2_hit_cycle_time) +
(L2_miss_L3_hit_rate * L3_hit_cycle_time) +
(L3_miss_rate * DRAM_latency)
80-95% of your accesses will hit the 8k L1 in typical applications. This is the vast majority of the accesses. The latency of this cache is TINY on a P4. Do the math for a 3.2GHz 3 cycle cache.
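To actually do the math, here is a rough sketch of the formula above in Python. Only the 3.2GHz clock and the 3 cycle L1 come from this post; the hit fractions for the outer levels and all the other latencies are made-up illustrative numbers, not measured P4 figures.

```python
def average_access_cycles(levels, dram_cycles):
    """Average cycles per memory access.

    levels: list of (fraction_of_all_accesses_served, latency_in_cycles),
            one entry per cache level, innermost first.
    Whatever fraction is not served by any cache level goes to DRAM.
    """
    served = sum(frac for frac, _ in levels)
    if served < 0.0 or served > 1.0 + 1e-9:
        raise ValueError("served fractions must sum to at most 1")
    cache_part = sum(frac * latency for frac, latency in levels)
    return cache_part + max(0.0, 1.0 - served) * dram_cycles


CLOCK_GHZ = 3.2   # from the post
L1_CYCLES = 3     # from the post

# One L1 hit in nanoseconds: cycles divided by cycles-per-nanosecond.
l1_hit_ns = L1_CYCLES / CLOCK_GHZ

# ASSUMED numbers, for illustration only:
#   90% served by L1 (3 cycles), 7% by L2 (18 cycles),
#   2% by L3 (40 cycles), remaining 1% by DRAM (300 cycles).
levels = [(0.90, L1_CYCLES), (0.07, 18), (0.02, 40)]
avg_cycles = average_access_cycles(levels, dram_cycles=300)
avg_ns = avg_cycles / CLOCK_GHZ

print(f"L1 hit: {l1_hit_ns:.4f} ns")
print(f"average access: {avg_cycles:.2f} cycles = {avg_ns:.3f} ns")
```

With these assumed numbers the 1% of accesses that reach DRAM cost more in total than the 90% that hit L1, which is the whole argument for big outer caches.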
Given a curve of cache-size vs. latency and hit rates for all the cache sizes, the optimal hierarchy is a simple optimization problem. I can assure you that this equation has been solved and the optimal hierarchy has been chosen (given the other obvious constraints of die size and power).
Quadrupling the L1 will double the latency and kill your average access time, making your chip almost certainly slower.
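A quick way to see why a slower L1 is so hard to pay back: collapse everything behind the L1 into one average miss cost and ask what hit rate the bigger, slower L1 would need just to break even. This is a simplified two-level model, and the 90% hit rate and 30 cycle miss cost are assumptions picked for illustration; only the 3 cycle baseline and the doubling come from this thread.

```python
def avg_cycles(hit_rate, hit_latency, miss_cost):
    """Two-level model: an access costs hit_latency if it hits L1,
    otherwise miss_cost (the average cost of going past L1)."""
    return hit_rate * hit_latency + (1.0 - hit_rate) * miss_cost


def break_even_hit_rate(base_hit_rate, base_latency, new_latency, miss_cost):
    """Hit rate the slower L1 needs to match the baseline average.

    Solves  h * new_latency + (1 - h) * miss_cost = baseline  for h.
    A result above 1.0 means no hit rate can make up the difference.
    """
    baseline = avg_cycles(base_hit_rate, base_latency, miss_cost)
    return (miss_cost - baseline) / (miss_cost - new_latency)


# ASSUMED: 90% L1 hit rate, 30 cycles average cost for an L1 miss.
baseline = avg_cycles(0.90, 3, 30)
needed = break_even_hit_rate(0.90, 3, 6, 30)

print(f"baseline: {baseline:.2f} cycles per access")
print(f"hit rate needed with a 6 cycle L1: {needed:.4f}")
```

Under these assumptions the required hit rate comes out above 100%: even a perfect 6 cycle L1 averages 6 cycles, which is worse than the 5.7 cycle baseline.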
Bigger caches mean longer latencies. It's limited by the basic laws of physics. There's only so much distance you can traverse in a certain amount of time and larger caches have longer distances (meaning higher RC delays).
The reason we want larger outer level caches is because the DRAM_latency is enormous and has an impact on average access time. Hardware prefetching can also help to alleviate this problem. This solution is available on both Athlon and P4 chips and will only get better in the future because it is absolutely critical to hide this DRAM latency.
Ok, now to address the notion that more registers will improve performance...
You won't get as much performance out of more registers as you might think. First of all, when the compiler runs out of registers it spills the excess to the stack -- pushing it out with a store (spill) and reading it back in with a load (fill).
In modern processors (just about every chip out on the market), there is the concept of store buffers. Each store writes its data to a store buffer. Subsequent loads that require data from stores get their data by forwarding out of the store buffer. So -- the spilled store writes the buffer and the fill load reads the buffer -- all of this happening much faster than a memory access because it's just reading out a local on-chip buffer, so the load looks more like a fast register read. This architectural trick emulates the effect of having more registers, subject to the size of your store buffer. There are even more advanced architectural tricks you can play to completely eliminate the spill-fill pair from the critical path (look up memory-renaming in the literature).
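For anyone who wants to see the spill/fill forwarding idea in motion, here is a toy model in Python. It is a sketch of the concept only: the class name, the one-entry-per-address simplification, the oldest-first drain policy and the addresses are all invented for illustration, and real store buffers deal with ordering, partial overlaps and speculation that this ignores.

```python
from collections import OrderedDict


class StoreBuffer:
    """Toy store buffer with store-to-load forwarding (hypothetical model)."""

    def __init__(self, capacity, memory):
        self.capacity = capacity
        self.memory = memory            # dict standing in for cache/DRAM
        self.pending = OrderedDict()    # addr -> value, oldest entry first
        self.forwarded = 0              # loads satisfied from the buffer
        self.from_memory = 0            # loads that had to go to memory

    def store(self, addr, value):
        if addr in self.pending:
            # Simplification: keep a single, youngest entry per address.
            del self.pending[addr]
        elif len(self.pending) == self.capacity:
            # Buffer full: drain the oldest store out to memory.
            old_addr, old_value = self.pending.popitem(last=False)
            self.memory[old_addr] = old_value
        self.pending[addr] = value

    def load(self, addr):
        if addr in self.pending:
            self.forwarded += 1         # fill served by forwarding
            return self.pending[addr]
        self.from_memory += 1           # slow path
        return self.memory.get(addr, 0)


sb = StoreBuffer(capacity=2, memory={})
sb.store(0x100, 11)       # spill
sb.store(0x104, 22)       # spill
first = sb.load(0x100)    # fill: forwarded from the buffer
sb.store(0x108, 33)       # third spill pushes 0x100 out to memory
second = sb.load(0x100)   # fill: now has to come from memory
print(first, second, sb.forwarded, sb.from_memory)
```

The second fill shows the "subject to the size of your store buffer" caveat: once more values are spilled than the buffer can hold, the oldest ones are no longer a fast read.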
If you're worried about chip-real estate, you should be very concerned that a 64-bit application's pointers will take up twice as much space effectively making your caches and memory bandwidth appear smaller.
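To put a number on the pointer overhead, here is a back-of-the-envelope sketch. The node layout (a 4-byte payload plus two pointers, as in a doubly linked list) is a hypothetical example, and the sizes are computed without padding; a real 64-bit compiler would most likely pad the node further for alignment, which makes the picture worse. The 2MB figure is the cache size from the story.

```python
import struct

# "<" means standard sizes with no padding, so each size is a plain sum.
NODE_32 = struct.calcsize("<iII")   # 4-byte payload + two 4-byte pointers
NODE_64 = struct.calcsize("<iQQ")   # 4-byte payload + two 8-byte pointers

CACHE_BYTES = 2 * 1024 * 1024       # the 2MB cache from the story


def nodes_that_fit(cache_bytes, node_bytes):
    """Whole nodes that fit in the cache, ignoring tags and line granularity."""
    return cache_bytes // node_bytes


fit_32 = nodes_that_fit(CACHE_BYTES, NODE_32)
fit_64 = nodes_that_fit(CACHE_BYTES, NODE_64)

print(f"32-bit pointers: {NODE_32} bytes per node, {fit_32} nodes in cache")
print(f"64-bit pointers: {NODE_64} bytes per node, {fit_64} nodes in cache")
print(f"cache holds {fit_64 / fit_32:.0%} as many nodes with 64-bit pointers")
```

How much this matters depends entirely on how pointer-heavy the data is; an array of floats is the same size either way.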
SSE is single-precision (32bit) floats only, so pretty useless for scientific calculations (usually require doubles).
However, I believe the Intel compiler uses SSE2 (which can handle 64bit floats) exclusively for float code, since the P4 legacy fpu is just slow. Of course there are compiler switches for the compiler so the code also runs on good old Athlon, Athlon XP, PIII (which lack SSE2, the Athlon also lacks SSE) - and those aren't exactly slow doing float calculations either.
Err, no. The internal data bus of the P4 is 256-bits wide, at least if you're talking about its L2 cache bus. L1 cache doesn't really have a "bus", especially not the P4's trace cache (its replacement for an L1 i-cache), but if my memory serves me correctly, the L1 d-cache of the P4 can read or write a pair of 64-byte values in 2 clock cycles. I guess that makes its "bus" 128 bytes (not bits) wide. I don't know the bus width of this new L3 cache on this P4 "Extreme", aka a XeonMP, but I would guess it's 64-bits wide.
I haven't got a clue as to the internal data bus of the USIII, but I would guess that it's either 128-bit or 256-bit wide. Side note: the Power4 uses a MASSIVE 1024-bit wide internal bus, one of the reasons for its impressive performance.
The only situation where the USIII has 64-bits and the P4 has 32-bits is if you are talking about integer registers or memory pointer width, neither of which are going to play a role in Spec CFP scores.
1) Get rid of 20-stage pipeline, it's too long for anything serious.
No, it's not. In fact, according to this research, the P4 pipeline is not deep enough. That paper concludes that P4 performance could be improved by up to 90% by increasing the pipeline depth to around 50 stages and increasing the cache size.
Do you actually think that Intel didn't know the consequences of increasing the pipeline depth? The Intel engineers didn't just guess at the P4 architecture; it was a very deliberate design decision. Judging by the P4's performance gains, it was a pretty good decision, too.
"The defense of freedom requires the advance of freedom" - George W Bush
No.... we won't. What you are describing is insane. Come on: 3.2GHz x 32 bits? Access/transfer times over a full scale bus with a latency in picoseconds? Um... no.
There is a reason no one has done that yet - made system RAM the same speed as the CPU - and it ain't economics: it is physics. Nature does not take bribes.
Look, it isn't that it is too expensive to make fast RAM. And it isn't the distance - it is the capacitance. The problem with fast RAM is getting that signal off chip to the CPU. And the wires that connect the RAM and CPU are orders of magnitude higher capacitance than the wires on chip. That is a fundamental problem which you won't overcome without a fundamental change in how you move the data around.
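For scale, here is what "3.2GHz x 32 bits" works out to, next to the 800 FSB mentioned in the article. The comparison assumes the front-side bus is 64 bits wide and moves data 800 million times per second; those two figures are my assumptions for the sketch, not something stated in this thread.

```python
def bandwidth_gb_per_s(transfers_per_second, bits_per_transfer):
    """Peak bandwidth in GB/s (decimal gigabytes)."""
    return transfers_per_second * bits_per_transfer / 8 / 1e9


CORE_HZ = 3.2e9

# RAM running at full core speed, one 32-bit word per cycle.
core_speed_ram = bandwidth_gb_per_s(CORE_HZ, 32)

# One core clock period in picoseconds: the time budget per transfer.
cycle_ps = 1e12 / CORE_HZ

# ASSUMED: 800 million transfers per second over a 64-bit front-side bus.
fsb = bandwidth_gb_per_s(800e6, 64)

print(f"RAM at core speed: {core_speed_ram:.1f} GB/s, {cycle_ps:.1f} ps per transfer")
print(f"800 FSB (assumed 64-bit): {fsb:.1f} GB/s")
```

The bandwidth gap is only a factor of two here; the hard part is the 312.5 ps per transfer, which is the off-chip signalling problem described above.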
Um.. no. Never will that be the case except in situations where using an archaically small amount of processing power is adequate. Storage technology, as it is formulated now, cannot approach the speed of access, communication, and storage that even a low grade CPU would use for cache.
Maybe - MAYBE when we are using diamond wafers, high-temperature-superconductor-nanotube-quantum-