Intel Demos New P4 'Extreme Edition'
typobox43 writes "Louis Burns of Intel displayed a "high-definition video stream running on a 'mystery' desktop processor." This processor turned out to be the new Intel Pentium 4 Extreme Edition 3.20 GHz, with an extra 2 Megabytes of cache."
"Only in their dreams can men truly be free 'twas always thus, and always thus will be."
--Tom Schulman
Intel Developer Forum Cache for questions
By Nebojsa Novakovic: Tuesday 16 September 2003, 18:14
WHEN, AT today's IDF opening, Louis Burns demonstrated a high-definition video stream running on a "mystery" desktop processor, everyone must have thought it was the upcoming Prescott part. Wrong! It was the (also upcoming), previously unheard of, even at The Inq, Intel(R) Pentium(R) 4 processor Extreme Edition 3.20 GHz, with an extra 2 Megabytes of cache. In Intel's own words, "this new processor will be targeted at high-end gamers and computing power users."
As a matter of fact, the 2MB cache will help a lot for those users whose apps (including games and such) have a lot of big cache-friendly *wink* pieces of code and data, but probably not the data-streaming intensive stuff. I do expect to see speedups anywhere from 2% to 20% depending on the application, maybe some more if using multithreading/multitasking (a large cache can hold code / data pieces from more threads).
However, this doesn't seem to be a new CPU in reality - after all, Intel is doing very well with its XeonMP 2.8 GHz 2 MB cache CPU, and how much effort does it really take to repackage it for the 3.2 GHz / 800 FSB desktop with less stringent thermal and reliability requirements than the big iron, anyway?
Intel would gain a lot with this move. If, touch wood, there are problems with Prescott, a large-cache Pentium4 part will provide some buffer against large-cache Athlon64 (i.e. rebadged Opteron) parts. At the same time, enormous extra benefits from the economies of scale would further reduce the identical die XeonMP manufacturing cost, helping Intel compete better on the quad-CPU server front as well. Interesting move? I think so. Let's see how the beast performs in real!
$740 in 1,000 unit quantities. I think I'll pass.
The difference is that all the existing apps would need to be recompiled to fully use 64-bit mode. Even lowly DOS can benefit from a larger cache. And with Hyperthreading the number of clocks per instruction is very small, which lends itself to using a larger cache more often.
See also:
Ars Technica on Caching
Some interesting quotes:
"The performance boost is awesome," Burns said Tuesday during a speech at the Intel Developer Forum here.
"It is a Xeon with a different pin-out, or least that's what it looks like to me," said Nathan Brookwood, an analyst at Insight 64.
Intel did not disclose the price of the Pentium 4 Extreme Edition. It likely will be as expensive as its counterpart, the 2.8GHz Xeon with 2MB cache. That chip sells for $3,692 in quantities of 1,000.
"It absolutely will be kind of pricey," Brookwood said.
I'm looking for a HEPA media filter for my TV. I'm allergic to reality shows.
Nope, the only 'desktop' 64 bit processors come from IBM and AMD;
AMD Opteron
AMD Athlon64
IBM PPC970
Intel's 64 bit solution is the Itanium! Anything with the Pentium moniker is 32 bit. The Itanium is the one which suffers 32 bit emulation lag.
So if you want 64 bit, you're stuck with, realistically, a Mac or some brand of Athlon CPU.
GPL Deconstructed
The processor-intensive software is already here. It is called HSpice, Verilog, fluid-dynamics simulation, etc. The Pentium 4 has done nicely in the engineering workstation market, and the "Extreme Edition" should do even better.
Please check the SPEC web site for a performance evaluation of the Pentium 4's floating-point (FP) performance. In particular, it outperforms the UltraSPARC III even though the latter has a 2-to-1 advantage in the width of its databus -- 64 bits versus 32 bits.
What changed the x86 chips from also-ran losers in FP performance to the kings of the hill? SSE.
The SSE extension to the x86 instruction set architecture (ISA) opened up a whole new world of applications for the Pentium III and successors. Older Pentiums were saddled with a FP stack that hurt their performance. The SSE extension established a directly addressable bank of eight 128-bit registers, each holding four 32-bit floats (32 single-precision values in all), for FP operations. As a result, the Pentium 4 outperforms the UltraSPARC III on video applications.
At 3.2 GHz, the "Extreme Edition" of the Pentium 4 should help the Pentium 4 to capture even more of the engineering workstation market. Nowadays, the first-choice workstation among engineers in Silicon Valley and Boston's Route 128 is Linux running on a fast Pentium/Athlon, not Solaris lumbering on a slow UltraSPARC III.
The second part of their article is here.
Thank you. Drive through.
Computer architecture 101.
Average memory access latency per memory access =
(L1_hit_rate * L1_hit_cycle_time) +
(L1_miss_L2_hit_rate * L2_hit_cycle_time) +
(L2_miss_L3_hit_rate * L3_hit_cycle_time) +
(L3_miss_rate * DRAM_latency)
80-95% of your accesses will hit the 8k L1 in typical applications. This is the vast majority of the accesses. The latency of this cache is TINY on a P4. Do the math for a 3.2GHz 3 cycle cache.
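For anyone who wants to actually do the math, here is a rough Python sketch of the formula above. Only the 3.2 GHz clock and the 3-cycle L1 come from the post; the hit fractions and the L2, L3 and DRAM latencies are made-up illustrative numbers, not measured P4 figures, and the function names are just for this example.

```python
# Worked example of the average-access-latency formula.
# Only the 3.2 GHz clock and the 3-cycle L1 come from the post;
# every other number below is an illustrative assumption.

CLOCK_GHZ = 3.2

def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert a cycle count to nanoseconds at the given clock."""
    return cycles / clock_ghz

def average_access_cycles(levels):
    """levels: list of (fraction_of_all_accesses, latency_in_cycles)."""
    total = sum(fraction for fraction, _ in levels)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * latency for fraction, latency in levels)

EXAMPLE_LEVELS = [
    (0.90, 3),    # L1 hit
    (0.07, 18),   # L1 miss, L2 hit (assumed)
    (0.02, 40),   # L2 miss, L3 hit (assumed)
    (0.01, 300),  # L3 miss, goes to DRAM (assumed)
]

avg = average_access_cycles(EXAMPLE_LEVELS)
print(f"L1 hit: {cycles_to_ns(3):.4f} ns")
print(f"average: {avg:.2f} cycles = {cycles_to_ns(avg):.3f} ns")
```

With these assumed numbers an L1 hit costs 0.9375 ns and the average comes out to 7.76 cycles. Note that the 1% of accesses that go all the way to DRAM contribute 3.0 cycles to that average, slightly more than the 2.7 cycles from the 90% that hit L1.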
Given a curve of cache-size vs. latency and hit rates for all the cache sizes, the optimal hierarchy is a simple optimization problem. I can assure you that this equation has been solved and the optimal hierarchy has been chosen (given the other obvious constraints of die size and power).
Quadrupling the L1 will double the latency and kill your average access time, making your chip almost certainly slower.
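A quick way to sanity-check that claim is to ask what hit rate the bigger, slower L1 would need just to break even. The sketch below collapses everything behind the L1 into a single average miss penalty. The 90% baseline hit rate and the 30-cycle penalty are assumptions picked for illustration; the 3-cycle and doubled (6-cycle) L1 latencies follow the figures in the post.

```python
# Break-even hit rate for a larger but slower L1.
# Two-level simplification: everything behind the L1 is one average penalty.

def amat(hit_rate, hit_cycles, miss_penalty_cycles):
    """Average access time in cycles for a single cache level."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles

def break_even_hit_rate(baseline_cycles, new_hit_cycles, miss_penalty_cycles):
    """Hit rate the slower cache needs to match baseline_cycles.

    Solves h * new_hit_cycles + (1 - h) * miss_penalty_cycles = baseline_cycles.
    A result above 1.0 means no hit rate can make up for the extra latency.
    """
    return (miss_penalty_cycles - baseline_cycles) / (miss_penalty_cycles - new_hit_cycles)

MISS_PENALTY = 30                     # assumed average cost of an L1 miss
baseline = amat(0.90, 3, MISS_PENALTY)  # small, fast L1 (assumed 90% hits)
needed = break_even_hit_rate(baseline, 6, MISS_PENALTY)

print(f"baseline: {baseline:.2f} cycles")
print(f"hit rate needed at 6 cycles: {needed:.4f}")
```

Under these assumptions the baseline averages 5.7 cycles, and the 6-cycle L1 would need a hit rate of about 101%, so it loses even if it never misses. The result is sensitive to the assumed penalty: with a much slower path behind the L1 the break-even point drops below 100%, which is why a fast L2 sitting behind a small L1 matters so much here.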
Bigger caches mean longer latencies. It's limited by the basic laws of physics. There's only so much distance you can traverse in a certain amount of time and larger caches have longer distances (meaning higher RC delays).
The reason we want larger outer level caches is because the DRAM_latency is enormous and has an impact on average access time. Hardware prefetching can also help to alleviate this problem. This solution is available on both Athlon and P4 chips and will only get better in the future because it is absolutely critical to hide this DRAM latency.
Ok, now to address the notion that more registers will improve performance...
You won't get as much performance out of more registers as you might think. First of all, when the compiler runs out of registers it spills the excess to the stack -- pushing it out with a store (spill) and reading it back in with a load (fill).
In modern processors (just about every chip out on the market), there is the concept of store buffers. Each store writes its data to a store buffer. Subsequent loads that require data from stores get their data by forwarding out of the store buffer. So -- the spilled store writes the buffer and the fill load reads the buffer -- all of this happening much faster than a memory access because it's just reading out a local on-chip buffer, so the load looks more like a fast register read. This architectural trick emulates the effect of having more registers, subject to the size of your store buffer. There are even more advanced architectural tricks you can play to completely eliminate the spill-fill pair from the critical path (look up memory-renaming in the literature).
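The spill/fill forwarding described above can be sketched as a toy model in Python. This is not how any particular chip implements it: the `StoreBuffer` class, its capacity and the drain-oldest-first policy are all hypothetical, chosen only to show a load being satisfied from the buffer instead of from memory.

```python
# Toy model of store-to-load forwarding. Names and policy are illustrative only.

class StoreBuffer:
    def __init__(self, capacity=4):
        self.capacity = capacity
        # address -> value; dict insertion order tracks age (oldest first)
        self.pending = {}

    def store(self, memory, address, value):
        """Buffer a store; drain the oldest entry to memory if the buffer is full."""
        self.pending.pop(address, None)  # a newer store supersedes an older one
        if len(self.pending) >= self.capacity:
            oldest = next(iter(self.pending))
            memory[oldest] = self.pending.pop(oldest)
        self.pending[address] = value

    def load(self, memory, address):
        """Return (value, source); source says where the data came from."""
        if address in self.pending:
            return self.pending[address], "forwarded"
        return memory.get(address, 0), "memory"


memory = {}
buffer = StoreBuffer(capacity=2)

buffer.store(memory, 0x100, 42)      # spill a register to the stack
print(buffer.load(memory, 0x100))    # fill: served from the buffer

buffer.store(memory, 0x104, 7)
buffer.store(memory, 0x108, 9)       # buffer full, oldest store drains
print(buffer.load(memory, 0x100))    # now the value comes from memory
```

The first load is forwarded straight out of the buffer, which is the case the post is talking about: a fill that closely follows its spill never waits on the cache or DRAM. Once the entry has drained, the same load goes to memory, which is the "subject to the size of your store buffer" caveat.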
If you're worried about chip-real estate, you should be very concerned that a 64-bit application's pointers will take up twice as much space effectively making your caches and memory bandwidth appear smaller.
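To put a rough number on the pointer-size point, here is a small Python sketch. It assumes a hypothetical tree node with a 4-byte payload and two pointers, padded to pointer alignment, and a 512 KB cache picked purely for illustration; real layouts depend on the compiler and the data structure.

```python
# How many pointer-heavy nodes fit in a cache with 4-byte vs 8-byte pointers.
# The node layout and cache size are illustrative assumptions.

def node_bytes(pointer_bytes, payload_bytes=4, pointers_per_node=2):
    """Size of one node, rounded up to a multiple of the pointer size."""
    raw = payload_bytes + pointers_per_node * pointer_bytes
    alignment = pointer_bytes
    return -(-raw // alignment) * alignment  # ceiling division, then scale

def nodes_that_fit(cache_bytes, pointer_bytes):
    """Whole nodes that fit in a cache of the given size."""
    return cache_bytes // node_bytes(pointer_bytes)

CACHE_BYTES = 512 * 1024  # assumed cache size

for pointer_bytes in (4, 8):
    print(pointer_bytes * 8, "bit pointers:",
          node_bytes(pointer_bytes), "bytes per node,",
          nodes_that_fit(CACHE_BYTES, pointer_bytes), "nodes in cache")
```

For this particular layout the node grows from 12 to 24 bytes, so the same cache holds half as many nodes. Data that is mostly integers or floats would see far less of an effect, since only the pointers grow.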
Err, no. The internal data bus of the P4 is 256-bits wide, at least if you're talking about its L2 cache bus. L1 cache doesn't really have a "bus", especially not the P4's trace cache (its replacement for an L1 i-cache), but if my memory serves me correctly, the L1 d-cache of the P4 can read or write a pair of 64-byte values in 2 clock cycles. I guess that makes its "bus" 128 bytes (not bits) wide. I don't know the bus width of this new L3 cache on this P4 "Extreme", aka a XeonMP, but I would guess it's 64-bits wide.
I haven't got a clue as to the internal data bus of the USIII, but I would guess that it's either 128-bit or 256-bit wide. Side note: the Power4 uses a MASSIVE 1024-bit wide internal bus, one of the reasons for its impressive performance.
The only situation where the USIII has 64-bits and the P4 has 32-bits is if you are talking about integer registers or memory pointer width, neither of which are going to play a role in Spec CFP scores.
1) Get rid of 20-stage pipeline, it's too long for anything serious.
No, it's not. In fact, according to this research, the P4 pipeline is not deep enough. That paper concludes that P4 performance could be improved by up to 90% by increasing the pipeline depth to around 50 stages and increasing the cache size.
Do you actually think that Intel didn't know the consequences of increasing the pipeline depth? The Intel engineers didn't just guess on the P4 architecture; it was a very deliberate design decision. Judging by the P4's performance gains, it was a pretty good decision, too.
"The defense of freedom requires the advance of freedom" - George W Bush