Intel Demos New P4 'Extreme Edition'

Posted by ryuzaki0 on Wednesday September 17, 2003 @09:35AM from the extremely-lame-name dept.

typobox43 writes "Louis Burns of Intel displayed a "high-definition video stream running on a 'mystery' desktop processor." This processor turned out to be the new Intel Pentium 4 Extreme Edition 3.20 GHz, with an extra 2 Megabytes of cache."

text incase of /.ing by Anonymous Coward · 2003-09-17 09:41 · Score: 5, Informative

Intel Developer Forum Cache for questions

By Nebojsa Novakovic: Tuesday 16 September 2003, 18:14
WHEN, AT today's IDF opening, Louis Burns demonstrated a high-definition video stream running on a "mystery" desktop processor, everyone must have thought it was the upcoming Prescott part. Wrong! It was the (also upcoming), previously unheard of, even at The Inq, Intel(R) Pentium(R) 4 processor Extreme Edition 3.20 GHz, with an extra 2 Megabytes of pron. In Intel's own words, "this new processor will be targeted at high-end gamers and computing power users."

As a matter of fact, 2MB of cache will help a lot for users whose apps (including games and such) have a lot of big cache-friendly *wink* pieces of code and data, but probably not for the data-streaming-intensive stuff. I do expect to see speedups anywhere from 2% to 20% depending on the application, maybe some more if using multithreading/multitasking (a large cache can hold code / data pieces from more threads).

However, this doesn't seem to be a new CPU in reality - after all, Intel is doing very well with its XeonMP 2.8 GHz 2 MB cache CPU, and how much effort does it really take to repackage it for the 3.2 GHz / 800 FSB desktop with less stringent thermal and reliability requirements than the big iron, anyway?

Intel would gain a lot with this move. If, touch wood, there are problems with Prescott, a large-cache Pentium4 part will provide some buffer against large-cache Athlon64 (i.e. rebadged Opteron) parts. At the same time, enormous extra benefits from the economies of scale would further reduce the identical die XeonMP manufacturing cost, helping Intel compete better on the quad-CPU server front as well. Interesting move? I think so. Let's see how the beast performs in real!

Extreme price... by Anonymous Coward · 2003-09-17 09:47 · Score: 5, Informative

$740 in 1,000 unit quantities. I think I'll pass.

Part 2 of Article up now by dubiousdave · 2003-09-17 10:31 · Score: 5, Informative

The second part of their article is here.

--
Thank you. Drive through.

Re:Tom's Hardware reviewed a similar Xeon... by akuma(x86) · 2003-09-17 11:11 · Score: 5, Informative

Computer architecture 101.

Average memory access latency per memory access =
(L1_hit_rate * L1_hit_cycle_time) +
(L1_miss_L2_hit_rate * L2_hit_cycle_time) +
(L2_miss_L3_hit_rate * L3_hit_cycle_time) +
(L3_miss_rate * DRAM_latency)

80-95% of your accesses will hit the 8k L1 in typical applications. This is the vast majority of the accesses. The latency of this cache is TINY on a P4. Do the math for a 3.2GHz 3 cycle cache.
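
A minimal sketch of the formula above in Python, for anyone who wants to plug in numbers. Only the 3-cycle L1 at 3.2 GHz comes from the post; the shares of accesses and the L2, L3, and DRAM latencies are made-up illustrative values, not measurements. Each share is the fraction of all accesses served at that level, so the four shares sum to 1.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / clock_ghz


def average_access_cycles(fractions, latencies):
    """Weighted average latency: sum of (share of accesses * latency) per level."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(f * t for f, t in zip(fractions, latencies))


# From the post: a 3-cycle L1 on a 3.2 GHz part.
l1_ns = cycles_to_ns(3, 3.2)  # just under one nanosecond

# Hypothetical shares of accesses served by L1, L2, L3, DRAM (assumed).
fractions = [0.90, 0.07, 0.02, 0.01]
# Hypothetical latencies in cycles for L1, L2, L3, DRAM (only L1 is from the post).
latencies = [3, 20, 40, 300]

avg = average_access_cycles(fractions, latencies)
print(f"L1 hit: {l1_ns:.4f} ns, average access: {avg:.1f} cycles")
```

Even with only 1% of accesses going to DRAM in these assumed numbers, that last term contributes more cycles than the 90% of accesses that hit the L1.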

Given a curve of cache-size vs. latency and hit rates for all the cache sizes, the optimal hierarchy is a simple optimization problem. I can assure you that this equation has been solved and the optimal hierarchy has been chosen (given the other obvious constraints of die size and power).

Quadrupling the L1 will double the latency and kill your average access time, making your chip almost certainly slower.
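
A rough two-level sketch of that trade-off: an L1 backed by a single fixed average miss cost. All numbers here are assumptions picked for illustration (90% hit rate at 3 cycles, misses served in 30 cycles on average); the only thing taken from the post is that the bigger L1 has double the latency. The helper computes the hit rate the bigger cache would need just to break even.

```python
def amat(hit_rate, hit_cycles, miss_cycles):
    """Two-level average access time: hits cost hit_cycles, misses cost miss_cycles."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


def break_even_hit_rate(base_amat, new_hit_cycles, miss_cycles):
    """Hit rate a slower cache needs to match base_amat (above 1.0 means impossible)."""
    return (miss_cycles - base_amat) / (miss_cycles - new_hit_cycles)


# Assumed baseline: 90% hits at 3 cycles, misses served in 30 cycles on average.
base = amat(0.90, 3, 30)

# Quadrupled L1 with doubled latency, per the post; miss cost unchanged (assumed).
needed = break_even_hit_rate(base, 6, 30)

print(f"baseline: {base:.2f} cycles, break-even hit rate: {needed:.4f}")
if needed > 1.0:
    print("no hit rate is high enough: even a perfect 6-cycle L1 is slower")
```

With a larger assumed miss cost the break-even hit rate drops below 1.0, so the conclusion depends on the numbers; the sketch only shows the shape of the trade-off.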

Bigger caches mean longer latencies. It's limited by the basic laws of physics. There's only so much distance you can traverse in a certain amount of time and larger caches have longer distances (meaning higher RC delays).

The reason we want larger outer-level caches is that the DRAM_latency is enormous and has an impact on average access time. Hardware prefetching can also help to alleviate this problem. This solution is available on both Athlon and P4 chips and will only get better in the future because it is absolutely critical to hide this DRAM latency.

Ok, now to address the notion that more registers will improve performance...
You won't get as much performance out of more registers as you might think. First of all, when the compiler runs out of registers it spills the excess to the stack -- pushing it out with a store (spill) and reading it back in with a load (fill).

In modern processors (just about every chip out on the market), there is the concept of store buffers. Each store writes its data to a store buffer. Subsequent loads that require data from those stores get it by forwarding out of the store buffer. So -- the spilled store writes the buffer and the fill load reads the buffer -- all of this happening much faster than a memory access because it's just reading out a local on-chip buffer, so the load looks more like a fast register read. This architectural trick emulates the effect of having more registers, subject to the size of your store buffer. There are even more advanced architectural tricks you can play to completely eliminate the spill-fill pair from the critical path (look up memory-renaming in the literature).
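
A toy model of the forwarding idea, in Python. This is a sketch of the concept only, not how any particular chip implements it: the class name, the fixed capacity, the drain-oldest policy, and keeping one pending entry per address are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyStoreBuffer:
    """Toy store buffer: recent stores are held here and forwarded to later loads."""

    def __init__(self, memory, capacity=4):
        self.memory = memory          # dict standing in for cache/DRAM
        self.capacity = capacity      # assumed size; real buffers differ
        self.pending = OrderedDict()  # address -> value, oldest first

    def store(self, addr, value):
        # A newer store to the same address replaces the older pending one.
        self.pending.pop(addr, None)
        self.pending[addr] = value
        if len(self.pending) > self.capacity:
            # Buffer full: drain the oldest store to memory.
            old_addr, old_value = self.pending.popitem(last=False)
            self.memory[old_addr] = old_value

    def load(self, addr):
        # Returns (value, forwarded). A hit in the buffer skips memory entirely.
        if addr in self.pending:
            return self.pending[addr], True
        return self.memory.get(addr, 0), False


memory = {}
buf = ToyStoreBuffer(memory)
buf.store(0x1000, 42)                # spill: register value pushed to a stack slot
value, forwarded = buf.load(0x1000)  # fill: read it back
print(value, forwarded)
```

In this model the fill is served from the buffer and the backing `memory` dict is never touched, which is the "looks like a register read" effect. Once more than `capacity` stores are pending, the oldest drains to memory and a later load of it is no longer forwarded.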

If you're worried about chip real estate, you should be very concerned that a 64-bit application's pointers will take up twice as much space, effectively making your caches and memory bandwidth appear smaller.
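
A quick back-of-the-envelope sketch of that effect in Python. The node layout is hypothetical (one 32-bit key plus two pointers, packed with no padding), and real compilers usually add alignment padding that makes the 64-bit case larger still. The 2 MB figure is the cache size from the story.

```python
import struct

# Hypothetical tree node: one 32-bit key plus left/right pointers, packed with
# no padding ("<" selects standard sizes and no alignment).
NODE_32 = struct.calcsize("<iII")  # 4-byte pointers -> 12 bytes
NODE_64 = struct.calcsize("<iQQ")  # 8-byte pointers -> 20 bytes

CACHE_BYTES = 2 * 1024 * 1024  # the 2 MB cache size from the story

nodes_32 = CACHE_BYTES // NODE_32
nodes_64 = CACHE_BYTES // NODE_64

print(f"{NODE_32}-byte nodes: {nodes_32} fit; {NODE_64}-byte nodes: {nodes_64} fit")
```

For this pointer-heavy layout the same cache holds roughly 40% fewer nodes with 64-bit pointers; data structures with few pointers would see a much smaller difference.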