Intel Demos New P4 'Extreme Edition'
typobox43 writes "Louis Burns of Intel displayed a "high-definition video stream running on a 'mystery' desktop processor." This processor turned out to be the new Intel Pentium 4 Extreme Edition 3.20 GHz, with an extra 2 Megabytes of cache."
"Only in their dreams can men truly be free 'twas always thus, and always thus will be."
--Tom Schulman
Intel Developer Forum Cache for questions
By Nebojsa Novakovic: Tuesday 16 September 2003, 18:14
WHEN, AT today's IDF opening, Louis Burns demonstrated a high-definition video stream running on a "mystery" desktop processor, everyone must have thought it was the upcoming Prescott part. Wrong! It was the (also upcoming), previously unheard of, even at The Inq, Intel(R) Pentium(R) 4 processor Extreme Edition 3.20 GHz, with an extra 2 Megabytes of pron. In Intel's own words, "this new processor will be targeted at high-end gamers and computing power users."
As a matter of fact, 2MB of cache will help a lot for users whose apps (including games and such) have a lot of big cache-friendly *wink* pieces of code and data, but probably not the data-streaming intensive stuff. I do expect to see speedups anywhere from 2% to 20% depending on the application, maybe some more if using multithreading/multitasking (a large cache can hold code / data pieces from more threads).
However, this doesn't seem to be a new CPU in reality - after all, Intel is doing very well with its XeonMP 2.8 GHz 2 MB cache CPU, and how much effort does it really take to repackage it for the 3.2 GHz / 800 FSB desktop with less stringent thermal and reliability requirements than the big iron, anyway?
Intel would gain a lot with this move. If, touch wood, there are problems with Prescott, a large-cache Pentium4 part will provide some buffer against large-cache Athlon64 (i.e. rebadged Opteron) parts. At the same time, enormous extra benefits from the economies of scale would further reduce the identical die XeonMP manufacturing cost, helping Intel compete better on the quad-CPU server front as well. Interesting move? I think so. Let's see how the beast performs in real life!
Different pinout, so no.
$740 in 1,000 unit quantities. I think I'll pass.
The difference is that all the existing apps would need to be recompiled to fully use 64 bits. Even lowly DOS can see performance improvements from a larger cache. And with Hyperthreading the number of clocks per instruction is very small, which lends itself to using a larger cache more often.
See also:
Ars Technica on Caching
Some interesting quotes:
"The performance boost is awesome," Burns said Tuesday during a speech at the Intel Developer Forum here.
"It is a Xeon with a different pin-out, or least that's what it looks like to me," said Nathan Brookwood, an analyst at Insight 64.
Intel did not disclose the price of the Pentium 4 Extreme Edition. It likely will be as expensive as its counterpart, the 2.8GHz Xeon with 2MB cache. That chip sells for $3,692 in quantities of 1,000.
"It absolutely will be kind of pricey," Brookwood said.
I'm looking for a HEPA media filter for my TV. I'm allergic to reality shows.
Nope, the only 'desktop' 64 bit processors come from IBM and AMD;
AMD Opteron
AMD Athlon64
IBM PPC970
Intel's 64 bit solution is the Itanium! Anything with the Pentium moniker is 32 bit. The Itanium is the one which suffers 32 bit emulation lag.
So if you want 64 bit, you're stuck with, realistically, a Mac or some brand of Athlon CPU.
GPL Deconstructed
You're thinking of the Itanium. It used to run 32bit X86 under a hardware emulator, but that was about as fast as the Pentium MMX. Intel has since switched to using a software emulator, something like Transmeta does with the Crusoe, and it's actually faster than the hardware emulator, about the same speed as a Pentium III now.
The Xeon is a Pentium4 in different packaging and with SMP enabled. Actually, SMP is probably enabled with the Pentium4 too, but since there are no such motherboards and you can't plug them into Xeon DP mobos, nobody can test that. Xeons already come in versions that have up to 8MB of L3 cache; the new Pentium 4 is probably just a rebadged Xeon certified to run on an 800MHz bus.
You're confused.
The Xeon series has always been Intel's "server" chips. Mostly a different pin out and lots more cache. They're souped up versions of the normal chips.
The Itanium is the 64-bit unit.
Learning HOW to think is more important than learning WHAT to think.
The processor-intensive software is already here. It is called HSpice, Verilog, fluid-dynamics simulation, etc. The Pentium 4 has done nicely in the engineering workstation market, and the "Extreme Edition" should do even better.
Please check the SPEC web site for a performance evaluation of the Pentium 4's floating-point (FP) performance. In particular, it outperforms the UltraSPARC III even though the latter has a 2-to-1 advantage in the width of its databus -- 64 bits versus 32 bits.
What changed the x86 chips from also-ran losers in FP performance to the kings of the hill? SSE.
The SSE extension to the x86 instruction set architecture (ISA) opened up a whole new world of applications for the Pentium III and successors. Older Pentiums were saddled with a FP stack that hurt their performance. The SSE extension established a directly addressable bank of 8 128-bit registers or 32 32-bit registers for FP operations. As a result, the Pentium 4 outperforms the UltraSPARC III on video applications.
At 3.2 GHz, the "Extreme Edition" of the Pentium 4 should help the Pentium 4 to capture even more of the engineering workstation market. Nowadays, the first-choice workstation among engineers in Silicon Valley and Boston's Route 128 is Linux running on a fast Pentium/Athlon, not Solaris lumbering on a slow UltraSPARC III.
The second part of their article is here.
Thank you. Drive through.
In particular, it outperforms the UltraSPARC III even though the latter has a 2-to-1 advantage in the width of its databus -- 64 bits versus 32 bits.
Err.. The P4 has a 64-bit data bus. The UltraSparcIII has quite a different databus (due to its integrated memory controller), but when you look at memory bandwidth, the USIII has 2.4GB/s of memory bandwidth while the P4 has 6.4GB/s.
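For what it's worth, the 6.4GB/s figure falls straight out of bus width times transfer rate. A minimal sketch, assuming the 800 MT/s effective front-side bus rate mentioned elsewhere in this thread (the function name is made up for illustration, and this is peak theoretical bandwidth, not what an application sustains):

```python
def peak_bandwidth_bytes_per_s(transfers_per_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth = transfers per second * bytes moved per transfer."""
    return transfers_per_s * bus_width_bits / 8


# Assumption: 800 million transfers/s on a 64-bit data bus (the "800 FSB" P4).
p4_peak = peak_bandwidth_bytes_per_s(800e6, 64)

# Hypothetical comparison: the same transfer rate on a 32-bit bus would halve it.
narrow_peak = peak_bandwidth_bytes_per_s(800e6, 32)

print(f"64-bit bus: {p4_peak / 1e9:.1f} GB/s")
print(f"32-bit bus: {narrow_peak / 1e9:.1f} GB/s")
```

So a 32-bit data bus at that rate could not reach 6.4GB/s, which is consistent with the P4 bus being 64 bits wide.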
What changed the x86 chips from also-ran losers in FP performance to the kings of the hill? SSE.
Less than 5% of SpecFP scores make use of SSE. The performance comes mainly from the P4 having a lot of memory bandwidth. The only chips with more memory bandwidth are the Alpha 21364, the Power4, the Itanium2 and the Opteron. Ohh, take a guess as to which chips get higher SpecFP scores than the Pentium4 does.
Computer architecture 101.
Average memory access latency per memory access =
(L1_hit_rate * L1_hit_cycle_time) +
(L1_miss_L2_hit_rate * L2_hit_cycle_time) +
(L2_miss_L3_hit_rate * L3_hit_cycle_time) +
(L3_miss_rate * DRAM_latency)
80-95% of your accesses will hit the 8k L1 in typical applications. This is the vast majority of the accesses. The latency of this cache is TINY on a P4. Do the math for a 3.2GHz 3 cycle cache.
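Here is that math as a rough Python sketch. Only the 3.2GHz clock and the 3-cycle L1 come from the post; the hit fractions and the L2/L3/DRAM latencies below are made-up placeholder numbers, so treat the result as an illustration of the formula, not a measurement.

```python
def average_access_cycles(levels, dram_fraction, dram_latency_cycles):
    """Weighted average latency, following the formula above.

    levels: list of (fraction_of_all_accesses_served, latency_in_cycles),
            one entry per cache level (L1, L2, L3).
    """
    total = sum(fraction for fraction, _ in levels) + dram_fraction
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1")
    cache_part = sum(fraction * latency for fraction, latency in levels)
    return cache_part + dram_fraction * dram_latency_cycles


CLOCK_HZ = 3.2e9   # from the post
L1_CYCLES = 3      # from the post

# Wall-clock latency of one L1 hit, in nanoseconds.
l1_hit_ns = L1_CYCLES / CLOCK_HZ * 1e9

# Assumed (illustrative) numbers: 90% L1, 7% L2, 2% L3, 1% DRAM.
levels = [(0.90, L1_CYCLES), (0.07, 18), (0.02, 40)]
avg_cycles = average_access_cycles(levels, dram_fraction=0.01, dram_latency_cycles=300)

print(f"L1 hit: {l1_hit_ns:.4f} ns")
print(f"average access: {avg_cycles:.2f} cycles")
```

Even with only 1% of accesses going to DRAM in this made-up mix, the DRAM term contributes more cycles than the 90% that hit L1.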
Given a curve of cache-size vs. latency and hit rates for all the cache sizes, the optimal hierarchy is a simple optimization problem. I can assure you that this equation has been solved and the optimal hierarchy has been chosen (given the other obvious constraints of die size and power).
Quadrupling the L1 will double the latency and kill your average access time, making your chip almost certainly slower.
Bigger caches mean longer latencies. It's limited by the basic laws of physics. There's only so much distance you can traverse in a certain amount of time, and larger caches have longer distances (meaning higher RC delays).
The reason we want larger outer-level caches is that the DRAM_latency is enormous and has a big impact on average access time. Hardware prefetching can also help to alleviate this problem. This solution is available on both Athlon and P4 chips and will only get better in the future because it is absolutely critical to hide this DRAM latency.
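To put a number on the size-versus-latency tradeoff, here is a two-level simplification in Python. Every figure except the 3-cycle L1 is an assumption picked for illustration: a 30-cycle average cost for anything that misses L1, a 90% baseline hit rate, and a 94% hit rate for the quadrupled cache.

```python
def avg_cycles(l1_hit_rate, l1_cycles, miss_cost_cycles):
    """hit_rate * hit_time + miss_rate * miss_cost, as in the formula above."""
    return l1_hit_rate * l1_cycles + (1.0 - l1_hit_rate) * miss_cost_cycles


MISS_COST = 30  # assumed average cost (cycles) of an access that misses L1

baseline = avg_cycles(0.90, 3, MISS_COST)  # small, fast L1
bigger = avg_cycles(0.94, 6, MISS_COST)    # 4x the size, 2x the latency (assumed)

# Hit rate the bigger cache would need just to match the baseline.
break_even_hit_rate = (MISS_COST - baseline) / (MISS_COST - 6)

print(f"baseline:  {baseline:.2f} cycles")
print(f"bigger L1: {bigger:.2f} cycles")
print(f"break-even hit rate: {break_even_hit_rate:.4f}")
```

With these particular assumptions the break-even hit rate comes out above 1.0, meaning no achievable hit rate makes the slower L1 a win. A much higher miss cost would change that, which is why the outer cache levels are where the extra capacity goes.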
Ok, now to address the notion that more registers will improve performance...
You won't get as much performance out of more registers as you might think. First of all, when the compiler runs out of registers it spills the excess to the stack -- pushing it out with a store (spill) and reading it back in with a load (fill).
In modern processors (just about every chip out on the market), there is the concept of store buffers. Each store writes its data to a store buffer. Subsequent loads that require data from those stores get their data by forwarding out of the store buffer. So -- the spilled store writes the buffer and the fill load reads the buffer -- all of this happening much faster than a memory access because it's just reading out a local on-chip buffer, so the load looks more like a fast register read. This architectural trick emulates the effect of having more registers, subject to the size of your store buffer. There are even more advanced architectural tricks you can play to completely eliminate the spill-fill pair from the critical path (look up memory-renaming in the literature).
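A toy Python model of the spill/fill forwarding described above, in case it helps. This is only a sketch: the class name, the capacity, and the oldest-first drain policy are all made up, and a real store buffer also deals with ordering, partial overlaps, and speculation, none of which is modeled here.

```python
from collections import OrderedDict


class ToyStoreBuffer:
    """Tiny model: stores sit in a small buffer; loads check it before memory."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = OrderedDict()  # address -> value, oldest entry first
        self.memory = {}              # stand-in for cache/DRAM
        self.forwarded_loads = 0
        self.memory_loads = 0

    def store(self, addr, value):
        if addr in self.pending:
            # Newer store to the same address replaces the older one.
            self.pending.pop(addr)
        elif len(self.pending) >= self.capacity:
            # Buffer full: drain the oldest store to memory.
            old_addr, old_value = self.pending.popitem(last=False)
            self.memory[old_addr] = old_value
        self.pending[addr] = value

    def load(self, addr):
        if addr in self.pending:
            # Store-to-load forwarding: no trip to memory needed.
            self.forwarded_loads += 1
            return self.pending[addr]
        self.memory_loads += 1
        return self.memory.get(addr, 0)


buf = ToyStoreBuffer(capacity=2)
buf.store(0x1000, 42)          # compiler spills a value to the stack
filled = buf.load(0x1000)      # fill comes straight out of the buffer

buf.store(0x1008, 1)
buf.store(0x1010, 2)           # buffer is full, so 0x1000 drains to memory
late_fill = buf.load(0x1000)   # now the fill has to go to memory
```

The second half shows the "subject to the size of your store buffer" caveat: once the spilled value has drained out, the fill is an ordinary memory access again.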
If you're worried about chip-real estate, you should be very concerned that a 64-bit application's pointers will take up twice as much space effectively making your caches and memory bandwidth appear smaller.
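A quick way to see the pointer-size effect, sketched with Python's `struct` module. The node layout (one 4-byte integer plus prev/next pointers) is a hypothetical example, and the `=` prefix means no alignment padding is counted; a real 64-bit compiler would typically pad this node further, making the gap bigger.

```python
import struct

# Hypothetical doubly linked list node: 4-byte int payload + prev/next pointers.
NODE_32 = struct.calcsize("=iII")  # two 4-byte pointers
NODE_64 = struct.calcsize("=iQQ")  # two 8-byte pointers

CACHE_BYTES = 2 * 1024 * 1024  # a 2MB cache


def nodes_that_fit(cache_bytes, node_bytes):
    """How many whole nodes fit in the cache, ignoring line and tag overhead."""
    return cache_bytes // node_bytes


fit_32 = nodes_that_fit(CACHE_BYTES, NODE_32)
fit_64 = nodes_that_fit(CACHE_BYTES, NODE_64)

print(f"32-bit node: {NODE_32} bytes, {fit_32} nodes fit")
print(f"64-bit node: {NODE_64} bytes, {fit_64} nodes fit")
```

How much the cache "shrinks" depends on how pointer-heavy the data is: the pointers themselves double, but the 4-byte payload does not, so this node grows by two thirds rather than doubling.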
Gaming involves millions of complex instructions, which have to be split into tiny bits to spread over the small bitwidth and register, using up clock cycles. 64 bit will allow you to do more stuff without register switching all the time
#1- Instructions are not "split up" and put into registers, they are fetched from the instruction cache and decoded.
#2- You do realize that none of that has to do with 64 bit computing, don't you? Sure, AMD added some extra GP registers in its Athlon64, but that has nothing to do with 64 bits. The ONLY thing that 64 bit computing adds is higher memory addressing. It does not change your performance or speed or IPC of the processor in any way.
SSE is single-precision (32bit) floats only, so pretty useless for scientific calculations (usually require doubles).
However, I believe the Intel compiler uses SSE2 (which can handle 64bit floats) exclusively for float code, since the P4 legacy FPU is just slow. Of course there are compiler switches so the code also runs on good old Athlon, Athlon XP, PIII (which lack SSE2; the Athlon also lacks SSE) - and those aren't exactly slow doing float calculations either.
You're both wrong. Xenon is a noble gas. Xeon is a 32 bit processor.
Don't blame me, I didn't vote for either of them!
He already explained it, it's not magic, it's mostly memory bandwidth. The size of that internal bus doesn't mean squat when it's sitting waiting for data from main memory.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
Err, no. The internal data bus of the P4 is 256-bits wide, at least if you're talking about its L2 cache bus. L1 cache doesn't really have a "bus", especially not the P4's trace cache (its replacement for an L1 i-cache), but if my memory serves me correctly, the L1 d-cache of the P4 can read or write a pair of 64-byte values in 2 clock cycles. I guess that makes its "bus" 128 bytes (not bits) wide. I don't know the bus width of this new L3 cache on this P4 "Extreme", aka a XeonMP, but I would guess it's 64-bits wide.
I haven't got a clue as to the internal data bus of the USIII, but I would guess that it's either 128-bit or 256-bit wide. Side note: the Power4 uses a MASSIVE 1024-bit wide internal bus, one of the reasons for its impressive performance.
The only situation where the USIII has 64-bits and the P4 has 32-bits is if you are talking about integer registers or memory pointer width, neither of which are going to play a role in Spec CFP scores.
In the next year we'll see the first solid state hard drives (Some that will run fast or faster than the processor) and faster RAM that would run the same speed as the processor.
Cache on a processor would be redundant if you can access the RAM at the same speeds. AMD is aware of this and are working to make compatible products.
Solid state drive/memory that runs at speeds comparable to the processor will probably reduce the need for what we call RAM these days, and operating systems could just use the drive as RAM.
If you're thinking of buying the latest and greatest I'd wait. Many things are about to happen and it'll be worth just keeping the money in the bank for now. Most people in the 2GHz range don't need any upgrades right now unless it's in the graphics department. I bought an ATI 9700 Pro last January and it made more of a difference in my games than having any faster processor could.
Let's look at some numbers... :)
What is the standard time slice quantum for windows and linux typically? That is to say, what is the typical rate of context switches? If I recall correctly, it's on the order of 100 per second.
That's 1 context switch every 0.01 seconds. Let's suppose now that I have a typical P4 system with 6.4GB/sec of memory bandwidth. I can fill the entire 2MB cache in roughly
0.002 GB / 6.4 GB/sec = 0.0003125 seconds
That's only 3% of the entire time slice quantum! That's assuming the thread will want all of the cache (which is unlikely for many apps).
So, yes you may be running 25 programs, but they only switch between each other 100 times a second. The cache re-fill only takes 3% of a minimum quantum, leaving 97% of the time left to read the filled data out of your caches (which is highly likely due to the principle of locality and LRU replacement policies).
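The same back-of-the-envelope calculation in Python, for anyone who wants to plug in different numbers. The inputs are the ones used above (2MB taken as 0.002 GB, 6.4GB/sec, 100 switches per second); it assumes the refill runs at peak bandwidth, which a real access pattern will not achieve.

```python
CACHE_BYTES = 2e6         # 2MB cache, decimal units as in the post (0.002 GB)
BANDWIDTH_BYTES_S = 6.4e9 # 6.4GB/sec peak memory bandwidth
SWITCHES_PER_S = 100      # assumed context-switch rate from the post

refill_s = CACHE_BYTES / BANDWIDTH_BYTES_S  # time to stream in a full cache
quantum_s = 1 / SWITCHES_PER_S              # length of one time slice
refill_fraction = refill_s / quantum_s      # share of the slice spent refilling

print(f"refill time: {refill_s * 1e3:.4f} ms")
print(f"time slice:  {quantum_s * 1e3:.1f} ms")
print(f"refill share of slice: {refill_fraction:.2%}")
```

Raising `SWITCHES_PER_S` to a few thousand, as a heavily threaded server workload might, pushes the refill share up quickly, which is the point made in the next paragraph.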
Now, if you're running something server-like such as TPC-C with a bazillion threads, then you'll probably context switch a lot more, but the P4 isn't designed for transactional servers. This is why server systems have larger caches and multiple processors and multiple SMT threads within each processor -- among many other reasons.
Putting the DRAM on die is a nice idea, but it's not cost effective at this time. You only have a finite die size. The cost of the die increases exponentially with area, so there's a hard cap on how big you can make the chip. The area that you would have spent on the DRAM could be used more profitably on other chip features. Perhaps in the future when we get many more transistors, it may be feasible.
The P4's architecture is such that it is incredibly sensitive to cache misses due to its long pipeline (20 stages, I believe). That's why higher memory bandwidth and larger caches make such a huge difference on the P4. The Athlons, by contrast, have a much shorter pipeline (12 stages, I believe), so the extra memory bandwidth and cache don't help out as much.
Lawyers, MBA's, RIAA? A jedi fears not these things!
1) Get rid of 20-stage pipeline, it's too long for anything serious.
No, it's not. In fact, according to this research, the P4 pipeline is not deep enough. That paper concludes that P4 performance could be improved by up to 90% by increasing the pipeline depth to around 50 stages and increasing the cache size.
Do you actually think that Intel didn't know the consequences of increasing the pipeline depth? The Intel engineers didn't just guess on the P4 architecture- it was a very deliberate design decision. Judging by the P4's performance gains, it was a pretty good decision, too.
"The defense of freedom requires the advance of freedom" - George W Bush
What is new is a 2MB L3 cache. It's made clear that this is in addition to the existing 512kB L2 cache on current P4s, making a total cache size of 2.5MB
1) Get rid of 20-stage pipeline, it's too long for anything serious.
No, it's not. It enables a high clock rate, and with good branch prediction and selective replays it is just fine.
2) As a follow up to 1, try to actually get some work done in a clock cycle.
Read some studies about the available ILP (instruction-level parallelism) in common applications. There really isn't much of it unless instruction windows are made huge, which isn't feasible. This is why simultaneous multithreading (hyperthreading) made it into an actual chip: it takes advantage of low ILP.
3) Throw out the x86 ISA.
Pentium4 is the leader of SpecINT. Not AMD, not Sun (RISC), not IBM (RISC), not MIPS (RISC). Some players, such as MIPS, don't have the resources to compete; IBM, however, does.
And look what's happening to Itanium. Disaster, even with its oh-so-elegant ISA.
4) Look at the MIPS ISA.
What about it? Yes, very clean and orthogonal. Intel and AMD have proven that an ISA is irrelevant to achieving high performance.
5) Realize that it's actually possible to understand the MIPS arch, and that it still works great for multimedia, math, and general use.
Realize that undergraduate computer architecture is simply an introduction.
6) Buy the rights to the MIPS ISA, make small improvements (get rid of branch delay slot, load delay slot), speed it up, and design new Intel processors from the improved ISA.
Unnecessary. Besides, I like software compatibility and binary translation just doesn't work at this level.
7) Release versions of processors with 4MB Cache (2MB each I$, D$) for consumers, and 24MB Cache (8MB I$, 16MB D$) for servers/clustering/etc.
2MB each for I$ and D$? Then you must be referring to L1 caches which are the only ones typically separated into data and instruction. And how many cycles would it take to access those caches?
8) Release Motherboards for 1, 2, and 4 CPU configurations.
9)
10) Profit!
Read Intel's annual report. They are quite profitable already and don't need advice from someone with a B.S. in EE or CS who marvels at how great the MIPS ISA is.
Well, honestly, nowadays "it beats a Sun box" doesn't even say much. It's just marginally more meaningful than "it beats my old ZX Spectrum."
The Suns still struggle barely above 1 GHz, have a slow cache, and so on. It also doesn't help that they're still saddled with SDRAM memory, too. (At least in the case of the cheaper workstations, on a 32 bit memory bus too.) If we're talking programs that draw something, it also doesn't help that they're saddled with outdated _and_ overpriced video cards. And so on.
Even without SSE, there's no way in heck for that UltraSparc III to keep up with a P4. E.g., Sun's Java doesn't even generate SSE code, and it still runs faster on Windows than on Solaris. Go figure.
For all the BS about the advantages of 64 bits, the reality is that in 64 bit mode an UltraSparc actually runs _slower_. So be thankful that most of the apps for it (and certainly all benchmarks) really are compiled in 32 bit mode.
Frankly, other than a few PHBs, and a couple of people who think they're some form of resistance against Wintel if they buy Suns, the rest of us don't even consider Sun to still be in the race any more.
So yeah, your words about running Linux on an Athlon or Pentium reflect exactly what I'd say to anyone considering a Sun box: Get the cheapest PC that Dell sells, or build your own Duron system, install Linux on it, and there you go. You now have a Unix workstation, and it runs circles around any of Sun's workstations. Or, much as I'm no Mac fan, get a Mac. It'll be 64 bit, and based on BSD too.
A polar bear is a cartesian bear after a coordinate transform.