Slashdot Mirror


Intel Demos New P4 'Extreme Edition'

typobox43 writes "Louis Burns of Intel displayed a "high-definition video stream running on a 'mystery' desktop processor." This processor turned out to be the new Intel Pentium 4 Extreme Edition 3.20 GHz, with an extra 2 Megabytes of cache."

27 of 393 comments

  1. Level Three Cache by Master Bait · · Score: 4, Informative
    Ho hum. I suppose if it was level two cache, Intel would have said so very loudly, so they just call it 'cache'.

    --
    "Only in their dreams can men truly be free 'twas always thus, and always thus will be."
    --Tom Schulman
    1. Re:Level Three Cache by philthedrill · · Score: 2, Informative

      The tradeoff with cache hierarchy is access time on hits vs. size.

      You could increase the size of your L2 (or L1 for that matter), but you do this at the risk of sacrificing cycle time. This is part of the reason the P4 has such a tiny L1 D-cache.

      With clock speeds climbing higher, the amount of time for a signal to traverse across a chip is no longer trivial, so retrieving data within N clock cycles is unrealistic with a large cache.

      To add to that, the benefit of an L3 hit (even though it's much slower than L1 or L2) is that it's still much faster than main memory. DRAM is built for capacity.

      Adding cache is somewhat of an easy way out in terms of adding performance with your transistor budget. You keep your power density reasonable and you also don't change the microarchitecture.

      In conclusion, I think drugs are bad. The end.

  2. text in case of /.ing by Anonymous Coward · · Score: 5, Informative

    Intel Developer Forum Cache for questions

    By Nebojsa Novakovic: Tuesday 16 September 2003, 18:14
    WHEN, AT today's IDF opening, Louis Burns demonstrated a high-definition video stream running on a "mystery" desktop processor, everyone must have thought it was the upcoming Prescott part. Wrong! It was the (also upcoming), previously unheard of, even at The Inq, Intel(R) Pentium(R) 4 processor Extreme Edition 3.20 GHz, with an extra 2 Megabytes of cache. In Intel's own words, "this new processor will be targeted at high-end gamers and computing power users."

    As a matter of fact, 2MB cache will help a lot those users whose apps (including games and such) have a lot of big cache-friendly *wink* pieces of code and data, but probably not the data-streaming intensive stuff. I do expect to see speedups anywhere from 2% to 20% depending on the application, maybe some more if using multithreading/multitasking (large cache can keep in code / data pieces from more threads).

    However, this doesn't seem to be a new CPU in reality - after all, Intel is doing very well with its XeonMP 2.8 GHz 2 MB cache CPU, and how much effort does it really take to repackage it for the 3.2 GHz / 800 FSB desktop with less stringent thermal and reliability requirements than the big iron, anyway?

    Intel would gain a lot with this move. If, touch wood, there are problems with Prescott, a large-cache Pentium4 part will provide some buffer against large-cache Athlon64 (i.e. rebadged Opteron) parts. At the same time, enormous extra benefits from the economies of scale would further reduce the identical die XeonMP manufacturing cost, helping Intel compete better on the quad-CPU server front as well. Interesting move? I think so. Let's see how the beast performs in real!

  3. Re:Multiprocessor? by loopWork · · Score: 3, Informative

    Different pinout, so no.

  4. Extreme price... by Anonymous Coward · · Score: 5, Informative

    $740 in 1,000 unit quantities. I think I'll pass.

  5. Re:More impressed with AMD. by C. Mattix · · Score: 4, Informative

    The difference is that all the existing apps would need to be recompiled to fully use the 64bit. Even lowly DOS can use performance improvements with a larger cache. And with Hyperthreading the number of clocks per instruction is very small, this lends itself to using a larger cache more often.

    See also:

    Ars Technica on Caching

  6. CNET article with more details by HeroicAutobot · · Score: 4, Informative
    CNET has an article with more details (or speculation more likely).

    Some interesting quotes:

    "The performance boost is awesome," Burns said Tuesday during a speech at the Intel Developer Forum here.

    "It is a Xeon with a different pin-out, or least that's what it looks like to me," said Nathan Brookwood, an analyst at Insight 64.

    Intel did not disclose the price of the Pentium 4 Extreme Edition. It likely will be as expensive as its counterpart, the 2.8GHz Xeon with 2MB cache. That chip sells for $3,692 in quantities of 1,000.

    "It absolutely will be kind of pricey," Brookwood said.

    --
    I'm looking for a HEPA media filter for my TV. I'm allergic to reality shows.
  7. Re:64bit vs 32bit by 2nd Post! · · Score: 4, Informative

    Nope, the only 'desktop' 64 bit processors come from IBM and AMD;
    AMD Opteron
    AMD Athlon64
    IBM PPC970

    Intel's 64 bit solution is the Itanium! Anything with the Pentium moniker is 32 bit. The Itanium is the one which suffers 32 bit emulation lag.

    So if you want 64 bit, you're stuck with, realistically, a Mac or some brand of Athlon CPU.

  8. Re:64bit vs 32bit by Naito · · Score: 3, Informative

    You're thinking of the Itanium. It used to run 32bit X86 under a hardware emulator, but that was about as fast as the Pentium MMX. Intel has since switched to using a software emulator, something like Transmeta does with the Crusoe, and it's actually faster than the hardware emulator, about the same speed as a Pentium III now.

    The Xeon is a Pentium4 in different packaging and with SMP enabled. Actually, SMP is probably enabled with the Pentium4 too, but since there are no such motherboards and you can't plug them into Xeon DP mobos, nobody can test that. Xeons already come in versions that have up to 8MB of L3 cache, the new Pentium 4 is probably just a rebadged Xeon certified to run on an 800MHz bus.

  9. Re:64bit vs 32bit by chill · · Score: 2, Informative

    You're confused.

    The Xeon series has always been Intel's "server" chips. Mostly a different pin out and lots more cache. They're souped up versions of the normal chips.

    The Itanium is the 64-bit unit.

    --
    Learning HOW to think is more important than learning WHAT to think.
  10. Processor-Intensive SW: Engineering Applications by reporter · · Score: 4, Informative
    This "extreme" version of the chip has to be aimed at a very niche market, at least for the next couple of years until more processor intensive software catches up.

    The processor-intensive software is already here. It is called HSpice, Verilog, fluid-dynamics simulation, etc. The Pentium 4 has done nicely in the engineering workstation market, and the "Extreme Edition" should do even better.

    Please check the SPEC web site for a performance evaluation of the Pentium 4's floating-point (FP) performance. In particular, it outperforms the UltraSPARC III even though the latter has a 2-to-1 advantage in the width of its databus -- 64 bits versus 32 bits.

    What changed the x86 chips from also-ran losers in FP performance to the kings of the hill? SSE.

    The SSE extension to the x86 instruction set architecture (ISA) opened up a whole new world of applications for the Pentium III and successors. Older Pentiums were saddled with a FP stack that hurt their performance. The SSE extension established a directly addressable bank of 8 128-bit registers or 32 32-bit registers for FP operations. As a result, the Pentium 4 outperforms the UltraSPARC III on video applications.

    At 3.2 GHz, the "Extreme Edition" of the Pentium 4 should help the Pentium 4 to capture even more of the engineering workstation market. Nowadays, the first-choice workstation among engineers in Silicon Valley and Boston's Route 128 is Linux running on a fast Pentium/Athlon, not Solaris lumbering on a slow UltraSPARC III.

    ... from the desk of the reporter

  11. Part 2 of Article up now by dubiousdave · · Score: 5, Informative

    The second part of their article is here.

    --
    Thank you. Drive through.
  12. Re:Processor-Intensive SW: Engineering Application by Hoser McMoose · · Score: 2, Informative

    In particular, it outperforms the UltraSPARC III even though the latter has a 2-to-1 advantage in the width of its databus -- 64 bits versus 32 bits.

    Err.. The P4 has a 64-bit data bus. The UltraSparcIII has quite a different databus (due to its integrated memory controller), but when you look at memory bandwidth, the USIII has 2.4GB/s of memory bandwidth while the P4 has 6.4GB/s.
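
    The 6.4GB/s figure follows from bus width times transfer rate. A minimal Python sketch, assuming the 64-bit data bus stated above and an effective 800 million transfers per second for the 800 FSB mentioned elsewhere in this thread (the function name is made up for illustration):

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, transfers_per_second: float) -> float:
    """Peak bus bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


# Assumption: the 800 FSB moves one 64-bit word per effective transfer.
p4_fsb = peak_bandwidth_gb_per_s(bus_width_bits=64, transfers_per_second=800e6)
print(f"P4 800 FSB peak bandwidth: {p4_fsb:.1f} GB/s")
```

    This is a theoretical peak; sustained bandwidth in real workloads is lower.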

    What changed the x86 chips from also-ran losers in FP performance to the kings of the hill? SSE.

    Less than 5% of SpecFP scores make use of SSE. The performance comes mainly from the P4 having a lot of memory bandwidth. The only chips with more memory bandwidth are the Alpha 21364, the Power4, the Itanium2 and the Opteron. Ohh, take a guess as to which chips get higher SpecFP scores than the Pentium4 does.

  13. Re:Tom's Hardware reviewed a similar Xeon... by akuma(x86) · · Score: 5, Informative

    Computer architecture 101.

    Average memory access latency per memory access =
    (L1_hit_rate * L1_hit_cycle_time) +
    (L1_miss_L2_hit_rate * L2_hit_cycle_time) +
    (L2_miss_L3_hit_rate * L3_hit_cycle_time) +
    (L3_miss_rate * DRAM_latency)

    80-95% of your accesses will hit the 8k L1 in typical applications. This is the vast majority of the accesses. The latency of this cache is TINY on a P4. Do the math for a 3.2GHz 3 cycle cache.
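
    The formula above can be written out directly. In this rough Python sketch only the 3.2GHz clock and the 3-cycle L1 come from the post; the hit fractions and the L2, L3, and DRAM latencies are placeholder assumptions picked purely for illustration:

```python
CLOCK_HZ = 3.2e9  # from the post


def average_access_cycles(levels, dram_fraction, dram_cycles):
    """Weighted average latency in cycles.

    levels: list of (fraction_of_all_accesses_served_here, latency_in_cycles)
    dram_fraction: fraction of accesses that miss every cache level
    """
    total_fraction = sum(fraction for fraction, _ in levels) + dram_fraction
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    cached = sum(fraction * cycles for fraction, cycles in levels)
    return cached + dram_fraction * dram_cycles


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock."""
    return cycles / clock_hz * 1e9


# "Do the math": a 3-cycle L1 hit at 3.2GHz takes just under 1 ns.
l1_hit_ns = cycles_to_ns(3)

# Hypothetical hierarchy: (fraction served, latency in cycles) for L1, L2, L3.
levels = [(0.90, 3), (0.07, 18), (0.02, 40)]
avg_cycles = average_access_cycles(levels, dram_fraction=0.01, dram_cycles=300)

print(f"L1 hit: {l1_hit_ns:.4f} ns")
print(f"Average access: {avg_cycles:.2f} cycles")
```

    Even with only 1% of accesses going to DRAM in these made-up numbers, the DRAM term contributes more cycles than the L1 term, which is the argument for large outer caches.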

    Given a curve of cache-size vs. latency and hit rates for all the cache sizes, the optimal hierarchy is a simple optimization problem. I can assure you that this equation has been solved and the optimal hierarchy has been chosen (given the other constraints of obviously die-size and power).

    Quadrupling the L1 will double the latency and kill your average access time, making your chip almost certainly slower.

    Bigger caches mean longer latencies. It's limited by the basic laws of physics. There's only so much distance you can traverse in a certain amount of time and larger caches have longer distances (meaning higher RC delays).

    The reason we want larger outer level caches is because the DRAM_latency is enormous and has an impact on average access time. Hardware prefetching can also help to alleviate this problem - This solution is available on both Athlon and P4 chips and will only get better in the future because it is absolutely critical to hide this DRAM latency.

    Ok, now to address the notion that more registers will improve performance...
    You won't get as much performance out of more registers as you might think. First of all, when the compiler runs out of registers it spills the excess to the stack -- pushing it out with a store (spill) and reading it back in with a load (fill).

    In modern processors (just about every chip out on the market), there is the concept of store buffers. Each store writes its data to a store buffer. Subsequent loads that require data from stores, get their data by forwarding out of the store buffer. So -- the spilled store writes the buffer and the fill load reads the buffer -- all of this happening much faster than a memory access because it's just reading out a local on-chip buffer, so the load looks more like a fast register read. This architectural trick emulates the effect of having more registers, subject to the size of your store buffer. There are even more advanced architectural tricks you can play to completely eliminate the spill-fill pair from the critical path (look up memory-renaming in the literature).
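
    A toy Python model of store-to-load forwarding may make the spill/fill point concrete. It is only a sketch: the class, its capacity, and the oldest-first drain policy are illustrative assumptions, not a description of any real chip's store buffer.

```python
from collections import OrderedDict


class StoreBuffer:
    """Toy store buffer: pending stores stay on chip until drained."""

    def __init__(self, capacity, memory):
        self.capacity = capacity
        self.memory = memory          # dict standing in for cache/DRAM
        self.pending = OrderedDict()  # address -> value, oldest first

    def store(self, address, value):
        # A newer store to the same address supersedes the older entry.
        self.pending.pop(address, None)
        if len(self.pending) >= self.capacity:
            # Buffer full: drain the oldest pending store to memory.
            old_address, old_value = self.pending.popitem(last=False)
            self.memory[old_address] = old_value
        self.pending[address] = value

    def load(self, address):
        """Return (value, forwarded); forwarded is True on a buffer hit."""
        if address in self.pending:
            return self.pending[address], True
        return self.memory.get(address, 0), False


memory = {}
buffer = StoreBuffer(capacity=2, memory=memory)
buffer.store(0x1000, 42)                # the compiler's spill
value, forwarded = buffer.load(0x1000)  # the matching fill
print(value, forwarded)
```

    The fill is satisfied from the buffer without touching memory, which is why a spill/fill pair behaves more like a register access, as long as the spilled value has not yet been drained.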

    If you're worried about chip-real estate, you should be very concerned that a 64-bit application's pointers will take up twice as much space effectively making your caches and memory bandwidth appear smaller.
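
    To put a rough number on the pointer-size effect, here is a small Python sketch. The node layout (a 4-byte payload plus two pointers, no alignment padding) is a hypothetical example; the 512kB figure is the P4 L2 size mentioned elsewhere in this thread.

```python
import struct

POINTER_32 = struct.calcsize("<I")  # 4-byte pointer
POINTER_64 = struct.calcsize("<Q")  # 8-byte pointer


def node_bytes(pointer_size, payload_bytes=4, pointers_per_node=2):
    """Size of a hypothetical tree node, ignoring alignment padding."""
    return payload_bytes + pointers_per_node * pointer_size


def nodes_per_cache(cache_bytes, pointer_size):
    """How many whole nodes fit in a cache of the given size."""
    return cache_bytes // node_bytes(pointer_size)


L2_BYTES = 512 * 1024

for name, size in (("32-bit", POINTER_32), ("64-bit", POINTER_64)):
    print(f"{name}: {node_bytes(size)} bytes/node, "
          f"{nodes_per_cache(L2_BYTES, size)} nodes fit in L2")
```

    The pointers themselves double, but the whole structure grows by less than 2x because the payload stays the same size; how much cache is effectively lost depends on how pointer-heavy the data is.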

  14. Re:Pentium is dying! by Anonymous Coward · · Score: 1, Informative

    Gaming involves millions of complex instructions, which have to be split in to tiny bits to spread over the small bitwidth and register, using up clock cycles, 64 bit will allow you to do more stuff without register switching all the time

    #1- Instructions are not "split up" and put into registers, they are fetched from the instruction cache and decoded.

    #2- You do realize that none of that has to do with 64 bit computing, don't you? Sure, AMD added some extra GP registers in its Athlon64, but that has nothing to do with 64 bits. The ONLY thing that 64 bit computing adds is higher memory addressing. It does not change your performance or speed or IPC of the processor in any way.

  15. Re:Processor-Intensive SW: Engineering Application by mczak · · Score: 3, Informative

    SSE is single-precision (32bit) floats only, so pretty useless for scientific calculations (usually require doubles).
    However, I believe the intel compiler uses SSE2 (which can handle 64bit floats) exclusively for float code, since the P4 legacy fpu is just slow. Of course there are compiler switches for the compiler so the code also runs on good old Athlon, Athlon XP, PIII (which lack SSE2, the Athlon also lacks SSE) - and those aren't exactly slow doing float calculations either.

  16. Re:64bit vs 32bit by Brandybuck · · Score: 2, Informative

    You're both wrong. Xenon is a noble gas. Xeon is a 32 bit processor.

    --
    Don't blame me, I didn't vote for either of them!
  17. Re:Databus of Pentium 4 is 32 bits, not 64 bits. by Arker · · Score: 2, Informative

    He already explained it, it's not magic, it's mostly memory bandwidth. The size of that internal bus doesn't mean squat when it's sitting waiting for data from main memory.

    --
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Friends don't let friends enable ecmascript.
  18. Re:Databus of Pentium 4 is 32 bits, not 64 bits. by Hoser McMoose · · Score: 4, Informative

    Err, no. The internal data bus of the P4 is 256-bits wide, at least if you're talking about its L2 cache bus. L1 cache doesn't really have a "bus", especially not the P4's trace cache (its replacement for an L1 i-cache), but if my memory serves me correctly, the L1 d-cache of the P4 can read or write a pair of 64-byte values in 2 clock cycles. I guess that makes its "bus" 128 bytes (not bits) wide. I don't know the bus width of this new L3 cache on this P4 "Extreme", aka a XeonMP, but I would guess it's 64-bits wide.

    I haven't got a clue as to the internal data bus of the USIII, but I would guess that it's either 128-bit or 256-bit wide. Side note: the Power4 uses a MASSIVE 1024-bit wide internal bus, one of the reasons for its impressive performance.

    The only situation where the USIII has 64-bits and the P4 has 32-bits is if you are talking about integer registers or memory pointer width, neither of which are going to play a role in Spec CFP scores.

  19. Speed / Cache is irrelevant *soon* by Bruha · · Score: 1, Informative

    In the next year we'll see the first solid state hard drives (Some that will run fast or faster than the processor) and faster RAM that would run the same speed as the processor.

    Cache on a processor would be redundant if you can access the RAM at the same speeds. AMD is aware of this and are working to make compatible products.

    Solid state drive/memory that runs at compatible speeds as the processor will probably reduce the need for what we call ram these days and operating systems could just use the drive for its RAM.

    If you're thinking of buying the latest and greatest I'd wait. Many things are about to happen and it'll be worth just keeping the money in the bank for now. Most people in the 2GHz range don't need any upgrades right now unless it's in the graphics department. I bought an ATI 9700 Pro last January and it made more of a difference in my games than having any faster processor could.

    1. Re:Speed / Cache is irrelevant *soon* by darkwiz · · Score: 3, Informative
      In the next year we'll see the first solid state hard drives (Some that will run fast or faster than the processor) and faster RAM that would run the same speed as the processor.

      Cache on a processor would be redundant if you can access the RAM at the same speeds. AMD is aware of this and are working to make compatible products.


      No.... we won't. What you are describing is insane. Come on: 3.2GHz x 32 bits? Access/transfer times over a full scale bus with a latency in picoseconds? Um... no.
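
      For scale, the numbers behind "3.2GHz x 32 bits" work out as follows. This is a back-of-the-envelope Python sketch that assumes one 32-bit word delivered on every CPU clock, which is how the line above reads:

```python
CLOCK_HZ = 3.2e9   # CPU clock from the thread
WORD_BITS = 32     # word width from the line above

# Time available per access if RAM had to answer every cycle.
period_ps = 1 / CLOCK_HZ * 1e12

# Bandwidth needed to deliver one word per cycle, with 1 GB = 10**9 bytes.
bandwidth_gb_s = CLOCK_HZ * WORD_BITS / 8 / 1e9

print(f"Clock period: {period_ps:.1f} ps")
print(f"Required bandwidth: {bandwidth_gb_s:.1f} GB/s")
```

      That is a few hundred picoseconds per access and twice the 6.4GB/s memory bandwidth quoted for the P4 elsewhere in this thread.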

      There is a reason no one has done that yet - made system RAM the same speed as the CPU - and it ain't economics: it is physics. Nature does not take bribes.

      Look, it isn't that it is too expensive to make fast RAM. And it isn't the distance - it is the capacitance. The problem with fast RAM is getting that signal off chip to the CPU. And the wires that connect the RAM and CPU are orders of magnitude higher capacitance than the wires on chip. That is a fundamental problem which you won't overcome without a fundamental change in how you move the data around.


      Solid state drive/memory that runs at compatible speeds as the processor will probably reduce the need for what we call ram these days and operating systems could just use the drive for its RAM.


      Um.. no. Never will that be the case except in situations where using an archaically small amount of processing power is adequate. Storage technology, as it is formulated now, cannot approach the speed of access, communication, and storage that even a low grade CPU would use for cache.

      Maybe - MAYBE when we are using diamond wafers, high-temperature-superconductor-nanotube-quantum-dot wires, and other buzzwords.

  20. Re:Tom's Hardware reviewed a similar Xeon... by akuma(x86) · · Score: 2, Informative

    Let's look at some numbers...

    What is the standard time slice quantum for windows and linux typically? That is to say, what is the typical rate of context switches? If I recall correctly, it's on the order of 100 per second.

    That's 1 context switch every 0.01 seconds. Let's suppose now that I have a typical P4 system with 6.4GB/sec of memory bandwidth. I can fill the entire 2M cache in roughly

    0.002/6.4 seconds = 0.0003125 seconds

    That's only 3% of the entire time slice quantum! That's assuming the thread will want all of the cache (which is unlikely for many apps).

    So, yes you may be running 25 programs, but they only switch between each other 100 times a second. The cache re-fill only takes 3% of a minimum quantum, leaving 97% of the time left to read the filled data out of your caches (which is highly likely due to the principle of locality and LRU replacement policies).
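
    The same arithmetic as a short Python sketch, using only the figures from this post (2MB treated as 0.002GB, 6.4GB/s, 100 context switches per second); the function name is made up for illustration:

```python
def refill_fraction(cache_bytes, bandwidth_bytes_per_s, switches_per_s):
    """Return (refill time in seconds, fraction of one time slice it uses)."""
    quantum_s = 1 / switches_per_s
    refill_s = cache_bytes / bandwidth_bytes_per_s
    return refill_s, refill_s / quantum_s


refill_s, fraction = refill_fraction(
    cache_bytes=2e6,              # "2M" cache, as 0.002 GB
    bandwidth_bytes_per_s=6.4e9,  # 6.4GB/sec
    switches_per_s=100,           # ~100 context switches per second
)
print(f"Refill time: {refill_s * 1e6:.1f} microseconds")
print(f"Share of a time slice: {fraction:.1%}")
```

    This assumes the refill runs at peak bandwidth, so the real share is somewhat higher, but it stays a small part of the quantum.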

    Now, if you're running something server-like, such as TPC-C with a bazillion threads, then you'll probably context switch a lot more, but the P4 isn't designed for transactional servers. This is why server systems have larger caches and multiple processors and multiple SMT threads within each processor -- among many other reasons :)

    Putting the DRAM on die is a nice idea, but it's not cost effective at this time. You only have a finite die size. The cost of the die increases exponentially with area, so there's a hard cap on how big you can make the chip. The area that you would have spent on the DRAM could be used more profitably on other chip features. Perhaps in the future when we get many more transistors, it may be feasible.
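
    One common way to see why cost climbs faster than area is the simple Poisson yield model, where the chance that a die has no defects falls off exponentially with its area. The Python sketch below assumes that model and a made-up defect density; it is an illustration of the trend, not real fab data.

```python
import math


def relative_cost_per_good_die(area_cm2, defects_per_cm2):
    """Cost per working die, up to a constant factor.

    Assumes wafer cost is spread evenly over silicon area and that yield
    follows the Poisson model: yield = exp(-area * defect_density).
    Wafer-edge losses are ignored.
    """
    die_yield = math.exp(-area_cm2 * defects_per_cm2)
    return area_cm2 / die_yield


DEFECTS_PER_CM2 = 0.5  # hypothetical value, for illustration only

small = relative_cost_per_good_die(1.0, DEFECTS_PER_CM2)
large = relative_cost_per_good_die(2.0, DEFECTS_PER_CM2)
ratio = large / small
print(f"Doubling die area multiplies cost per good die by {ratio:.2f}")
```

    Under these assumptions a die twice as large costs well over twice as much per working part, which is the pressure against spending area on on-die DRAM.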

  21. Re:Am I missing something? by Indy1 · · Score: 2, Informative

    The P4's architecture is such that it is incredibly sensitive to cache misses due to its long pipeline (20, I believe). That's why higher memory bandwidth and larger caches make such a huge difference on the P4. The Athlons, by contrast, have a much shorter pipeline (12, I believe), so the extra memory bandwidth and cache don't help out as much.

    --
    Lawyers, MBA's, RIAA? A jedi fears not these things!
  22. Re:Possible Advertising Campaign? by cheezedawg · · Score: 4, Informative

    1) Get rid of 20-stage pipeline, it's too long for anything serious.

    No, it's not. In fact, according to this research, the P4 pipeline is not deep enough. That paper concludes that P4 performance could be improved by up to 90% by increasing the pipeline depth to around 50 stages and increasing the cache size.

    Do you actually think that Intel didn't know the consequences of increasing the pipeline depth? The Intel engineers didn't just guess on the P4 architecture- it was a very deliberate design decision. Judging by the P4's performance gains, it was a pretty good decision, too.

    --
    "The defense of freedom requires the advance of freedom" - George W Bush
  23. Re:extra 2 mb's? by Bored Huge Krill · · Score: 2, Informative
    Yes, it is in addition. There's a report at Anandtech that writes up some of the details:

    What is new is a 2MB L3 cache. It's made clear that this is in addition to the existing 512kB L2 cache on current P4s, making a total cache size of 2.5MB

  24. Re:Possible Advertising Campaign? by mrm677 · · Score: 2, Informative


    1) Get rid of 20-stage pipeline, it's too long for anything serious.


    No, it's not. It enables a high clock rate and with good branch prediction and selective replays, it is just fine.


    2) As a follow up to 1, try to actually get some work done in a clock cycle.


    Read some studies about the available ILP (instruction-level parallelism) in common applications. There really isn't much of it unless instruction windows are made huge, which isn't feasible. This is why simultaneous multithreading (hyperthreading) made it into an actual chip: it takes advantage of low ILP.


    3) Throw out the x86 ISA.


    Pentium4 is the leader of SpecINT. Not AMD, not Sun (RISC), not IBM (RISC), not MIPS (RISC). Some players, such as MIPS, don't have the resources to compete however IBM does.

    And look what's happening to Itanium. Disaster, even with its oh so elegant ISA.


    4) Look at the MIPS ISA.


    What about it? Yes, very clean and orthogonal. Intel and AMD have proven that an ISA is irrelevant to achieving high performance.


    5) Realize that it's actually possible to understand the MIPS arch, and that it still works great for multimedia, math, and general use.


    Realize that undergraduate computer architecture is simply an introduction.


    6) Buy the rights to the MIPS ISA, make small improvements (get rid of branch delay slot, load delay slot), speed it up, and design new Intel processors from the improved ISA.


    Unnecessary. Besides, I like software compatibility and binary translation just doesn't work at this level.


    7) Release versions of processors with 4MB Cache (2MB each I$, D$) for consumers, and 24MB Cache (8MB I$, 16MB D$) for servers/clustering/etc.


    2MB each for I$ and D$? Then you must be referring to L1 caches which are the only ones typically separated into data and instruction. And how many cycles would it take to access those caches?


    8) Release Motherboards for 1, 2, and 4 CPU configurations.
    9) ...
    10) Profit!


    Read Intel's annual report. They are quite profitable already and don't need advice from someone with a B.S. in EE or CS who marvels at how great the MIPS ISA is.

  25. Re:Processor-Intensive SW: Engineering Application by Moraelin · · Score: 2, Informative

    Well, honestly, nowadays "it beats a Sun box" doesn't even say much. It's just marginally more meaningful than "it beats my old ZX Spectrum."

    The Suns still struggle barely above 1 GHz, have a slow cache, and so on. It also doesn't help that they're still saddled with SDRAM memory, too. (At least in the case of the cheaper workstations, on a 32 bit memory bus too.) If we're talking programs that draw something, it also doesn't help that they're saddled with outdated _and_ overpriced video cards. And so on.

    Even without SSE, there's no way in heck for that UltraSparc III to keep up with a P4. E.g., Sun's Java doesn't even generate SSE code, and it still runs faster on Windows than on Solaris. Go figure.

    For all the BS about the advantages of 64 bits, the reality is that in 64 bit mode an UltraSparc actually runs _slower_. So be thankful that most of the apps for it (and certainly all benchmarks) really are compiled in 32 bit mode.

    Frankly, other than a few PHBs, and a couple of people who think they're some form of resistance against Wintel if they buy Suns, the rest of us don't even consider Sun to still be in the race any more.

    So yeah, your words about running Linux on an Athlon or Pentium reflect exactly what I'd say to anyone considering a Sun box: Get the cheapest PC that Dell sells, or build your own Duron system, install Linux on it, and there you go. You now have a Unix workstation, and it runs circles around any of Sun's workstations. Or, much as I'm no Mac fan, get a Mac. It'll be 64 bit, and based on BSD too.

    --
    A polar bear is a cartesian bear after a coordinate transform.