Slashdot Mirror


Dual Caches for Dual-core Chips

DominoTree writes "The dual-core chips that AMD and Intel plan to bring to market next year won't be sharing their memories. A version of Opteron coming in 2005 and Montecito, a future member of Intel's Itanium family also slated for next year, will both have two processor cores, the actual unit inside a processor that performs the calculations, and each core will have separate caches."

18 of 342 comments (clear)

  1. Note: Here, Single is Better by Anonymous Coward · · Score: 5, Informative

    In case it's not obvious to those who didn't read the article all the way through, it's a better thing when the memory is shared (single cache) rather than separate (dual cache). But that is harder to design, so for these first-generation dual-core chips from Intel and AMD, they are using separate caches for each core. (IBM's dual core Power4 processor has a unified cache.) At some point down the road, they will likely unify them to increase performance.

    1. Re:Note: Here, Single is Better by spuzzzzzzz · · Score: 5, Informative
      Are there situations where two caches might be better? For example, a multi-threaded application with two memory-intensive threads, each locked down onto a specific CPU?

      Not really. The problem with 2 caches is duplication. It is quite probable that both cores will want to work on the same thing, in which case cache space will be wasted. It also creates timing complications when one core wants to write to its cache because the other core will have to be told to invalidate its relevant cache entry. On the other hand, you could create a single cache with double the size. This would make sharing memory between CPUs simpler and it wouldn't significantly increase access times (so the situation you mentioned wouldn't be affected). The argument for double caches is about cost, scalability and design simplicity, not performance.

      --

      Don't you hate meta-sigs?
    2. Re:Note: Here, Single is Better by jackb_guppy · · Score: 4, Informative

      The PPC4 does not have single cache...

      There a L1 caches for both cores.

      There are 3 L2 caches hooked to cross bar switch for speed flowing data into and out of the L1

      There is a single L3 controller overseeing 2 L3 external memory banks.

      Then there is two busses to 2 main memory.

      And 3 interconnects to 3 other dual core chips that make a single 8way processor block.

      And 4 busses inter connecting 4 of these 8way to make a 32way machine, with dual IO channels to hardware!

    3. Re:Note: Here, Single is Better by hattig · · Score: 4, Informative

      No no no no.

      That's all wrong.

      The Opteron has always supported dual cores, and it isn't via "internal hypertransport", the internal crossbar connects to the SysReq that supports two cores attached directly. You cannot attach a shared cache dual core to this design. Each core must have its own individual L2 cache. This is why you could have an 8 processor Opteron system with dual-cores for 16 cores in total despite the fact that the current Opteron can only do 8 processors at the most glueless. Oh, and Hypertransport doesn't connect to memory either, the memory controller is something else connected to the internal crossbar.

      And for the Opteron this is a good design. As the cores are on the same chip, cache coherency will be done at the speed of the processor and not be limited by inter-processor bandwidth. It really isn't a problem at all that the cores each have their own individual cache. At least they aren't competing with each other for cache bandwidth. The only bad point is that a core cannot have the option of using up to 2MB of shared cache - not as big a problem as it might sound, 1MB is doing very well for Opteron, and the on-die memory controllers negate a lot of the latency penalty for main memory access.

    4. Re:Note: Here, Single is Better by randyest · · Score: 4, Informative

      Interconnect delay (latency) is reduced. Signals propagate traces on a die (silicon chip) are orders-of-magnitude faster than printed-circuit board (PCB) traces.

      That means you can get more bandwidth with silicon than a circuit board (each of reasonable size using modern components/processes.)

      Also, it takes a lot less power to run lower-voltage drivers on low loads (little resistance and capactiance on die compared to a PCB.)

      So, why not stack everything on onw chip? Cost of a chip rises exponentially with die size. Up to about 20mm^2, it's feasible (but pricy) bigger dice are very hard to make, result in lower yields, and hence cost a lot more.

      --
      everything in moderation
  2. Different core models by SIGALRM · · Score: 5, Informative
    The dual-core chips that Advanced Micro Devices and Intel plan to bring to market next year won't be sharing their memories
    As I understand it, the rationale behind Opteron's "Direct Connect" dual-core architecture is to make it easier to place two processor cores on the same silicon die. It's also a power-consupmtion issue, as the two processors can run at lower clock speeds. However, unlike Intel's design, Direct Connect features an integrated memory controller and hypertransport interconnects that connect the processor to the I/o port or directly to another processor.
    --
    Sigs cause cancer.
  3. Re:Confused by dougmc · · Score: 4, Informative
    No, you're not missing the point.

    The benefit is that you get two CPUs in less space. You might even be able to get two CPUs in a system designed to support only one (because it has only one slot.) And if your system already has two CPU slots, this might give you four CPUs.

    It might also use less power than two CPUs, but I wouldn't hold my breath on that one.

  4. Re:Licensing Issues? by Ianoo · · Score: 5, Informative

    When hyperthreading was released, the industry had to cope with similar issues. Those of us using operating systems with artificial limits imposed on the number of possible processors used in a system had to wait for software updates to fix detection. I'm sure that the same thing will happen again, undoutedly there will be some flag in a register somewhere that identifies whether a processor is part of a dual-core chip or just a single CPU on its own. The OS or software can just read this in and work out whether there is sufficient licensing to use them.

  5. Re:Confused by ERJ · · Score: 5, Informative

    Kinda. I could see a couple advantages though:

    1) Fast interconnect between chips. Instead of having to transfer data over the bus, if the CPU needed info from the other CPU it could transfer over a high speed connection without having to involve other parts of the machine (bus). AMD already has a sort of high speed interconnect to their multi-cpu motherboards instead of splitting like intel does but I would imagine that this would still be faster.

    2) Less motherboard room needed. You don't need dual cooling fans, dual power / interface lines and have more room overall on the motherboard.

  6. Re:How is this different from a two processor syst by hawkbug · · Score: 5, Informative

    It's not much different - that's the point. 2 processors in a single socket, saves a lot of money production wise, and that should pass onto the consumer. AMD has said their's is backward comaptible, and that's huge. You already got a single cpu opteron workstation? Well now you can have a dual cpu one for the price of a single cpu upgrade. That kicks ass.

  7. Re:Itanium? (somewhat off-topic) by Anonymous Coward · · Score: 5, Informative

    Despite what Sun has to say on the matter, Itanium system and processor sales have been increasing steadily since 2H,2000prior to that, there was a big lull in demand because few wanted to buy underperforming Itanium 1 machines when the Itanium 2 was expected rather soon (and announced relatively early).

    Today, in contrast, there _doesn't_ appear to a lull in demand for Itanium 2 machines, even though Montecito (Itanium 3) has been announced in a fair bit of detail. That's because for some applications (in HPC, high-end database work, certain EDA/CAD/CAE work, and ultra-high-reliability computing) Itanium 2 systems are basically unbeatable. They also run some OSes which are very important to some organizations, such as HP-UX and OpenVMS.

    Long story short, the Itanium 1 was something of a flop, the Itanium 2 is really pretty decent, and everyone is expecting the Itanium 3 to offer pretty decent _price/performance_, in addition to best-bar-none performance when it is released next year.

  8. Re:mmmm cores by iNiTiUM · · Score: 5, Informative

    Sure you can
    Oh you want one for the AMD64?
    How about these?

    --
    When encryption is outlawed, ou++1!@(93j++js-d9298yIUH(*Y24JKB!~
  9. Re:mmmm cores by kennedy · · Score: 4, Informative

    wrong. the ps2 has a 64bit MIPS cpu with *128bit extentions*. Think MMX or SSE.

  10. The down-side to this.... by NerveGas · · Score: 4, Informative


    The downside is that as the AMD chips are going to be backward-compatible with older boards, I imagine that the dual-core chip will still only have the single 128-bit memory controller.

    While that will still give you twice as many available CPU iterations, that means that the two cores will be fighting for memory bandwidth. In the case of Intel's chips, that's business-as-usual: But for the Opterons, where each processor brings its own memory controller, it just doesn't feel right. : (

    steve

    --
    Oh, you're not stuck, you're just unable to let go of the onion rings.
  11. Re:Dual core - what's the point? by drinkypoo · · Score: 4, Informative

    Hyperthreading is simply a second context. It lets you run a second thread at the same time by using the unutilized capacity of existing functional units and is largely useful only when intel's branch prediction fails and the chip would otherwise be paying the ultimate penalty for its long, long, LONG pipeline.

    In other words, HT is an ingenious method for making up for the fact that the pentium 4 is horribly inefficient.

    It would be better to stick a whole bunch of simple cores on a single chip at a lower clock rate and have them work cooperatively, if only we used more multithreading. This is pretty much where intel is planning to go, with their multiple-core chips based on the Pentium-M. Or, so the rumors say.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  12. Re:The G5 uses hypertransport... by shawnce · · Score: 4, Informative

    Actually they don't use the same bus technology.

    The G5 (PPC970/970FX) has a two 32 bit wide buses one going in each direction from the CPU and they have a data rate at half that of the CPUs clock rate. At a clock rate of 2.5GHz the bus is capable of a max theoretical throughput of 5GB/s each direction or 10GB/s in total (that is per CPU). Real world throughput is around 8 GB/s per CPU at 2.5GHz because of address/command overhead. Apple/IBM terms this the elastic bus and it is not HT based.

    For more information see this block diagram referenced from this hardware tech note.

    Anyway the the post you are replying to is incorrect about each CPU having its own RAM. That is not true. Each CPU has it own independent bus to the memory controller (U3/U3H) and that controller has a dual channel connection memory capable of 6.4GB/s a second (DIMMS are required to be added in pairs to allow for a 128 bit wide path to memory). The U3 chip is basically cross bar like internally allowing for a few point-to-point connections to be taking place between its various interfaces (CPU to CPU, AGP to memory, etc.).

    HT is used for as a secondary interconnect to relatively lower bandwidth devices in the IO chain.

  13. Re:mmmm cores by shawnce · · Score: 4, Informative

    Pulling in a post of mine from a completely different forum...

    The G5 is a 64 bit processor and OSX Panther is a 64 bit OS. :)

    Panther is not a true 64 bit OS in the traditional sense of the word. It does not support 64 bit addressing[1]. It does however support the use of 64 bit math operations and the saving of related registers on the CPU.

    Tiger (Mac OS 10.4) will have the first steps towards a true 64 bit OS by allowing 64 bit addressing (virtual addressing) to be used for libSystem only based tools (command line applications, no GUIs, etc.). At least that is all that Apple has so far committed to doing in Tiger at this time (cannot say more because of NDA).

    [1] Note the Panther kernel has support for 64 bit physical addressing so the system can utilize greater then 4 GBs of RAM (hardware wise supporting up to 16 GB of RAM) but it does not support 64 bit virtual addressing (what applications use) at this time.

  14. Re:Day late, dollar short... by kscguru · · Score: 4, Informative
    These chips must share a relatively slow memory bus with other devices.

    No... on AMD chips the memory bus is dedicated. Intel chips have a very different system architecture (which does saturate at ~2 CPUs), but AMD gives each chip its own memory controller and memory - scales perfectly. (By the way, this isn't new ... big iron (e.g. Sparc) has been doing this for years).

    Currently, the fastest FSB to date is 1033MHz - almost 1/3 of the max clock speed of the processor. Given that Intel's integer units operate at twice the clock speed, the fastest parts of the chip operate at 6 times faster than memory.

    That's why modern processors use pipelining (in x86, since 486's) and caches (since, uh, 8086s ?). FSB only comes into play in 1-2% of the memory accesses. But those memory accesses are pipelined, interleaved, with multiple outstanding requests issued by the out-of-order pipeline ... processor designers have been working around a slow bus for years, and the FSB is only the bottleneck in extreme, pathological cases.

    The monolithic, synchrous, central-processing-unit design of the architecture prohibits optimizations such as using memory controllers for block moves and having dedicated IO processors

    Ever heard of DMA? A DMA controller does that memory transfer ... there are 2 DMA controllers with 8 channels on your current x86 PC. Heck, high-end PCI cards even have their own onboard DMA engines (it's called bus-mastering). I/O offload? You've obviously never written a device driver... modern drivers issue a few "start" instructions, then sleep; eventually the device completes the I/O and issues an interrupt to inform the CPU it's done. The last computer I had that stalled on disk I/O was running MS-DOS - nine years ago.

    In all fairness, I thought exactly the same things four years ago. Then I learned about modern computer architecture. And in today's world (and, in fact, all PCs for the past ten years), your points are completely - and utterly - irrelevant.

    --

    A witty [sig] proves nothing. --Voltaire