Dual Caches for Dual-core Chips

Note: Here, Single is Better by Anonymous Coward · 2004-08-26 10:10 · Score: 5, Informative

In case it's not obvious to those who didn't read the article all the way through, it's a better thing when the memory is shared (single cache) rather than separate (dual cache). But that is harder to design, so for these first-generation dual-core chips from Intel and AMD, they are using separate caches for each core. (IBM's dual core Power4 processor has a unified cache.) At some point down the road, they will likely unify them to increase performance.

Re:Note: Here, Single is Better by spuzzzzzzz · 2004-08-26 10:49 · Score: 5, Informative

Are there situations where two caches might be better? For example, a multi-threaded application with two memory-intensive threads, each locked down onto a specific CPU?
Not really. The problem with 2 caches is duplication. It is quite probable that both cores will want to work on the same thing, in which case cache space will be wasted. It also creates timing complications when one core wants to write to its cache because the other core will have to be told to invalidate its relevant cache entry. On the other hand, you could create a single cache with double the size. This would make sharing memory between CPUs simpler and it wouldn't significantly increase access times (so the situation you mentioned wouldn't be affected). The argument for double caches is about cost, scalability and design simplicity, not performance.
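
The duplication and invalidation costs described above can be sketched with a toy model. This is only an illustration under stated assumptions: two private caches modeled as dictionaries, a write-through memory, and a simple write-invalidate rule. The class and function names are hypothetical and do not describe any real chip's coherence protocol.

```python
from collections import Counter


class PrivateCaches:
    """Toy model of per-core private caches with write-invalidate coherence."""

    def __init__(self, n_cores=2):
        self.memory = {}
        self.caches = [dict() for _ in range(n_cores)]
        self.invalidations = 0

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:
            # Miss: fetch a private copy of the line from memory.
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, core, addr, value):
        # Every other core holding this line has to be told to drop it.
        for other, cache in enumerate(self.caches):
            if other != core and addr in cache:
                del cache[addr]
                self.invalidations += 1
        self.caches[core][addr] = value
        self.memory[addr] = value  # write-through, to keep the toy simple

    def duplicated_lines(self):
        # Count lines that occupy space in more than one private cache.
        counts = Counter(addr for cache in self.caches for addr in cache)
        return sum(1 for n in counts.values() if n > 1)


def demo():
    system = PrivateCaches()
    for addr in range(8):          # both cores work on the same 8 lines
        system.read(0, addr)
        system.read(1, addr)
    duplicated = system.duplicated_lines()
    for addr in range(8):          # core 0 then writes every shared line
        system.write(0, addr, addr * 10)
    return duplicated, system.invalidations


if __name__ == "__main__":
    print(demo())
```

In this toy, a single shared cache would hold each of the 8 lines once and would need no invalidation messages at all, which is the point the post is making.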

--

Don't you hate meta-sigs?
Re:Note: Here, Single is Better by jackb_guppy · 2004-08-26 11:05 · Score: 4, Informative

The Power4 does not have a single cache...

There are L1 caches for both cores.

There are 3 L2 caches hooked to a crossbar switch to speed the flow of data into and out of the L1s.

There is a single L3 controller overseeing 2 L3 external memory banks.

Then there are two busses to 2 main memories.

And 3 interconnects to 3 other dual core chips that make a single 8way processor block.

And 4 busses interconnecting 4 of these 8way blocks to make a 32way machine, with dual IO channels to hardware!
Re:Note: Here, Single is Better by hattig · 2004-08-26 11:07 · Score: 4, Informative

No no no no.

That's all wrong.

The Opteron has always supported dual cores, and it isn't via "internal hypertransport": the internal crossbar connects to the SysReq, which supports two directly attached cores. You cannot attach a shared-cache dual core to this design. Each core must have its own individual L2 cache. This is why you could have an 8-processor Opteron system with dual cores, for 16 cores in total, despite the fact that the current Opteron can only do 8 processors at most gluelessly. Oh, and Hypertransport doesn't connect to memory either; the memory controller is something else connected to the internal crossbar.

And for the Opteron this is a good design. As the cores are on the same chip, cache coherency will be done at the speed of the processor and not be limited by inter-processor bandwidth. It really isn't a problem at all that the cores each have their own individual cache. At least they aren't competing with each other for cache bandwidth. The only bad point is that a core cannot have the option of using up to 2MB of shared cache - not as big a problem as it might sound, 1MB is doing very well for Opteron, and the on-die memory controllers negate a lot of the latency penalty for main memory access.
Re:Note: Here, Single is Better by spuzzzzzzz · 2004-08-26 12:09 · Score: 2, Informative

If (1) your kernel did a perfect job of keeping a single process/thread confined to a single CPU and (2) none of your processes/threads were sharing the same memory then a dual cache would perform about the same as a single cache. The main problem here is number (2). When you're running a multi-threaded program, even if the kernel manages the CPUs perfectly, the threads will usually want to share some memory. They may need to pass information to each other or they may just be working on the same data set. In either case, a dual cache makes things worse.

There are plenty of situations where a single cache would perform better than a dual cache, but there are (almost?) no situations where a dual cache would perform better than a single cache. Hence single cache is better performance-wise.

Bear in mind, of course, that this is speculation. I have not carried out any benchmarks of single cache versus dual cache and I don't think there are any publicly available benchmarks comparing them. From a purely academic point of view, however, single cache wins.

--

Don't you hate meta-sigs?
Re:Note: Here, Single is Better by randyest · 2004-08-26 12:45 · Score: 4, Informative

Interconnect delay (latency) is reduced. Signals propagating along traces on a die (silicon chip) are orders of magnitude faster than along printed-circuit board (PCB) traces.

That means you can get more bandwidth with silicon than a circuit board (each of reasonable size using modern components/processes.)

Also, it takes a lot less power to run lower-voltage drivers on low loads (little resistance and capacitance on die compared to a PCB.)

So, why not stack everything on one chip? Cost of a chip rises exponentially with die size. Up to about 20mm^2, it's feasible (but pricey); bigger dice are very hard to make, result in lower yields, and hence cost a lot more.
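
To put rough numbers on the die-size argument, here is a minimal sketch using a simple Poisson yield model. It is not a claim about any manufacturer's real figures: the defect density, wafer cost, and wafer diameter defaults are all hypothetical, and edge loss on the wafer is ignored.

```python
import math


def cost_per_good_die(die_area_cm2, defects_per_cm2=0.5,
                      wafer_cost=5000.0, wafer_diameter_cm=30.0):
    """Estimated cost of one working die (all defaults are made-up values)."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    dies_per_wafer = wafer_area / die_area_cm2          # ignores edge loss
    # Poisson model: chance that a die of this area has zero defects.
    yield_fraction = math.exp(-defects_per_cm2 * die_area_cm2)
    return wafer_cost / (dies_per_wafer * yield_fraction)


if __name__ == "__main__":
    for area in (1.0, 2.0, 4.0):
        print(f"{area:.0f} cm^2 die: ${cost_per_good_die(area):.2f} per good die")
```

Under this model the cost per good die grows as area times an exponential in area, so doubling the die more than doubles the cost, which matches the post's point about large dice.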

--
everything in moderation
Re:Note: Here, Single is Better by timeOday · 2004-08-26 13:10 · Score: 2, Informative

it's a better thing when the memory is shared (single cache) rather than separate (dual cache).
Yeah, if the dual cache could be shared and still run without added latency or decreased bandwidth. That doesn't mean a different chip with a unified cache would be faster though.
Also, the same is true of dual cores in the first place. It would be better to have a single processor (without dual cores) if it could be twice as fast. Unfortunately, chip designers seem to be running out of ways to usefully employ all the transistors Moore's law is giving them. Now they're resorting to designs that employ parallelism, which is relatively easy to do, but harder to exploit in software, and sometimes hardly useful at all.

Different core models by SIGALRM · 2004-08-26 10:12 · Score: 5, Informative

The dual-core chips that Advanced Micro Devices and Intel plan to bring to market next year won't be sharing their memories

As I understand it, the rationale behind Opteron's "Direct Connect" dual-core architecture is to make it easier to place two processor cores on the same silicon die. It's also a power-consumption issue, as the two processors can run at lower clock speeds. However, unlike Intel's design, Direct Connect features an integrated memory controller and HyperTransport interconnects that connect the processor to the I/O port or directly to another processor.

--
Sigs cause cancer.

Re:mmmm cores by Anonymous Coward · 2004-08-26 10:13 · Score: 2, Informative

Here you go. Works on dual-core, separate cache chips already. (HP PA-8800)

Non-news event by doormat · 2004-08-26 10:15 · Score: 3, Informative

I saw this article at another website earlier today, and I thought this wasn't really important. Each core should have its own cache; that's exactly what a dual-core chip is. Not twice as many execution units crammed into the same space, or some other funny configuration: it's two separate chips on the same die, perhaps with some modifications for inter-processor communication, but that's about it. With AMD's core design, you have the physical layer only of the HyperTransport bus to connect the chips, and the integrated memory controller has one or two ports to talk to memory (single/dual channel) and two ports to talk to two separate chips. It will be interesting to see if AMD couples dual-core chips with DDR2-667 or DDR2-800. That would make the most sense, so as to keep the memory controller from being the bottleneck, as opposed to the system bus on the Intel side.

--
The Doormat

If you're not outraged, then you're not paying attention.

Re:Non-news event by silas_moeckel · 2004-08-26 10:53 · Score: 2, Informative

It is better. As another poster pointed out, and I'll concur, unified cache is better than separate but a lot harder to make, so the first-generation dual-core chips will not use it. Expect the second generation to have a larger unified cache.

--
No sir I dont like it.

Re:Confused by dougmc · 2004-08-26 10:16 · Score: 4, Informative

No, you're not missing the point.

The benefit is that you get two CPUs in less space. You might even be able to get two CPUs in a system designed to support only one (because it has only one slot.) And if your system already has two CPU slots, this might give you four CPUs.

It might also use less power than two CPUs, but I wouldn't hold my breath on that one.

Re:Licensing Issues? by Ianoo · 2004-08-26 10:17 · Score: 5, Informative

When hyperthreading was released, the industry had to cope with similar issues. Those of us using operating systems with artificial limits imposed on the number of possible processors used in a system had to wait for software updates to fix detection. I'm sure that the same thing will happen again. Undoubtedly there will be some flag in a register somewhere that identifies whether a processor is part of a dual-core chip or just a single CPU on its own. The OS or software can just read this in and work out whether there is sufficient licensing to use them.

Re:Confused by ERJ · 2004-08-26 10:17 · Score: 5, Informative

Kinda. I could see a couple advantages though:

1) Fast interconnect between chips. Instead of having to transfer data over the bus, if the CPU needed info from the other CPU it could transfer over a high speed connection without having to involve other parts of the machine (bus). AMD already has a sort of high speed interconnect to their multi-cpu motherboards instead of splitting like intel does but I would imagine that this would still be faster.

2) Less motherboard room needed. You don't need dual cooling fans, dual power / interface lines and have more room overall on the motherboard.

Re:How is this different from a two processor syst by hawkbug · 2004-08-26 10:19 · Score: 5, Informative

It's not much different - that's the point. 2 processors in a single socket saves a lot of money production-wise, and that should pass on to the consumer. AMD has said theirs is backward compatible, and that's huge. You already got a single cpu opteron workstation? Well now you can have a dual cpu one for the price of a single cpu upgrade. That kicks ass.

Re:Confused by Lord+Kano · 2004-08-26 10:19 · Score: 3, Informative

I'm not a hardware pro, but is this basically the same as having two separate chips, or am I missing the point here?

Pretty much the same thing as having two processors, but once things are running at proper capacity, it will be cheaper to put two cores on one chip. In part because you won't have to reproduce the underlying electronics. The motherboards will also be cheaper. One socket means less money spent on R&D. If and when someone releases a dual socket/quad core motherboard it will be cheaper to design and build than a quad socket board.

LK

--
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano

Re:Itanium? (somewhat off-topic) by Anonymous Coward · 2004-08-26 10:20 · Score: 5, Informative

Despite what Sun has to say on the matter, Itanium system and processor sales have been increasing steadily since 2H 2000. Prior to that, there was a big lull in demand because few wanted to buy underperforming Itanium 1 machines when the Itanium 2 was expected rather soon (and announced relatively early).

Today, in contrast, there _doesn't_ appear to be a lull in demand for Itanium 2 machines, even though Montecito (Itanium 3) has been announced in a fair bit of detail. That's because for some applications (in HPC, high-end database work, certain EDA/CAD/CAE work, and ultra-high-reliability computing) Itanium 2 systems are basically unbeatable. They also run some OSes which are very important to some organizations, such as HP-UX and OpenVMS.

Long story short, the Itanium 1 was something of a flop, the Itanium 2 is really pretty decent, and everyone is expecting the Itanium 3 to offer pretty decent _price/performance_, in addition to best-bar-none performance when it is released next year.

Re:mmmm cores by EvilTwinSkippy · 2004-08-26 10:22 · Score: 3, Informative

OS X, or if you hate Apple, NetBSD.

Solaris.

The Playstation 2 is actually 128 bit. But that doesn't really count as an OS...

--
"Learning is not compulsory... neither is survival."
--Dr.W.Edwards Deming

Re:Licensing Issues? by elmegil · 2004-08-26 10:23 · Score: 2, Informative

A typical vendor, Oracle, when talking about a different chip (the newest SPARC chips) says "yes you must pay for each core". I would be surprised if many vendors with such licensing schemes have any other answer.

--
7 November 2006: The day Americans realized corruption and incompetence weren't addressing 11 September 2001

AMD seems more promising by leathered · 2004-08-26 10:24 · Score: 3, Informative

Luckily for AMD, the Opteron/A64 was designed with dual-core in mind. As I understand it both cores will talk to each other via an internal Hypertransport link and (as with current Opterons) together with the internal memory controller will eliminate the need for an external northbridge. It is also expected that upon release they will drop directly into existing motherboards with nothing more than a BIOS upgrade.

Intel will find things more challenging. Both cores will have to contend for the GTL bus, currently the Achilles heel of their MP solutions, by communicating via an external northbridge.

--
For all intensive porpoises your a bunch of rediculous loosers

Re:mmmm cores by iNiTiUM · 2004-08-26 10:26 · Score: 5, Informative

Sure you can
Oh you want one for the AMD64?
How about these?

--
When encryption is outlawed, ou++1!@(93j++js-d9298yIUH(*Y24JKB!~

Re:Confused by Anonymous Coward · 2004-08-26 10:29 · Score: 2, Informative

I doubt the dual core processors will be socket compatible with existing single core processors, so you will be unlikely to be able to upgrade an existing motherboard to dual processor just by dropping in a different CPU. It is possible they will come out with new socket designs which can accommodate either dual or single core CPUs, but I wouldn't bet heavily on it.

The benefit, as you say, is in space, with possibly a small amount in power consumption, but I'd agree not to hold your breath, and even if it did, probably not a lot. Space savings isn't that big an issue for desktop systems in most cases, but it is a huge issue in things like blade servers and even 1U/2U rackmount servers. Fitting two huge CPU sockets, as well as all of the heat sinks and fans necessary into a 1U case is a real challenge, so only needing 1 socket and one heat sink is a huge win there.

The other benefit you don't mention is likely to be in cost, as a dual core CPU will probably be cheaper to manufacture than two single core CPUs. A significant portion of the cost of a CPU is the packaging (not the cardboard box, the ceramic casing, pins, heat spreader, etc), and the labor costs to put the chip into that package, if you only have to do that once, it will save money. Likely AMD and Intel won't pass all those savings along to us, so their margin on dual core CPUs will probably be higher, which is undoubtedly the reason they are pushing so hard in that direction.

Re:mmmm cores by kennedy · 2004-08-26 10:30 · Score: 4, Informative

wrong. the ps2 has a 64bit MIPS cpu with *128bit extensions*. Think MMX or SSE.

The down-side to this.... by NerveGas · 2004-08-26 10:34 · Score: 4, Informative

The downside is that as the AMD chips are going to be backward-compatible with older boards, I imagine that the dual-core chip will still only have the single 128-bit memory controller.

While that will still give you twice as many available CPU iterations, that means that the two cores will be fighting for memory bandwidth. In the case of Intel's chips, that's business-as-usual: But for the Opterons, where each processor brings its own memory controller, it just doesn't feel right. : (

steve

--
Oh, you're not stuck, you're just unable to let go of the onion rings.

Re:Itanium? (somewhat off-topic) by csimpkin · 2004-08-26 10:35 · Score: 2, Informative

The problem with the Itanium was that Intel didn't release an optimizing compiler with or before the Itanium. I believe (corrections welcome) that instructions are grouped in 'packets' (I forget the term used) that the Itanium can run in parallel. The problem is that only certain instructions can be bundled together. When older compilers are used the instructions are generated in a way that only a few or even just one instruction is in a 'packet'. So, the problem was that the processor wasn't being used to its fullest potential. I have never compared the Itanium 1 and 2. But, I would guess that the Itanium 2 was primarily released to give the Itanium line a fresh start with an optimizing compiler.
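
The effect of packing instructions into 'packets' can be sketched with a toy scheduler. This is a loose illustration only: the three-slot packet size, the rule that an instruction cannot share a packet with something it depends on, and the instruction names are all assumptions of the sketch, and the real Itanium bundling rules are more involved.

```python
SLOTS_PER_PACKET = 3  # toy assumption for this sketch


def naive_packets(instrs):
    """A non-optimizing compiler: one instruction per packet."""
    return [[name] for name, _deps in instrs]


def packed_packets(instrs):
    """Greedy in-order packing: start a new packet only when forced to."""
    packets, current = [], []
    for name, deps in instrs:
        depends_on_current = any(d in current for d in deps)
        if depends_on_current or len(current) == SLOTS_PER_PACKET:
            packets.append(current)
            current = []
        current.append(name)
    if current:
        packets.append(current)
    return packets


# Hypothetical program: (instruction name, names it depends on)
PROGRAM = [
    ("load_a", ()),
    ("load_b", ()),
    ("load_c", ()),
    ("add_ab", ("load_a", "load_b")),
    ("mul_c2", ("load_c",)),
    ("store", ("add_ab", "mul_c2")),
]

if __name__ == "__main__":
    print(len(naive_packets(PROGRAM)), "packets without packing")
    print(len(packed_packets(PROGRAM)), "packets with packing")
```

With this made-up program the packed version needs half as many packets, which is the kind of gap the post attributes to the missing optimizing compiler.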

Re:Yeah... by Carnildo · 2004-08-26 10:42 · Score: 2, Informative

but will it make coffee? I didn't think so.

Given that the power output of a single-core Prescott is 100 watts or more, a dual-core with separate caches will put out 200+ watts. Clock up the speed a bit more, and you'll be at about 300 watts.

I figure that's probably enough to boil a cup of coffee.

--
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.

There already is one. by Medievalist · 2004-08-26 10:51 · Score: 2, Informative

VMS went 64-bit at least a decade ago.

Great OS for English-speaking folk, despite Linus's hatred for it.

Re:Dual core - what's the point? by ArbitraryConstant · 2004-08-26 10:54 · Score: 2, Informative

Hyperthreading is not a better solution, particularly when dealing with the Intel implementation. Unless it's very carefully done, all it does is keep the cache from working effectively. Linux and FreeBSD actually got performance improvements from leaving one of the virtual processors idle when there were more processes scheduled to run. When there's two threads of the same process, they let them both run because those tend to have better locality of reference and therefore don't thrash the cache so much.

Processor designers are in a different situation now than 10 years ago. They've got more transistors than they know what to do with, so adding cache and adding another core are cheap. Streamlining one core to run faster is much harder, as evidenced by Intel's unending troubles with anything faster than 3.2 GHz.

--
I rarely criticize things I don't care about.

Re:Dual core - what's the point? by drinkypoo · 2004-08-26 10:55 · Score: 4, Informative

Hyperthreading is simply a second context. It lets you run a second thread at the same time by using the unutilized capacity of existing functional units and is largely useful only when intel's branch prediction fails and the chip would otherwise be paying the ultimate penalty for its long, long, LONG pipeline.

In other words, HT is an ingenious method for making up for the fact that the pentium 4 is horribly inefficient.

It would be better to stick a whole bunch of simple cores on a single chip at a lower clock rate and have them work cooperatively, if only we used more multithreading. This is pretty much where intel is planning to go, with their multiple-core chips based on the Pentium-M. Or, so the rumors say.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Re:Commodity hardware grows mature. by Anonymous Coward · 2004-08-26 11:25 · Score: 1, Informative

Please. Dual.

Re:The G5 uses hypertransport... by shawnce · 2004-08-26 11:42 · Score: 4, Informative

Actually they don't use the same bus technology.

The G5 (PPC970/970FX) has two 32-bit-wide buses, one going in each direction from the CPU, and they have a data rate of half the CPU's clock rate. At a clock rate of 2.5GHz the bus is capable of a max theoretical throughput of 5GB/s in each direction or 10GB/s in total (that is per CPU). Real-world throughput is around 8 GB/s per CPU at 2.5GHz because of address/command overhead. Apple/IBM terms this the elastic bus and it is not HT based.

For more information see this block diagram referenced from this hardware tech note.

Anyway, the post you are replying to is incorrect about each CPU having its own RAM. That is not true. Each CPU has its own independent bus to the memory controller (U3/U3H), and that controller has a dual-channel connection to memory capable of 6.4GB/s (DIMMs are required to be added in pairs to allow for a 128-bit-wide path to memory). The U3 chip is basically crossbar-like internally, allowing a few point-to-point connections to take place at once between its various interfaces (CPU to CPU, AGP to memory, etc.).

HT is used as a secondary interconnect to relatively lower bandwidth devices in the IO chain.
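
The peak figures quoted in this post follow from width times transfer rate, which can be checked with a short sketch. The helper name is hypothetical, and the 400 million transfers per second used for the memory path is an assumption made here to reproduce the 6.4GB/s figure; the post itself only gives the width and the result.

```python
def peak_gb_per_s(width_bits, transfers_per_s):
    """Theoretical peak: bytes per transfer times transfers per second."""
    return width_bits / 8 * transfers_per_s / 1e9


cpu_clock_hz = 2.5e9
bus_transfers_per_s = cpu_clock_hz / 2      # bus data rate is half the CPU clock

per_direction = peak_gb_per_s(32, bus_transfers_per_s)
both_directions = 2 * per_direction

# Assumed: 128-bit path at 400 million transfers per second.
memory_path = peak_gb_per_s(128, 400e6)

if __name__ == "__main__":
    print(per_direction, both_directions, memory_path)
```

One thing the arithmetic makes visible: the two CPU buses together can theoretically move more data than the single memory path can supply, so memory rather than the processor bus is the narrower link in this design.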

Re:mmmm cores by shawnce · 2004-08-26 12:13 · Score: 4, Informative

Pulling in a post of mine from a completely different forum...

The G5 is a 64 bit processor and OSX Panther is a 64 bit OS. :)

Panther is not a true 64 bit OS in the traditional sense of the word. It does not support 64 bit addressing[1]. It does however support the use of 64 bit math operations and the saving of related registers on the CPU.

Tiger (Mac OS 10.4) will have the first steps towards a true 64 bit OS by allowing 64 bit addressing (virtual addressing) to be used for libSystem only based tools (command line applications, no GUIs, etc.). At least that is all that Apple has so far committed to doing in Tiger at this time (cannot say more because of NDA).

[1] Note the Panther kernel has support for 64 bit physical addressing so the system can utilize greater than 4 GB of RAM (hardware wise supporting up to 16 GB of RAM) but it does not support 64 bit virtual addressing (what applications use) at this time.
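
The 4 GB boundary mentioned in the footnote comes straight from address width, as this small sketch shows. The function names are hypothetical, and sizes are computed in binary units (2^30 bytes) as a simplifying assumption.

```python
def addressable_gib(address_bits):
    """How much memory an address of this width can reach, in GiB."""
    return 2 ** address_bits / 2 ** 30


def bits_needed(gib):
    """Smallest address width that can reach every byte of `gib` GiB."""
    return (gib * 2 ** 30 - 1).bit_length()


if __name__ == "__main__":
    print(addressable_gib(32))   # reach of a 32-bit virtual address
    print(bits_needed(16))       # width needed to address 16 GiB of RAM
```

So a kernel can manage 16 GB of physical RAM with wider physical addresses while each application, still using 32-bit virtual addresses, sees at most 4 GB at a time.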

Re:Yield question by mercuryresearch · 2004-08-26 12:30 · Score: 3, Informative

The manufacturers have the choice of using multichip module packaging (common in notebook graphics controllers, for example) or single die, however it is my current understanding we're talking a single die.

They very likely WILL disable the dud and sell them as single core CPUs. This is how the "value" brands (Celeron, ex-Duron, and now Sempron) are typically created -- when there's a defect in the processor cache (which is a very large area of the die, and thus more likely to have a defect), the faulty bank(s) are turned off via fusing, creating a CPU with a smaller cache.

This is all pretty standard yield management.

Also, your calculations are very close to being correct. While the manufacturers closely guard their yield information, you're in the ballpark -- and it's interesting to note that according to my estimates Intel's Celeron volumes approximately mirror your computed single-core yield percentage... meaning it will likely be business as usual in our dual core future.

BTW, if you're interested in computing yield values there's an excellent model to be had in one of the chapters in Hennessy and Patterson's _Computer Architecture: A Quantitative Approach_
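
For readers who want to try the "disable the dud core" arithmetic themselves, here is a minimal sketch. It uses a plain Poisson defect model, which is not necessarily the model in the book mentioned above, and it assumes defects hit the two cores independently and ignores any logic shared between them. The function name and the example numbers are hypothetical.

```python
import math


def dual_core_outcomes(core_area_cm2, defects_per_cm2):
    """Probabilities for a two-core die under an independent Poisson model."""
    p_good = math.exp(-defects_per_cm2 * core_area_cm2)  # one core defect-free
    return {
        "both_good": p_good * p_good,             # sell as dual core
        "one_good": 2 * p_good * (1 - p_good),    # fuse off the dud, sell as single
        "none_good": (1 - p_good) ** 2,           # scrap
    }


if __name__ == "__main__":
    # Made-up inputs: 1 cm^2 per core, 0.5 defects per cm^2.
    for outcome, prob in dual_core_outcomes(1.0, 0.5).items():
        print(f"{outcome}: {prob:.1%}")
```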

Re:Day late, dollar short... by kscguru · 2004-08-26 15:36 · Score: 4, Informative

These chips must share a relatively slow memory bus with other devices.

No... on AMD chips the memory bus is dedicated. Intel chips have a very different system architecture (which does saturate at ~2 CPUs), but AMD gives each chip its own memory controller and memory - scales perfectly. (By the way, this isn't new ... big iron (e.g. Sparc) has been doing this for years).

Currently, the fastest FSB to date is 1033MHz - almost 1/3 of the max clock speed of the processor. Given that Intel's integer units operate at twice the clock speed, the fastest parts of the chip operate at 6 times faster than memory.

That's why modern processors use pipelining (in x86, since 486's) and caches (since, uh, 8086s ?). FSB only comes into play in 1-2% of the memory accesses. But those memory accesses are pipelined, interleaved, with multiple outstanding requests issued by the out-of-order pipeline ... processor designers have been working around a slow bus for years, and the FSB is only the bottleneck in extreme, pathological cases.
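
The claim that the FSB matters for only 1-2% of accesses can be turned into a rough average-latency estimate. This is a sketch under stated assumptions: the 2-cycle hit time and 200-cycle miss penalty are hypothetical round numbers, and the formula is the usual average memory access time (hit time plus miss rate times miss penalty), which ignores the overlapping of misses that the post also mentions.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time plus the amortized miss cost."""
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    for miss_rate in (0.01, 0.02, 1.0):
        cycles = average_access_cycles(2, miss_rate, 200)
        print(f"miss rate {miss_rate:.0%}: about {cycles:.1f} cycles per access")
```

With these made-up numbers, a 1-2% miss rate keeps the average within a few cycles of the cache hit time, while going to memory on every access would cost the full penalty each time.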

The monolithic, synchronous, central-processing-unit design of the architecture prohibits optimizations such as using memory controllers for block moves and having dedicated IO processors

Ever heard of DMA? A DMA controller does that memory transfer ... there are 2 DMA controllers with 8 channels on your current x86 PC. Heck, high-end PCI cards even have their own onboard DMA engines (it's called bus-mastering). I/O offload? You've obviously never written a device driver... modern drivers issue a few "start" instructions, then sleep; eventually the device completes the I/O and issues an interrupt to inform the CPU it's done. The last computer I had that stalled on disk I/O was running MS-DOS - nine years ago.
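
The start-then-sleep-until-interrupt pattern described above can be mimicked in user space. This is only an analogy, not driver code: a thread stands in for the device doing its own transfer, a `threading.Event` stands in for the completion interrupt, and all function names are hypothetical.

```python
import threading
import time


def device(transfer_seconds, buffer, done):
    """Stand-in for a device moving data on its own (DMA analogy)."""
    time.sleep(transfer_seconds)
    buffer.append("block 0")
    done.set()                      # analogy for the completion interrupt


def read_block(transfer_seconds=0.05):
    buffer, done = [], threading.Event()
    worker = threading.Thread(target=device,
                              args=(transfer_seconds, buffer, done))
    worker.start()                  # the "start" command issued to the device
    other_work = sum(range(10_000))  # the CPU is free to do something else
    done.wait()                     # the driver sleeps until the "interrupt"
    worker.join()
    return buffer[0], other_work


if __name__ == "__main__":
    print(read_block())
```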

In all fairness, I thought exactly the same things four years ago. Then I learned about modern computer architecture. And in today's world (and, in fact, all PCs for the past ten years), your points are completely - and utterly - irrelevant.

--

A witty [sig] proves nothing. --Voltaire

Re:what's the diff: dual core and hyperthreading? by RockyMountain · 2004-08-26 18:05 · Score: 2, Informative

Note that cache sizes have fluctuated around 256-512 kb since the P2 days. My P2 and P4 both have 512 kb. I'd be shocked if the reason was something other than that being a sweet spot.

Sorry, I live in a 64-bit world, to the point that I'm quite ignorant of X86 state of the art. I've been blindly (and wrongly) assuming a 64-bit context for this whole conversation.

Your posting reminded me that caches of only 512K still exist! Montecito has 24M between 2 cores. Also, re-reading your posts in the context of 32-bit systems, they now make much more sense to me. X86 die aren't the same huge monsters that I'm used to. No wonder you and I have different views about yield cost tradeoffs -- we live on different parts of the curve.

Unfortunately, most of what I know about Itanium I really can't talk about, other than in very general terms -- assuming I wish to remain employed, that is. :)

The "sweet spot" for cache size is really determined by a race between core performance, and memory latency/bandwidth. Doubling cores doubles the data production/consumption rate. Doubling the frequency also does. The former is less demanding on memory latency and mostly requires more bandwidth. The latter is equally demanding on both. If you double the data production/consumption, and keep the same memory/bus bandwidth, ideally, you'd like to halve the cache miss rate -- but that's pretty unlikely in practice. That's why caches keep growing (at least in the 64-bit world). There is a point of diminishing returns, because there's an upper limit to both temporal and spatial locality, but we're not quite there yet for Itanium.
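
The "halve the miss rate" remark follows from a simple traffic balance, sketched below. The model is an assumption of this sketch: memory traffic is taken as accesses per second times miss rate times line size, and the 64-byte line, the access rates, and the bandwidth budget are hypothetical numbers.

```python
def memory_traffic_gb_s(accesses_per_s, miss_rate, line_bytes=64):
    """Traffic that cache misses put on the memory bus."""
    return accesses_per_s * miss_rate * line_bytes / 1e9


def miss_rate_for_budget(accesses_per_s, budget_gb_s, line_bytes=64):
    """Highest miss rate that still fits in a fixed bandwidth budget."""
    return budget_gb_s * 1e9 / (accesses_per_s * line_bytes)


if __name__ == "__main__":
    budget = 6.4                                  # GB/s, held constant
    one_core = miss_rate_for_budget(1e9, budget)
    two_cores = miss_rate_for_budget(2e9, budget)  # doubled data consumption
    print(one_core, two_cores)
```

Doubling the access rate at a fixed budget cuts the tolerable miss rate in half, and since miss rates usually fall slowly as caches grow, that is one way to see why the caches have to get much bigger.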

When more CPUs start to have integrated memory controllers and point-to-point links instead of multi-drop busses, I predict that cache sizes per core will actually decline for a while, since the memory performance side of the balance will lower the "sweet spot". After that, caches will probably creep slowly upwards again, because no memory or interconnect technology ever scales fast enough to keep up with CPU core performance scaling.

So the $64 question is... When cache sizes per core start to decline or level off, will we see smaller die, or will we see even more cores per die? The way Intel seems to always position Itanium for high-end heavy metal, I expect huge die with more cores, although for X86 they went the other way. Or so I infer from your posting.

Of course, I'm no more thrilled with a 200 watt CPU than anyone else, but that's what you get with a two CPU system anyway.

Not sure which CPU you're referring to here. Is it something in the X86 line? If you're taking a single core power and multiplying by two, then you may be very pleasantly surprised. Montecito, despite having two cores, will have significantly lower total power than its single-core predecessors. (That much I can safely reveal, because it seems to be common knowledge already.)

Having designed systems, I can tell you that difficulties arising from high power-per-socket are very non linear: 200W isn't merely twice as hard to deal with versus 100W. It's easily an order of magnitude more difficult to cool with the same MTBF reliability. Luckily, Intel have realised this, at least in their 64-bit line. Once again, I am too ignorant of the 32-bit world to know the state of the art there.
