Intel's 128MB L4 Cache May Be Coming To Broadwell and Other Future CPUs

Posted by timothy on Friday November 22, 2013 @11:46PM from the now-read-some-old-prices-and-get-offa-my-lawn dept.

MojoKid writes "When Intel debuted Haswell this year, it launched its first mobile processor with a massive 128MB L4 cache. Dubbed "Crystal Well," this on-package (not on-die) pool of memory isn't just a graphics frame buffer, but a giant pool of RAM for the entire core to utilize. The performance impact is significant, though the Haswell processors that use the L4 cache don't appear to account for very much of Intel's total CPU volume. Right now, the L4 cache is only available on mobile parts, but Broadwell-K will apparently change that next year. The 14nm desktop chips aren't due until the tail end of next year, but we should see a desktop refresh in the spring with a second-generation Haswell part. Still, it's a sign that Intel intends to integrate the large L4 as standard on a wider range of parts. Using eDRAM instead of SRAM allows Intel to dedicate just one transistor per cell instead of the 6T configurations commonly used for L1 or L2 cache. That means the memory isn't quite as fast, but it saves an enormous amount of die space. At 1.6GHz, L4 latencies are 50-60ns, which is significantly higher than the L3 but only about half the latency of main memory."
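
To put the 1T-versus-6T point in numbers, here is a rough back-of-the-envelope sketch. It counts storage cells only and assumes one cell per bit; tags, ECC, sense amplifiers and the eDRAM cell's capacitor are all ignored, so treat the result as an order-of-magnitude illustration rather than a real transistor budget.

```python
# Rough cell-count comparison for a 128MB cache.
# Assumption: one storage cell per bit; tags, ECC and peripheral logic are ignored.

CACHE_BYTES = 128 * 1024 * 1024   # 128MB, as stated in the summary
BITS = CACHE_BYTES * 8            # one storage cell per bit


def storage_transistors(bits: int, transistors_per_cell: int) -> int:
    """Transistors needed for the storage cells alone."""
    return bits * transistors_per_cell


edram = storage_transistors(BITS, 1)   # 1T eDRAM cell (the capacitor is not counted)
sram = storage_transistors(BITS, 6)    # conventional 6T SRAM cell

print(f"eDRAM (1T): {edram:,} transistors")
print(f"SRAM  (6T): {sram:,} transistors")
print(f"SRAM needs {sram // edram}x as many transistors for the same capacity")
```

The ratio is just the 6:1 cell ratio, but the absolute counts show why a 128MB SRAM array would be impractical next to a CPU.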

Re:So in the real world? by SimonTheSoundMan · 2013-11-23 00:19 · Score: 4, Informative

The only benchmarks I have found are from SiSoftware. http://www.sisoftware.co.uk/?d=qa&f=mem_hsw
But how is this going to affect Firefox, Photoshop, or video conversion?
Does it have an effect on battery life?
Re:So in the real world? by fuzzyfuzzyfungus · 2013-11-23 00:38 · Score: 5, Informative

At least as marketed, the main advantage is giving the GPU some RAM that isn't DDR3 stolen from the main system a couple of hops away. That shared memory has traditionally been one of the things that make integrated graphics really suck, and cheap discrete parts that use DDR instead of GDDR, and/or an excessively narrow or slow memory bus, kind of suck for the same reason.

Even Intel's marketing optimists don't say much about CPU performance. It's also a mobile-only feature: you can't even buy a non-BGA part expensive enough to have it. That would be unusual if it actually improved CPU performance enough to get enthusiasts worked up, but it is downright sensible if the target market is laptops too size/power constrained to have discrete GPUs, where pure shared memory was dragging GPU performance down.
Re:So in the real world? by muridae · 2013-11-23 01:23 · Score: 5, Informative

Photoshop? Considering Adobe RGB or other color spaces, combined with the file sizes of some of the larger images coming out of cameras, your gains in latency would really depend on Photoshop and the OS being able to handle the L4 cache and keep the right part of the image in it. Video editing, with file sizes into the gigabyte range, would probably see no gains at all. Video conversion, with a program that keeps a reasonably sized buffer, should see a good performance gain; but it would require code that knows the L4 is available, or an OS that can predict that L4 is a good place to put a 10-50-100MB buffer. The real gain will be in common things: playing a video, browsing the web (seen how much memory a bit of JavaScript or the JRE can eat up lately? Or Silverlight/Flash?) and email clients (cache all your email in L4 for faster searching).
As for battery life, I have no idea. It might use more power, since DRAM requires constant power to refresh data where SRAM is pretty stable; but the lower leakage of using a single transistor instead of 6 might prove to be a benefit. It would take a good bit of time and some pretty good test code to figure the difference, I suspect.
Re:Ours goes to 11 by muridae · 2013-11-23 01:46 · Score: 4, Insightful

Let's see, the tiny amount of L1/2/3 cache currently is dictated by the energy budget of the CPU. Looking at the energy budget of the 4900MQ and the 4960HQ chips, you can take some wild arse guessing to get that the 2 megs of L3 cache sacrificed got back enough to power the 128 megs of L4. Then consider that there is only 64K (yes, kilobytes) of L1 or 256K L2 per core on the Haswell chips, and at 3.9GHz desktop chips you are looking at 84 watts of power dissipated . . . you can start to work out how much of that is due to leakage current from the 6 transistor L1/2/3 cache design.
Let's face it, SRAM isn't tiny, it leaks amps like a sieve at the tiny process size that everything is done at nowadays, and its main advantages are that it doesn't take a controller to access, it's bloody fast, and the bandwidth can be pretty sizable. A gig of SRAM on die would, I suspect, heat a small room; that much DRAM per core would slow the cores down due to the inherent latency of accessing DRAM.
So, sure, DRAM chips may be cheap, but putting them on the CPU die would be horrid. And SRAM still isn't cheap; either in die space, energy budget, or dollars!
Re:So in the real world? by SuricouRaven · 2013-11-23 01:51 · Score: 5, Interesting

Cache performance impact is very heavily dependent upon application characteristics. Specifically, the active memory set.
Best case, when you're working with an active set that's larger than L3 but under L4 - around 100MB or so - and you're accessing it on a repeating pattern, and the compiler hasn't found any tweaks to help, and you're not multitasking, and the OS isn't swapping you out every slice, and the stars are aligned in your favor... the theoretical maximum performance gain can be up to 2x. It's very rare you'll find a program that benefits that much, though. Closest I can think of is image processing.
So in the real world, anywhere from 'no benefit' to 'double the speed' depending on application.
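
A simple latency model shows where a "no benefit to 2x" range can come from. This is only a sketch under stated assumptions: L4 latency is taken as 55ns (the midpoint of the 50-60ns figure in the summary), main memory as 110ns (inferred from the summary's "about half" claim), an L4 miss is assumed to cost exactly the DRAM latency, and `memory_bound_fraction` is a hypothetical knob for how much of the run time is spent waiting on accesses that miss L3.

```python
# Amdahl-style estimate of the speedup from putting an L4 in front of DRAM.
# All figures are assumptions taken or inferred from the story summary.

L4_NS = 55.0      # midpoint of the quoted 50-60ns L4 latency
DRAM_NS = 110.0   # assumed: the summary puts L4 at about half of main memory


def miss_latency_ns(l4_hit_rate: float) -> float:
    """Average latency of an access that missed L3.

    Simplification: an L4 miss costs just the DRAM latency, with no extra
    penalty for the failed L4 lookup.
    """
    return l4_hit_rate * L4_NS + (1.0 - l4_hit_rate) * DRAM_NS


def speedup(l4_hit_rate: float, memory_bound_fraction: float) -> float:
    """Speedup relative to the same workload with no L4 at all."""
    mem_ratio = miss_latency_ns(l4_hit_rate) / DRAM_NS
    new_time = (1.0 - memory_bound_fraction) + memory_bound_fraction * mem_ratio
    return 1.0 / new_time


for hit_rate, mem_fraction in [(1.0, 1.0), (0.95, 0.5), (0.95, 0.1), (0.0, 1.0)]:
    print(f"L4 hit rate {hit_rate:.2f}, memory-bound {mem_fraction:.0%}: "
          f"{speedup(hit_rate, mem_fraction):.2f}x")
```

The 2x ceiling appears only when every L3 miss hits L4 and the program does nothing but wait on memory; anything less compute-bound or less cache-friendly lands well below that.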
not on die by Gravis+Zero · 2013-11-23 02:00 · Score: 5, Informative

128MB L4 cache. [...] on-package (not on-die) pool of memory
what this means is the memory is not on the same piece of silicon as the CPU, just stuffed in the same chip package. this means they have to be connected by a lot of tiny wires instead of being integrated directly. the downside to this is that the bandwidth between the L4 memory and the CPU is very limited and it uses more power. like AMD's first APUs, which were just two ICs on the same chip, i don't think this will result in a drastic performance improvement, but i'm unsure of the power savings. If AMD gets wise, they will beat Intel to the punch but then again. though if AMD is really smart, they would put out ARMv8 chips not just for servers(/desktops?) but for smartphones/tablets and laptops.

--
Anons need not reply. Questions end with a question mark.
1. Re:not on die by lenski · 2013-11-23 03:20 · Score: 5, Informative
  
  what this means is the memory is not on the same piece of silicon as the CPU, just stuffed in the same chip package.
  Which allows the designers to count on carefully controlled impedances, timings, seriously optimized bus widths and state machines, and all the other goodies that come with access to internal structures not otherwise available.
  Such a resource could, if used properly, be a significant contributor to performance competitiveness.
Re:Why only 128 MB? by Kjella · 2013-11-23 02:32 · Score: 5, Informative

Broadwell represents a miniaturization step from 22 to 14 nm structures. Why do they keep the capacity of the Crystalwell L4 cache at 128 MB? They could put twice that memory onto a die with the same area as the 22 nm Crystalwell version. Is the Crystalwell die for the Haswell CPUs so large and expensive that they have to reduce its size?
From Anandtech's article on Crystalwell:

There's only a single size of eDRAM offered this generation: 128MB. Since it's a cache and not a buffer (and a giant one at that), Intel found that hit rate rarely dropped below 95%. It turns out that for current workloads, Intel didn't see much benefit beyond a 32MB eDRAM however it wanted the design to be future proof. Intel doubled the size to deal with any increases in game complexity, and doubled it again just to be sure. I believe the exact wording Intel's Tom Piazza used during his explanation of why 128MB was "go big or go home". It's very rare that we see Intel be so liberal with die area, which makes me think this 128MB design is going to stick around for a while.

I get the impression that the plan might be to keep the eDRAM on a n-1 process going forward. When Intel moves to 14nm with Broadwell, it's entirely possible that Crystalwell will remain at 22nm. Doing so would help Intel put older fabs to use, especially if there's no need for a near term increase in eDRAM size. I asked about the potential to integrate eDRAM on-die, but was told that it's far too early for that discussion. Given the size of the 128MB eDRAM on 22nm (~84mm^2), I can understand why. Intel did float an interesting idea by me though. In the future it could integrate 16 - 32MB of eDRAM on-die for specific use cases (e.g. storing the frame buffer).
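
For a sense of scale, the quoted figures (128MB in roughly 84mm^2 at 22nm) can be turned into a rough area estimate for the 16 - 32MB on-die idea. This is a sketch that assumes area scales linearly with capacity on the same process, which ignores fixed overhead such as the interface and control logic, so the real numbers would differ.

```python
# Rough area estimate from the quoted 22nm eDRAM figures.
# Assumption: area scales linearly with capacity on the same process.

EDRAM_MB = 128.0
EDRAM_AREA_MM2 = 84.0   # approximate die area quoted above


def area_mm2(capacity_mb: float) -> float:
    """Estimated die area for a given eDRAM capacity at the same density."""
    return capacity_mb * EDRAM_AREA_MM2 / EDRAM_MB


density = EDRAM_MB / EDRAM_AREA_MM2
print(f"density: {density:.2f} MB per mm^2")
for capacity in (16, 32, 128):
    print(f"{capacity:>3}MB -> about {area_mm2(capacity):.1f} mm^2")
```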

--
Live today, because you never know what tomorrow brings