Intel's 128MB L4 Cache May Be Coming To Broadwell and Other Future CPUs
MojoKid writes "When Intel debuted Haswell this year, it launched its first mobile processor with a massive 128MB L4 cache. Dubbed "Crystal Well," this on-package (not on-die) pool of memory wasn't just a graphics frame buffer, but a giant pool of RAM for the entire core to utilize. The performance impact from doing so is significant, though the Haswell processors that utilize the L4 cache don't appear to account for very much of Intel's total CPU volume. Right now, the L4 cache pool is only available on mobile parts, but that could change next year. Apparently Broadwell-K will change that. The 14nm desktop chips aren't due until the tail end of next year but we should see a desktop refresh in the spring with a second-generation Haswell part. Still, it's a sign that Intel intends to integrate the large L4 as standard on a wider range of parts. Using EDRAM instead of SRAM allows Intel's architecture to dedicate just one transistor per cell instead of the 6T configurations commonly used for L1 or L2 cache. That means the memory isn't quite as fast but it saves an enormous amount of die space. At 1.6GHz, L4 latencies are 50-60ns which is significantly higher than the L3 but just half the speed of main memory."
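For a rough sense of the one-transistor-versus-six claim in the summary, here is a back-of-the-envelope sketch. It counts only storage-cell transistors for a 128MB array and ignores tags, ECC, sense amplifiers, and the eDRAM capacitors. The function name is made up for illustration, and "128MB" is assumed to mean 128 MiB.

```python
def cell_transistors(capacity_mib: int, transistors_per_bit: int) -> int:
    """Transistors needed for the storage cells alone (no tags, ECC, or periphery)."""
    bits = capacity_mib * 1024 * 1024 * 8
    return bits * transistors_per_bit


# 1T per bit for eDRAM and 6T per bit for SRAM, as stated in the summary.
edram = cell_transistors(128, 1)
sram = cell_transistors(128, 6)

print(f"eDRAM (1T cells): {edram:,} transistors")
print(f"SRAM  (6T cells): {sram:,} transistors")
print(f"extra transistors for SRAM: {sram - edram:,}")
```

Cell area does not scale exactly with transistor count, so treat this as an order-of-magnitude illustration rather than a die-size estimate.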
I have a Retina MacBook Pro with this Crystal Well processor. What advantages does it really bring?
Unsure of any real world benchmarks compared to standard Haswell processors.
because of the 128mb cache, i was fast enough to get 1st post!
. . .that Broadwell broad, well, is a broad well into which you could throw your entire career.
Just say no, David.
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
"At 1.6GHz, L4 latencies are 50-60ns which is significantly higher than the L3 but just half the speed of main memory."
WTF? The correct wording would be, I think, half the latency of main memory...
At 1.6GHz, L4 latencies are 50-60ns which is significantly higher than the L3 but just half the speed of main memory.
Hmmm. L4 cache runs at half the speed of main memory? That doesn't seem right. Why bother reading these summaries? The people posting them certainly don't.
Broadwell represents a miniaturization step from 22 to 14 nm structures. Why do they keep the capacity of the Crystalwell L4 cache at 128 MB? They could put twice that memory onto a die with the same area as the 22 nm Crystalwell version. Is the Crystalwell die for the Haswell CPUs so large and expensive that they have to reduce its size?
With this 128MB cache, shouldn't this CPU be able to run an OS like Win95 or an older Linux without additional memory?
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
I want 1 gig of L1, L2, L3 cache!
Memory is dirt cheap right now. Let's cram it into every space we have!
I can only attribute this making sense to a really bad memory interface.
A CPU with a 1.6GHz clock will definitely have high memory latency, so how exactly does that point to a bad memory interface? I also assume that since it's a mobile part, the memory probably isn't super fast and low latency to begin with. You sound like an AMD FANBOY to me. Intel has no need at all to be desperate; they could take a year off and still be in the lead.
It's in the same package, but not made in the same silicon or process. The package contains several pieces of silicon. Look at it as a miniature circuit board with several individual chips on it.
I was promised a flying car. Where is my flying car?
This is making me feel old as I recall how happy I was to have once maxed a board with 32 MB of RAM, a previous one with 8 MB, another with 4 MB and so on. I love that about technology, it pretty much always gets better until DRM and politics get into the mix...
128MB L4 cache. [...] on-package (not on-die) pool of memory
What this means is that the memory is not on the same piece of silicon as the CPU, just stuffed in the same chip package. This means they have to be connected by a lot of tiny wires instead of being integrated directly. The downside is that the bandwidth between the L4 memory and the CPU is very limited and it uses more power. Like AMD's first APUs, which were just two ICs on the same chip, I do not think this will result in a drastic performance improvement, but I'm unsure of the power savings. If AMD gets wise, they will beat Intel to the punch, but then again... Though if AMD is really smart, they would put out ARMv8 chips not just for servers (/desktops?) but for smartphones/tablets and laptops.
Anons need not reply. Questions end with a question mark.
You're weird, man. One of the sillier trolls on here. I wonder why you do it and what you're like IRL.
I'll admit that I don't keep up with all the latest and greatest chipsets and specs but something seems wrong here. I remember back in the 1990's that you could get FPM (and later EDO) DIMMS that were 60ns. This article is saying that this new L4 cache is 60ns implying that the latest DDR3-whatever has a latency of 120ns. (Assuming half the speed means half the latency).
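The arithmetic behind that reading can be sketched as follows. It assumes, as the comment does, that "half the speed" was meant as "half the latency", and it converts nanoseconds into cycles of the 1.6GHz clock quoted in the summary. The helper names are hypothetical.

```python
L4_LATENCY_NS = (50.0, 60.0)  # range quoted in the summary
CLOCK_GHZ = 1.6               # clock quoted in the summary


def implied_main_memory_ns(l4_ns: float, ratio: float = 0.5) -> float:
    """Main-memory latency implied if L4 latency is `ratio` times main memory's."""
    return l4_ns / ratio


def ns_to_cycles(ns: float, clock_ghz: float) -> float:
    """A clock at N GHz ticks N times per nanosecond."""
    return ns * clock_ghz


for l4_ns in L4_LATENCY_NS:
    dram_ns = implied_main_memory_ns(l4_ns)
    cycles = ns_to_cycles(l4_ns, CLOCK_GHZ)
    print(f"L4 {l4_ns:.0f} ns = {cycles:.0f} cycles at {CLOCK_GHZ} GHz, "
          f"implied main memory {dram_ns:.0f} ns")
```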
I don't know, man. I think it's the same guy with the fetid rectum. I'd stay away.
All my algorithm development so far assumes small local caches.
Now I can start over again.
Aaaahhh!!!
Add moar cache to fix cpu problems
I may not get the speed out of the caches, but when you consider how much RAM is utilized in your laptop, smartphone, etc., this is actually a smart move. More room means a better way to utilize the RAM, allowing other opportunities to exist.
At 1.6GHz, L4 latencies are 50-60ns which is significantly higher than the L3 but just half the speed of main memory.
Don't you mean "but less than half the latency of main memory?"
So as soon as i get one of these, i won't need any DRAM anymore, since 128MB is way more than my typical memory footprint (including kernel and X11)
I do look forward to this.
CLI paste? paste.pr0.tips!
first they added more cores, now they're adding more cache. What's next? Integrated chipset or DRAM?
All these are cheap and worthless improvements! We need faster CPUs - 8GHz for single-core this year and 16GHz next year!
Come on, Intel, the last time you did this to us was with the off-die Pentium 2 L2 cache, and only the Xeon had the full-speed cache. Then you made the Celerons not have the cache, and the performance gap between the three was substantial.
I'm not saying we shouldn't have it, but I am saying that history repeats itself. The first chips that have it will have it, and then either it will go away or it will be integrated into the CPU die, assuming we can get another die shrink (it's predicted that 14nm or 12nm may be where there's no longer any ROI on die shrinking with Si).
POWER8, anyone? With actual SMT instead of flakey HT, and lots more threads, and so on, and so forth.
Too bad they're unobtanium, and if not, cost too much. But otherwise... anything Intel does has basically been done better before. Except process. That is the only thing they really lead with. The rest isn't half as interesting as most of the world makes it out to be.
Even the 6MB of L3 that modern processors have is larger than the entire system memory of our parents' first computers.
A 6 MB L3 cache is bigger than the RAM in the PlayStation, Nintendo 64, or Nintendo DS. A 128 MB L4 cache would surpass the RAM in a PlayStation 2 and an original Xbox combined. You don't need a lot of DDR to play DDR, even if you live in the former DDR.
You can get this today, but it's not as flexible as you might wish:
- You currently can only get it in a high-end i7 laptop. Desktop and low-end laptop i7 chips don't have it.
- It's only active when the GPU is not used, so you need a discrete GPU in your system, and it has to be on all the time.
- You can't use this as system memory or whatever (as some of the other comments were hoping for.) All it ever stores are the flushed misses from the L3 cache.
- It massively increases the memory working set, which can benefit some algorithms (e.g. physics simulations, software H.265 encoding) enormously. See graph over at Anandtech:
http://images.anandtech.com/doci/6993/latency.png
The high-end desktop chip is predictably left in the dust in the 8MB-128MB range. Whether this trumps its other advantages is probably only true for a few algorithms.
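A crude way to see why the 8MB-128MB range benefits is to model the average cost of an L3 miss as a blend of L4 hits and trips to main memory. This is only a sketch: the 55ns figure is the midpoint of the 50-60ns range in the summary, the 110ns main-memory figure assumes the "half the latency" reading, the hit rates are invented, and an L4 miss is assumed to cost just the main-memory latency with no extra lookup time.

```python
def avg_l3_miss_penalty_ns(l4_hit_rate: float,
                           l4_ns: float = 55.0,
                           dram_ns: float = 110.0) -> float:
    """Average latency of an L3 miss when a fraction of misses hit in L4."""
    if not 0.0 <= l4_hit_rate <= 1.0:
        raise ValueError("l4_hit_rate must be between 0 and 1")
    return l4_hit_rate * l4_ns + (1.0 - l4_hit_rate) * dram_ns


# Hypothetical hit rates: 0.0 behaves like a chip with no L4 at all.
for hit_rate in (0.0, 0.5, 0.9, 1.0):
    penalty = avg_l3_miss_penalty_ns(hit_rate)
    print(f"L4 hit rate {hit_rate:.0%}: average L3 miss costs {penalty:.1f} ns")
```

A working set that fits in 128MB pushes the hit rate toward 100%, and one far larger than 128MB pushes it back toward 0%, which is roughly the shape of the curve in the linked graph.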
I actually have it active in the Haswell laptop I'm typing this on, but it's an uncommon setup. To get one, go to the Apple store, select the highest-end Retina MacBook Pro (the only one that still has discrete graphics) and click the processor upgrade to 2.6GHz so that you end up with an i7-4960HQ (the 2.3 chip might have it too, not sure). Then go to the Energy Saver Preferences and turn off Automatic Graphics Switching so that the discrete GPU is on all the time.
Formerly, L4 cache was main memory, a cache for the L5 (disk) and L6 (network). This new L4 cache pushes main memory, disk, and network out to L5, L6, and L7 respectively.
Did your 128 MB laptop continue to run Windows XP well even after having installed the service packs that increased how much RAM it uses? Even under Windows 2000, printing certain documents filled RAM on my old 128 MB desktop PC.
With Intel's 14nm so close, and 10nm production in another year or so, they need to use all that chip area for something that doesn't necessarily generate a ton of heat. RAM is the perfect thing. Not only is the power consumption relatively disconnected from the size and density of the cache, but not having to go off-chip for a majority of memory operations means that the external dynamic ram can probably go into power savings mode for most of its life, reducing the overall power consumption of the device.
-Matt
According to the summary, L4 cache has 50-60ns latency, and is half the latency of main memory (presumably 100-120ns).
The summary is bad, because it gives the impression that the 70ns static column ram that comprised the main system ram on an Amiga was almost as fast as today's slower cache ram, and had almost double the performance of DDR3.
The truth is that the 50-60ns latency (vs 70ns, vs 100-120ns) is "time to fetch first arbitrary byte at some arbitrary address". However, at best, that tells (statistically) less than 25% of the story, because CPUs don't fetch single arbitrary bytes from single arbitrary locations. At the very least, they're usually going to grab at least 4 sequential bytes, if not WAY more. And that's where the difference comes in. The smallest meaningful benchmark would be more like, "how many nanoseconds does it take to fetch 16 or 32 consecutive bytes from an arbitrary address in ram" (4 or 8 bytes for the opcode, 4 or 8 bytes for the argument, assuming one pair that actually does something related to fetching/storing/arithmetic, and another that makes a branch decision based upon it(*)).
Going by the "fetch 16 or 32 bytes" benchmark, even slow 120ns DDR3 is going to completely smoke 70ns SCR, because SCR read bytes 2 through 32 just 8 (later, 16 or 32) bits at a time with a clock rate of (at best) 32MHz. In contrast, the 120ns DDR3 transfers the sequential bytes at a rate equivalent to a 32, 64, or 128-bit bus with 100 or 133MHz clock rate (as I understand it, the 800MHz, 1600MHz, and higher insane-level speeds came about because they reduced the number of physical traces and serialized 8 bits into a pair of LVD traces (so 800MHz is roughly equivalent to 8x100MHz), then started doing "Atari Math" (deciding that 4 800MHz serial links are "3200MHz" by "doing the math" and adding them up to get a bigger number)).
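The "fetch 16 or 32 bytes" comparison above can be put into a toy model: total time is the first-access latency plus one bus transfer for each remaining chunk. The numbers plugged in are the ones from this comment (70ns SCR on an 8-bit bus at 32MHz, 120ns DDR3 treated as a 64-bit bus at 133MHz). The function and its parameters are made up for illustration, and real DRAM timing (row activation, burst length, prefetch) is far more involved.

```python
import math


def fetch_time_ns(n_bytes: int, first_access_ns: float,
                  bus_bits: int, transfer_mhz: float) -> float:
    """Time to read n_bytes sequentially: first-access latency plus remaining transfers."""
    bytes_per_transfer = bus_bits // 8
    transfers = math.ceil(n_bytes / bytes_per_transfer)
    ns_per_transfer = 1000.0 / transfer_mhz
    # The first transfer is assumed to be covered by the first-access latency.
    return first_access_ns + (transfers - 1) * ns_per_transfer


scr = fetch_time_ns(32, first_access_ns=70.0, bus_bits=8, transfer_mhz=32.0)
ddr3 = fetch_time_ns(32, first_access_ns=120.0, bus_bits=64, transfer_mhz=133.0)

print(f"70ns SCR, 8-bit @ 32MHz:     {scr:.1f} ns for 32 bytes")
print(f"120ns DDR3, 64-bit @ 133MHz: {ddr3:.1f} ns for 32 bytes")
```

Under these assumptions the part with the worse first-byte latency still finishes the 32-byte fetch several times sooner, which is the point being made about first-byte latency telling only part of the story.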
That said, from what I recall, the performance of system ram on mainstream PCs has basically stagnated since DDR. It's gotten enormously cheaper to IMPLEMENT, but a modern workstation-class PC (say, Dell Precision or higher) with DDR3 can physically fetch 4 arbitrary blocks of 32 bytes from main system ram in *maybe* 70-80% of the time it took a comparable workstation back in the DDR era (at least, from the perspective of a single-threaded app... obviously, dual/3-channel memory, multi-core norms, and SMP-aware software could change the equation a bit if you're talking about OVERALL system performance).
(*)Yes, I know realmode opcodes aren't 4 bytes... but realmode opcodes are almost irrelevant to anything compiled for x86 or AMD64 under Windows or Linux, anyway.
No one ever needs more than 640KB. :P
I do not fail; I succeed at finding out what does not work.
Let's face it, SRAM isn't tiny, it leaks amps like a sieve at the tiny process size that everything is done at nowadays, and its main advantage is that it doesn't take a controller to access and it's bloody fast and the bandwidth can be pretty sizable.
Then perhaps MoSys had the right idea: make a bunch of small, independent DRAM blocks and a front-end controller with as much SRAM as one block to hold cached results while waiting for the corresponding DRAM row to refresh.
Two points: overclocking memory sometimes requires increasing the voltage, but store-bought faster memory most of the time uses no extra power; it could even use less because current needs to flow for less time. And my own testing with CPUs having different amounts of L2 cache suggests that idle consumption increases with cache size. Speed also increases, but probably only with small cache sizes (going from 256KB to 512KB L2 gave 3% quicker compiles with GCC; with modern cache sizes the difference would be negligible).
So I don't see where this potential improvement will be realized because compiling is surely a heavy user of cpu and memory.
Actually it is the other way round: The slower the CPU, the faster the memory interface in comparison and the less need for caches. What Intel does here makes no sense unless they are covering up an architectural problem. Memory clock is _not_ tied to CPU clock in a sane architecture. Otherwise you would need to buy memory by CPU clock. You do not need to do that for Intel or AMD.
Also, desperation engineering-wise has not necessarily any connection to business-desperation. And remember that Intel messed up very badly before, just look at the Pentium IV that was born out of deep engineering desperation, was slow, excessively power-hungry and could really only be scrapped later on. But I guess thought on this level is beyond you.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.