The Fundamentals Of Cache
Dave wrote to us with an article currently running on SystemLogic that delves
into caching, with specific examples taken from the Athlon and PIII processors. It also talks about the different types of cache - fairly technical, but an all around good read.
You cheeky man you!
In some ways you are demonstrating the effect of the cache, yes. However you assume 2-way associativity. Some processors are 4-way (increase the number of accesses to be 5 with the same low 12 bits).
_BUT_ you're also forcing the "cache unfriendly version" to have several other speed problems:
- It isn't fair to certain processors as there are restrictions on using addresses with the same low 4 address bits back-to-back (original Pentium certainly).
- You've chosen to access non-aligned data (accessing 16 bit values on an odd, and cache line spanning, address)
References:
Abrash - The Zen of Assembly Language Optimisation
Agner Fog - Pentopt (www.agner.org methinks)
FatPhil
Also FatPhil on SoylentNews, id 863
Modern processors have all sorts of things to deal with latency. Branch prediction, multiple issue slots, out-of-order execution, etc. However, this hardware generally requiers a group of instructions to work on, the more the better. Keeping a large number of instructions around for the processor to use requires more bandwidth.
That being said, when you have adequate bandwidth, latency does become more of a problem. That's why chip multiprocessors (CMPs) and Simultaneous Multithreading Processors (SMTs) are becoming a large focus of current processor research.
--
"You can put a man through school,
But you cannot make him think."
"You can put a man through school,
But you cannot make him think."
Ben Harper
From another perspective, a level of cache cuts the bandwidth requirement by about an order of magnitude. (VERY rough approximation, I agree.) In the 386 world, we saw L1 cache beginning at about 25MHz, though there were some uncached 33MHz designs. When we moved to the 486, in addition to the L1 being moved on-chip, the cycles-per-instruction improved, so it worked faster at a given clock speed, and required a better memory system behind it. So round about 66MHz, we saw the L2 cache start becoming normal. This continued with Pentium, though there were a few miserable failures that attempted to do without the L2.
At the same time, main memory is improving. Straight full-RAS-access begat page mode, which begat fast-page mode, which begat EDO, which bagat SDRAM, which begat DDR. But at the same time as main memory moved from 150nS access, 300nS cycle to 40nS access, 70nS cycle, with 7.5nS burst rate, CPUs have moved from 4.77MHz to 1.1 GHz.
I've already heard of an experimental Micron Northbridge chip that incorporates a bunch of fast DRAM-based L3 in otherwise unused area. Northbridge chip sizes are driven by pincount, not circuit area. They are prone to waste silicon just to get enough I/O pins bonded out.
I anticipate the widespread deployment of L3, any month now.
The living have better things to do than to continue hating the dead.
It isn't fair to certain processors as there are restrictions on using addresses with the same low 4 address bits back-to-back (original Pentium certainly).
True - this example is meant to show how the Pentium (PPLain in Agner's docs) can be tripped up.
- You've chosen to access non-aligned data (accessing 16 bit values on an odd, and cache line spanning, address)
Very deliberately I might add :) The point I was making was that cache considerations play a big role in optimising inner loops. The above code is an example written by Niklas Beisert a.k.a Pascal of Cubic Team to show the effects of cache misses when doing snazzy bitmap effects.
--- Hot Shot City is particularly good.
Latency is important to the extent that it limits bandwidth. As other posters point out, modern CPUs have many mechanisms for dealing with latency to a certain extent. (Some, such as Alpha, go to Herculean extremes with a gigantic reorder buffer and a cache which allows four or five outstanding misses to pend while still allowing hits in the cache.)
In the end, the raw amount of work performed is measured in terms of bandwidth. To process N items, you need to touch N items, and how quickly you touch those N items is expressed as bandwidth.
As a person who programs a deeply pipelined CPU, I can attest that latency can affect some algorithms (especially general purpose control algorithms) more than others, since it limits how quickly you can process a given non-bandwidth-limited task. However, for raw calculation (eg. all those graphics tasks and huge matrix crunching tasks the numbers folks like to run), those tasks are fairly latency tolerant and just need bandwidth.
This is why number-crunching jobs might work well with, say, RAMBUS, but desktop jobs might work better with DDR SDRAM, even if the two are at the same theoretical bandwidth node.
--Joe --Joe--
Program Intellivision!
You can have directory mechanisms (that could add up to gigabytes of additional memory on larger systems, not used for storage) to quickly look up locations in a centralized place. You won't find this sort of thing on a desktop PC (usually a 1-by or 2-by, anyway), but on the larger machines (e.g. 8 processors or more, 16GB of RAM or more) it's definitely an issue.
/ \
\ / ASCII ribbon campaign for peace
x
/ \
It's true that most optimizations are not portable, but there is something that can be done about this. Properly ifdef'd code can be optimized for the appropriate iteration of the platform by any moderately well-designed compiler. You don't want to be using runtime variables more than you have to, and certainly not for memory requests for caching, but you can handle it in the source code, provided that you actually let your users have the code so they can compile it themselves.
WARNING: there is a trojan on your
But surely it's latency that is of primary importance and bandwidth takes second place?
At first I thought you were right, but the more I thought about it, the more I feel the article is correct.
The best way to think of this is on high end systems. Think of a Sun Ultra Enterprise 450 with a couple gigs of RAM, a couple processors and a bunch large, memory and CPU intensive programs running.
You have 20 processes each which needs a slice of CPU time. Each time a process runs on the CPU, parts of the process are copied from RAM to the CPU to execute. That processes executes for its specified time slice, the kernel stops it, copies the results back to RAM, and then does it again with another process. Now imagine this happening hunderds of times per second! The memory bus gets even more saturated with more processors since there are more RAM to CPU copies and vice-versa.
This is where cache comes in. Part of the program that just executed is kept in the cache so the next time it's time slice comes around (ie: it's time to run on the CPU), there won't have to be a copy from RAM to the CPU. The CPU simply grabs it out of it's cache, thus freeing up bandwidth on the memory bus.
I cache my hifi remote control on the bed-side table.
My girlfriend flushes the cache, and puts it back to next to the hifi (this leads me to believe she has twice as many X chromosomes as necessary)
In this process, whatever state the remote is in one thing remains constant - there is only _one_ remote control..
In the olden days, RAM caches were not like that.
The cached information was duplicated in all lower cache levels. But, fortunately, if you read on in the article, you get to this:
"
Exclusive cache designs mean that the information contained within one layer is not contained within the layer above it. In the Thunderbird and Duron, this means that the information in the L2 cache is not contained within the L1 cache, and vice versa.
"
Now _that's_ more like the paradigm I'm used to.
I've been told it's a fair sized win for AMD's chips, but I'll reserve judgement until I get my hands on one. However, I'll say in public - I am a believer...
FatPhil
Also FatPhil on SoylentNews, id 863
fast:
mov dx,12
l1:
mov cx,32768
l2:
mov ax,[0]
mov ax,[0]
mov ax,[0]
dec cx
jnz l2
dec dx
jnz l1
slow:
mov dx,12
l3:
mov cx,32768
l4:
mov ax,[4095]
mov ax,[8191]
mov ax,[12287]
dec cx
jnz l4
dec dx
jnz l3
Nearly identical loops, except the first one flies and the second one thrashes because it misses the cache 100% of the time. Can you say 10-50 times slower?
--- Hot Shot City is particularly good.
Of course you are right in that cache does increase available bandwidth and that this is a good thing, but latency really is the thing we want to cure. For the original Athlon (though I'm guessing this is one of the faster models with 1/3 L2 dividor), 24 clock cycles go to waste even if the data is on L2, and even that is considerably faster than DRAM. Add to this that x86 is a really architecture because almost every instrucion can (and will) reference memory at least once (in addition to the mandatory instruction fetch).
--
> Am I missing something??
Yes. The lines in the exclusive L2 got there because they were kicked out of L1 by a miss. But there's still a chance that they'll be needed again, soon. That's why they're kept in L2. When a line gets thrashed out of L1, but manages to stay in L2, it can be quickly gotten when L1 needs it, again.
The same is essentially true of a normal, 'inclusive' L2 cache.
The reason an exclusive L2 *now* makes sense is that they're both on the same chip! That's the fundamental difference, because now it's cheap in terms of silicon and circuitry to do the bookkeeping between L1 and L2. This lets you avoid storing a second copy of the data in L1.
The living have better things to do than to continue hating the dead.
The AS/400 works like this. There is only one address space. It's stored on the disk. (The address space is so freaking huge that no practical amount of disk could cover it.) You can think of RAM as a cache for the disk - when you want to work on something (program segment, database record, whatever), it gets read into memory so it's closer to the processor, but it's still sitting at its address, and it gets written back out if it gets changed (managed by the hardware).
...phil
...phil
"For a list of the ways which technology has failed to improve our quality of life, press 3."
All you need is cache!
All you need is cache, cache!
Cache is all you need!
Ok, that was lame. But it's 9am and I haven't had my caffiene yet. What do you want, Pavarotti?
No boom today. Boom tomorrow. There's always a boom tomorrow. - Cmdr. Susan Ivanova
The article starts out talking of "bandwidth starved" processors. But surely it's latency that is of primary importance and bandwidth takes second place?
God does not play dice with the universe. Albert Einstein
Those who fail to understand communication protocols, are doomed to repeat them over port 80.
Unhappy with the performance of someone elses memory hungry code (50-100MB working RAM footprimt) I wrote my own version of the utility. It was only marginally faster than the original. However, I increased the performance by a factor of 10 when I realied that I could cache up jobs to do, then in turn perform those jobs on each 4MB chunk of data (Dec Alpha with 4MB L3 cache). I managed to increase performance even further by aiming the code at the L2 cache instead! The total number of bytes read/written was identical, but simply changing the order in whihc they were done increated performance 12-15 times.
However, as soon as you do take into account caching issues, you sometimes start making non-portable decisions. (not always though, as generally most architectures have the problem but the lines are simply drawn in different places).
FatPhil
Also FatPhil on SoylentNews, id 863
Just because there's no innovation doesn't mean it's a bad idea; large chunks of processor design and algorithms still come from the early 70's and are still valid. That said, there may be a revolution in cache design just around the corner...
--