Slashdot Mirror


Understanding Bandwidth and Latency

M. Woodrow, Jr. writes "Ars has a very eye-opening article on the real causes of bandwidth latency and why we shouldn't just drool endlessly over maximum throughput issues. In particular, I think the author's look into the PowerPC 970 and the P4's frontside bus is interesting considering how we're constantly being told by marketers that more speed is always going to translate into massive performance gains. The issue is, of course, far more complex, and this article does a good job of thinking about the problem from an almost platform-agnostic point of view."

13 of 158 comments

  1. Re:Who cares? by FuzzyDaddy · · Score: 5, Interesting
    I like a faster computer for work. First, I do three-dimensional finite element solves, which take lots of computing time. And the more computing power I have, the larger the mesh size I want to use.

    Also, I've been doing a lot of numerical calculations in python, because the time saved writing the code is much greater than the time spent waiting for it to execute. Nevertheless, knocking down a run time from 7 hours would still be really nice, even if I have it running on someone else's computer. Even the five minute solves that could be reduced to 1 minute would make a difference - because five minutes isn't enough time to do something else.

    --
    It's not wasting time, I'm educating myself.
  2. Re:I can't wait... by Anonymous Coward · · Score: 3, Interesting

    Sorry if I made this topic a little unclear for some moderators. When a site is Slashdotted, it isn't due to a "lack of bandwidth"; do you think that some servers just "run out"? A site which is truly Slashdotted runs out of RAM and processing power to keep enough daemons alive to sustain the number of hits, and the daemon itself crashes under the load or gets so heavily bogged down that it never recovers unless it is restarted. Therefore an article on CPU latency which is Slashdotted is ironic.

    Hope this helps.

  3. Re:Who cares? by Cipster · · Score: 1, Interesting

    There are also some scientific applications where more power is always needed. Some of the data mining we do at work would not be possible without some serious computing muscle. It's even more of an issue for people working on protein structure and protein folding dynamics.

  4. Re:Bandwidth-questions? by Anonymous Coward · · Score: 2, Interesting

    What about crossbar switches like the newer video cards use?

    Don't some DRAMs have SRAM caches built in?

    What about dual-ported ram?

    How about separate buses for C&C and data?

    How about putting base instructions (zero out section of memory) into RAMs?

    Is there any memory ordering by the OS to facilitate BUS filling?

    Aren't you getting tired of these questions? :)

  5. Performance tip for software on modern processors by MichaelCrawford · · Score: 5, Interesting
    Here's a dead horse I've been beating for years.

    Much software is not written to take advantage of the architecture of modern microprocessors. If you rewrite some of your software to take advantage of them, then it is not hard to double your speed.

    The problem is that many, if not most, programs are not very intelligent in how they access the CPU cache.

    It is not uncommon for a CPU to be running at ten times the speed of the memory bus. To keep from starving the CPU, we have caches that run nearer or at the speed of the processor.

    There are two problems. One is that the cache is limited in size. The other, less well understood, is that the cache comes in small blocks called "cache lines" that are typically 32 bytes.

    So if you have a cache miss at all, or you fill up the cache and have to write a cache line back to memory, your memory bus is going to be occupied for the time it takes to write 32 bytes. The external data bus of the PowerPC is 64 bits (8 bytes) so there will be four memory cycles, during which the processor is essentially stopped.

    What can you do to maximize performance? Make better use of the cache. If you use some memory, use it again right away. Use other memory that's right next to it. Avoid placing data values near each other that won't be used near each other in time.

    Simply rearranging the order of some items in a struct or class member list may make cache usage more effective.

    Also be aware of how your data structures affect the cache. Be aware of data you don't see, like heap block headers and trailers.

    Arrays are often more efficient than linked lists, especially if you are going to traverse them all at once, because each item in a linked list will likely be loaded in a different cache line, where an array may get several items together in a cache line.

    Finally, if you really have a structure that's full of small items that is accessed in a highly random way, consider turning off caching for the memory the data structure occupies. You won't get the benefit of the cache after you've accessed an item, but on the other hand you won't have to wait to fill a 32-byte cache line each time you read a single item.

    Imagine a lookup table of bytes that's several hundred k in size, accessed very randomly - you would benefit from not using the cache.

    --
    Request your free CD of my piano music.
  6. The miracle of cache by Anonymous Coward · · Score: 5, Interesting
    The article doesn't go into the miracles of modern cache architecture. It's impressive that memory that's about 50x too slow for its CPU can be made to work effectively at all.

    Once upon a time, on mainframes of the 1960s, minicomputers of the 1970s, and desktop computers of the 1980s, there was no cache. Every time the CPU wanted something from memory, it went all the way out to the memory bus (which, in early minis and PCs, was also the peripheral bus.) This was OK, because memory latencies were about 1000ns, and that was reasonably well matched to CPU speeds in the 1 MHz range.

    But today, we have 2GHz CPUs. We thus ought to have 0.5ns main memory to match, but what we have is about two orders of magnitude slower. The fact that modern systems are capable of papering over this issue is, when you think about it, a huge achievement. Of course, what really makes it go is that fast, but expensive, memory in the caches.

    Virtual memory hasn't done as well over the years. In the 1960s, the fastest drums for paging turned at around 10,000 RPM. Today, the fastest disks for paging turn at around 10,000 RPM. (Bandwidth is way up, but it's RPM that determines latency.) Meanwhile, real main memory has become about 20x faster, and main memory as seen by the CPU at the front of the cache is about 1000x faster. There's nothing cheaper than DRAM but faster than disk to use for a cache, so caching isn't an option. As a result, virtual memory buys you less and less as time goes on. With RAM at $100/GB, it's almost time to kill off paging to disk. Besides, it runs down the battery.

  7. Re:Anyone remember this by AvitarX · · Score: 5, Interesting

    I am probably being seriously trolled, but the guy was shown to be a total fraud.

    Wired had an article about it around the beginning of the year.

    All the sceptics were correct, and eventually the believers let the idea slip out of the collective consciousness, not wanting to have to admit they were totally duped.

    --
    Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
  8. Fairly Unimpressive by Kommet · · Score: 5, Interesting
    First, a caveat: I've been a regular Ars reader for the last two years. That said, I did not care for this article for the following reasons:
    • It was too shallow for the truly technical and too contorted for the uninitiated to follow. The author mixed metaphors, then piled confusing illustrations atop constant admonitions not to let the illustrations mislead you.
    • It tried to cover theory and therefore didn't include any real-world examples drawn from either modern or historic system designs, with the exception of a short blurb about the Apple G3. It switched haphazardly from assuming a 3 cycle latency on memory reads to 9, then back to 3, then to 6, without explaining where those numbers came from. Graphs have large ranges with no explanation of whether one would ever see a situation that mimics the higher end of the graph.
    • It was not internally consistent. The choice of bus speeds in the bandwidth examples jumps back and forth between 100 MHz and 133 MHz, which means that the examples cannot be compared to each other. Also, the illustrations show what the bandwidth usage would be for a 4 word burst, then show a graph that goes into the low hundreds of words.

    Summing up, the article doesn't inform the technical, will confuse the non-technical, doesn't follow any consistent set of example conditions, contains very arbitrary graphs, and is generally poorly written. It is possible that I couldn't do any better (before I get flamed), but I doubt any technical writer worth his/her salt would do much worse.

  9. This is more important to modern game optimization by The+Optimizer · · Score: 5, Interesting

    I have worked on low-level systems for commercial PC games for over 6 years now.

    When I started in the mid-1990s, the current thinking about optimization among those who cared was all about reducing cycle counts and pairing instructions for a Pentium. Memory system and bus behavior was mostly ignored or assumed to be rendered irrelevant by on-chip caches.

    During this time, while I was working on the graphics core for Age of Empires, I had lunch with Michael Abrash, who was at id software working on Quake at the time. While eating Mexican food, he casually mentioned the results of some memory bandwidth testing he had done and how he was shaping the rasterizer to make use of the time spent waiting on memory writes. This interested me enough to perform similar tests on my own work, and the results were telling.

    I wound up with core rendering code that, if you used the conventional cycle counting wisdom of the time, appeared to be slower than what it replaced... but in fact was faster, especially for various effects processing. Both games had very large hand-written assembly software rendering routines, on the order of 10K+ lines.

    The reason for this, of course, was that memory bandwidth was being maxed out, and with clever restructuring of code it was possible to put the wait time to use on related processing, even if the code appeared to be more awkward and cumbersome that way. Though the exact memory behaviors would vary from system to system, one thing that was true and only got more so was that CPU speed was outstripping memory speed. Games like Quake and Age of Empires would have to process, in what usually amounts to a mutated memory copy, large amounts of textures or sprites each frame; so the data in question was pretty much guaranteed not to be in the CPU caches.

    You would think that with the current generation of games using Hardware 3D only, this issue would be reduced to upload speed across the AGP Bus, but if Age of Mythology is any indication, that's not going to happen. In Age of Mythology we were able to make some significant performance gains by using the same techniques of coding to make the most of the slower speed and latency of main memory.

    As long as the effort keeps paying off in increased FPS rates, we're going to be coding our games to account for and best deal with the realities of how the CPU relates to and waits on cache and system memories.

  10. Caches old tech by Goonie · · Score: 3, Interesting

    Caches have been used in mainframes and minis since 1969, when the IBM 360/85 used one for exactly the same reason that modern CPUs need cache: the low-cost memory technology of the time (magnetic cores, IIRC) was much slower than the CPU, and memory that was fast enough was expensive.

    --

    Any sufficiently advanced technology is indistinguishable from a rigged demo
    --Andy Finkel (J. Klass?)
  11. The bliss of SDRAM banking by gwappo · · Score: 3, Interesting
    What I couldn't find in the article is that it is possible to reach the maximum transfer rate to SDRAM, because an SDRAM chip has multiple banks: e.g., one can issue a command to one bank while still receiving data from another.

    This avoids paying the latency each time, since the next read command can be issued while the previous read's CAS latency and data burst are still in progress.

    There are more details on this in the SDRAM specification (lost the URL but it's out there - I think it's Intel who wrote it though).

  12. Re:How Does Increasing FSB affect Performance? by EmagGeek · · Score: 3, Interesting
    Now what does this mean for "real world" performance? It means that many applications will see either a very small performance increase or none at all, as it is latency and not bandwidth that is the most important performance factor. Let us explain this in more detail.

    The real world scoop on this is that someone typing a document in OpenOffice or surfing the internet won't see any performance increase over a Pentium-II 233 MHz machine with 64MB of 60ns RAM. Gamers might get 46 gajillion frames per second instead of 42 gajillion frames per second, which is completely indistinguishable to humans, so they won't notice either.

    I, on the other hand, might see a simulation of a 5250-node electromagnetic scattering problem take 36 hours instead of 39 hours, which is quite significant. But, I would probably get the same increase in performance by going through and cleaning up my code a little. FORTRAN is funny that way...

    Making computers faster to the nth power only makes code that's worse to the n+1th power :)

  13. Re:There are too many issues, and it gets too comp by Jay+Carlson · · Score: 5, Interesting

    Here we go again. I really don't have all day to poke holes in this, and because I'm actually trying to cite and verify I'm going to completely miss the moderation window, and lose readership. While some of the claims are correct, don't assume I agree with any of them just because I didn't refute.

    A good PCI-X capable Fiber Channel card on a mac [...]

    There are no Macs that support PCI-X. I am therefore suspicious of the numbers you claim for this configuration.

    Next, RC5. The rant here seems similar to another Anonymous Coward post back here; I'm not going to copy in my response again; quick summary: I didn't buy my computer to run RC5 really fast, and neither did you.

    Cold memory random read and write is FASTER on macs than DDR machines as seen in benchmarks but this author does hit upon that topic indirectly a little. Even if macs in Feb 2002 were faster than AMD for scatterred random read and write, the current 3 desktop macs all use DDR ram now so probably lack speed boost for that action, but do have write agregate (combined writes) across pci bus and other tricks.

    This paragraph is confused. Yes, "cold start" memory latency is very important for many tasks, and is often overlooked. But how can the first sentence be true when many Macs are DDR machines? And where are these benchmarks? I just went looking for DDR Mac latency scores and couldn't find anything. Does anyone have lmbench memory latency numbers for the Xserve or the current PowerMacs? Oh, and write combining is hardly a Mac trick.

    The hiddedn "backside only" cache of Pentium 4, and older macs, is the reason you could only have one cpu.


    Incorrect. You just need a cache coherency protocol between your processors. "Backside" has nothing to do with it. For example, the dual-processor Pentium III box I'm typing this on has "backside" cache on each processor; it's just hidden inside the CPU packaging rather than brought out to extra pins to connect to an external cache.

    There is no "PACK(1)" prgma for c structures on a mac.

    #include <stddef.h>
    #include <stdio.h>
    struct foo { char c; int i; } __attribute__ ((packed)); /* GCC extension: no padding between members */
    int main(void) { printf("%d\n", (int)offsetof(struct foo, i)); return 0; } /* byte offset of i */

    happily prints "1" on 10.2. In fact, if i doesn't cross a double-word boundary, there is no penalty for use on later CPUs. Yes, I just verified this on the G4 downstairs.

    And RAM? Don't make me laugh! Try to find an AMD board that takes 4 gigabytes of RAM and USES it as fast as the fastest AMD can. every tweaker site says you can only use one 512MB part and have a max of 512MB.

    Although you can't get the absolute, topped out single-CPU performance with it, dual-CPU boards like the Tyan ThunderK7Xpro support up to 4G of registered PC2100 RAM now; these boxes still comfortably beat current top-end G4s at tasks like SPEC CPU2000. If you really want a lot of memory you'll have to get a box from a major vendor; the Dell PowerEdge 6650 comes to mind as a 16G machine. Unfortunately, there aren't any AMD boxes out there like this that I know of, but Hammer will change that.

    In 2002 no linux with any normal tweak allows a user task to hold and lock 1.5GB of reeal ram, its all virtual or fake.

    Get an Alpha. Although I have no direct experience with this, reliable sources claim you've been able to go past the 32-bit 4G address space limit for several years.

    thankfully apple is migrating to 40 bit address space physically soon in august with the new lightweight Power4.

    Why wait? Apple isn't the only vendor out there.