Understanding Bandwidth and Latency

Bandwidth by Patik · 2002-11-06 18:50 · Score: 5, Informative

Here's a handy bandwidth chart for common components to bookmark for easy reference.

Seems familiar... by josh+crawley · 2002-11-06 19:13 · Score: 4, Informative

This description of bandwidth and latency in CPUs and memory is almost the same as in network transmissions. It's really easy to increase the bandwidth (10 Mbit to 100 Mbit to 1000 Mbit)... but try as hard as you can to make those electrons go faster along with the equipment...
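A rough way to see the point is to split delivery time into a fixed latency plus the time to push the bits onto the wire. The sketch below is only an illustration: the 1 ms latency and the 1,500-byte request are assumed numbers, not measurements from any real link.

```python
def transfer_time_ms(size_bytes, bandwidth_mbit, latency_ms):
    """Time to deliver one message: fixed latency plus serialization time."""
    serialization_ms = size_bytes * 8 / (bandwidth_mbit * 1_000_000) * 1000
    return latency_ms + serialization_ms


LATENCY_MS = 1.0       # assumed one-way delay; link speed does not change it
SMALL_REQUEST = 1_500  # assumed request size in bytes

for mbit in (10, 100, 1000):
    total = transfer_time_ms(SMALL_REQUEST, mbit, LATENCY_MS)
    print(f"{mbit:>5} Mbit/s: {total:.3f} ms")
```

Under these assumptions, going from 10 to 1000 Mbit raises bandwidth a hundredfold but only roughly halves the delivery time of a small request, because the fixed latency term dominates.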

Re:Performance tip for software on modern processo by Anonymous Coward · 2002-11-06 19:35 · Score: 2, Informative

Sounds like a job for the compiler to me, and btw, you never have to wait on the cache. The trick is to query the cache and memory at the same time for a data item. If it's in cache, then the memory request will be cancelled, if it's not in cache, then memory goes just as fast as it ever would. Cache is truly amazing in that if you are using a write-through scheme, it only provides a boost to performance...there's no speed-size tradeoff at all.

Why the compiler can't help you by MichaelCrawford · 2002-11-06 19:53 · Score: 5, Informative

I don't think you're going to be able to find a compiler that can reorder your struct or class members depending on how they are accessed. It may be possible to have one do that based on profiling, but I think that is beyond current compiler technology.

Also every compiler I have ever come across stores struct and class members in the order they are declared in the source file. I don't think that's guaranteed by either C or C++, but that's how it always is.
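To make the member-ordering point concrete, here is a small sketch using Python's ctypes, which lays structures out the way the platform's C compiler would. The struct names, the field names and the 64-byte cache line are all assumptions for illustration; the idea is just to count how many cache lines the frequently used fields touch.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumed line size, it varies by CPU


class Scattered(ctypes.Structure):
    # Two hot counters separated by a large, rarely used buffer.
    _fields_ = [
        ("hits", ctypes.c_uint32),
        ("debug_name", ctypes.c_char * 200),
        ("misses", ctypes.c_uint32),
    ]


class Grouped(ctypes.Structure):
    # Same members, with the hot counters declared next to each other.
    _fields_ = [
        ("hits", ctypes.c_uint32),
        ("misses", ctypes.c_uint32),
        ("debug_name", ctypes.c_char * 200),
    ]


def lines_touched(struct_type, field_names):
    """Cache-line indices covered by the named fields (struct assumed line-aligned)."""
    lines = set()
    for name in field_names:
        field = getattr(struct_type, name)
        first = field.offset // CACHE_LINE
        last = (field.offset + field.size - 1) // CACHE_LINE
        lines.update(range(first, last + 1))
    return lines


hot = ["hits", "misses"]
print("scattered:", sorted(lines_touched(Scattered, hot)))
print("grouped:  ", sorted(lines_touched(Grouped, hot)))
```

In the scattered layout the two counters land on different cache lines, while the grouped layout keeps them on one. That reordering is done by hand here, which is the point of the comment: the compiler keeps the declared order.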

Also, the compiler is not going to make fundamental changes to your data structures and algorithms for you. If you write some code to manipulate a linked list, there's no way the compiler will change that to an array for you because it thinks it might be more efficient.

The one case I have seen tools able to affect cache access in a positive way is the use of code profilers that record the most common code paths in your program and then edit the executable binary so that all the less common code paths are towards the end of the file. Thus if you take an uncommon branch, you might jump back and forth a megabyte within a single subroutine.

Apple's MrPlus did that. It was based on an IBM RS-6000 tool whose name I don't recall.

This has the advantage not just of improving cache performance but of reducing paging - a greater percentage of the code pages that are resident in memory are used for something useful, rather than containing code that is mostly jumped over. Uncommonly used code will all be at the end of the file and may never be paged in.

One problem with a tool like this is that the results are only valid for a certain use of the program. If you have a program that can be used in many different ways, it may be difficult to find a test case that helps you.

--
Request your free CD of my piano music.

Re:Why the compiler can't help you by Anonymous Coward · 2002-11-06 21:33 · Score: 2, Informative

this is why the programmer should use the offsetof macro :-)

Re:Why the compiler can't help you by fstanchina · 2002-11-06 23:15 · Score: 2, Informative

Also every compiler I have ever come across stores struct and class members in the order they are declared in the source file. I don't think that's guaranteed by either C or C++, [...]

Yes, it is.

Re:Why the compiler can't help you by Ben+Hutchings · 2002-11-07 00:45 · Score: 4, Informative

Also every compiler I have ever come across stores struct and class members in the order they are declared in the source file. I don't think that's guaranteed by either C or C++, but that's how it always is.

That's guaranteed to happen to a group of non-static member variables with no access specifiers among them. So for example in:

class foo { public: int bar; int baz; private: int quux; };

'baz' is guaranteed to be placed after 'bar' in an instance of class foo; but 'quux' might not be placed after 'baz'.

There are too many issues, and it gets too complex by Anonymous Coward · 2002-11-06 20:00 · Score: 5, Informative

There are too many issues, and it gets too complex quickly.

For example, a few synchronization commands, and eieio paranoia when not needed in drivers, can slow down IO.

A good PCI-X capable Fiber Channel card on a mac can get 49 microseconds per complete genuine 512 byte IO (over 20,000 IOs per second), and that's per channel, but just a few mistakes in the hardware interrupt handler, or misunderstood cache coherency paranoia, can add many microseconds.
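The arithmetic behind those figures is easy to check. The sketch below uses the 49 microseconds and 512 bytes from the post and assumes I/Os are issued strictly back to back on one channel; the 10 microseconds of extra overhead is a made-up value to show how quickly the rate drops.

```python
IO_TIME_US = 49  # microseconds per complete I/O, from the post
IO_SIZE = 512    # bytes per I/O, from the post

ios_per_second = 1_000_000 / IO_TIME_US
throughput_mb_s = ios_per_second * IO_SIZE / 1_000_000


def ios_with_overhead(extra_us):
    """I/O rate if every request pays extra_us microseconds of added overhead."""
    return 1_000_000 / (IO_TIME_US + extra_us)


print(f"{ios_per_second:.0f} IOs/s, {throughput_mb_s:.1f} MB/s")
print(f"{ios_with_overhead(10):.0f} IOs/s with 10 us of extra overhead (assumed)")
```

At this request size the channel moves only about 10 MB per second, so the limit is per-request latency rather than bandwidth, and a few added microseconds per request cost a sizable share of the I/O rate.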

Even the fastest direct IDE cannot get speeds that fast (49 microseconds).

And SCSI 320 barely does.

But what about the REAL WORLD? As we all know from the press releases of the RC5 competition, a standard mac g4 laptop was over twice as fast as Pentium 4 desktop units.

In fact, apple only sells dual cpu systems now, and the ones they sold in Feb 2002 got over 21,129,654 RC5 keyrate for dual 1.0 GHz macs.

The fastest AMD boards, dual cpu, no L3 cache available, get only 10,807,034 RC5 keyrate!

half for AMD

way less than that for Pentium 4.

Why? The Pentium 4 lacks a good 32 bit barrel shifter.(4 clock latency on left shift!)

Why is the AMD so slow? Perhaps because it has no L3 cache, but the object code and data set of the RC5 benchmark (get the source yourself) fit in the AMD L2 cache.

Cold memory random read and write is FASTER on macs than on DDR machines, as seen in benchmarks, but this author does hit upon that topic indirectly a little. Even if macs in Feb 2002 were faster than AMD for scattered random read and write, the current 3 desktop macs all use DDR ram now, so they probably lack the speed boost for that action, but do have write aggregation (combined writes) across the PCI bus and other tricks.

Macs also have a lot of other little advantages to offset the penalty of huge RISC instructions... a great C-language way of programming the SIMD execution engine (called Altivec by Moto), and its SIMD is very good. Its SIMD has a few very minor assists to the RC5, but as experts have shown, removing them competently does not cripple apple's speed much.

The fastest macs have always had the fastest GENUINE IO.

In fact, copying data in 1992 was twice as fast to do for real using RAID than copying to /dev/null (nothing transferred) on a high end SUN!

People complained that /dev/null was not optimized.

the truth is that commands that xfer data using cache controller tricks and not using cpu registers on macs help out enormously. Motorola 040 machines xfer 128 bit aligned data 16 bytes per cycle using the strange and special cache controller command (trick) called Move16.

move16 made the sun servers look slow and silly, not the badly written /dev/null.

in 1995 I saw with my own eyes 6 Seagate ST 12450W drives (each had two heads per surface very very rare drives) transfer almost 65 megabytes per second sustained on a high end mac.

that was 7 years ago, and the fastest PC for all the money you had with the fastest adaptec controller you could find and the best raid was : LESS THAN ONE FIFTH AS FAST.

And now in 2002 you have people endlessly worrying about AGP and PCI-X without understanding those are OUTPUT tweaks not INPUT speedup tweaks, and people trying to push the streaming speed of ram faster and faster without realizing that the speeds of L1 and L2 cache are key.

Or ability to SHARE the L2 cache amongst multiple cpus.

The hidden "backside only" cache of Pentium 4, and older macs, is the reason you could only have one cpu.

having two fast, low voltage, high speed cpus or more is key to performance in 2003.

you cannot do this with Pentium 4, you need to use expensive xeons if you want 2 intel chips on one board, else use pentium 3.

And pricewatch this week shows a 800 Mhz itanium from intel (base model now) at over 7 thousand dollars.

7 thousand! no wonder 6 or 8 box vendors dropped plans to use itanium this year. Geeeez.

FAST L2 and L3 cache is where its at.

The latest mac cpus to come out in a couple months (not the Power4 based ones in august), the moto ones, will allow 4 megabytes of L3 cache instead of 2, and have a staggering 512K of L2 cache running at 1 ghz, instead of 500 Mhz.

I did not even think that was possible in todays world.

feeding a RISC chip is harder than intel, because the data code cache only holds half as much logic with the wasteful 32 bit opcodes, but the ALIGNED data, the sweet wonderful mac world ALIGNED DATA, helps the mac enormously.

There is no "PACK(1)" pragma for C structures on a mac.

I am not kidding.

It's not part of the mac experience.

True, many fields are 2 byte aligned instead of 4 byte aligned at times, but since 1995 apple has stressed 32 bit aligned integers and 64 bit aligned quads religiously.

Macs perform well because of ALIGNMENT of structures.

Do architecture people understand how many obscene PACK(1) (8 bit aligned) structures there are in Win32?

do they even code on multiple systems?

I do. If you use a 64 bit integer that is 2 byte aligned on a Pentium and pass it as argument to MS Win32 it will silently fail in some of its timer routines. That never happens on a Mac, plus mac routines tend to paranoia check a little more often on input, but not always.
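For anyone who has not looked at what 1-byte packing does to a layout, here is a small sketch using Python's ctypes, which follows the platform's C layout rules. The struct and field names are invented for illustration and are not taken from Win32 or any Mac header.

```python
import ctypes


class Natural(ctypes.Structure):
    # Default layout: the compiler-style rules pad after the 1-byte flag.
    _fields_ = [("flag", ctypes.c_uint8), ("stamp", ctypes.c_uint64)]


class Packed(ctypes.Structure):
    # Equivalent of a PACK(1) structure: no padding between members.
    _pack_ = 1
    # Newer Python versions ask for an explicit layout when _pack_ is set;
    # older versions simply ignore this attribute.
    _layout_ = "ms"
    _fields_ = [("flag", ctypes.c_uint8), ("stamp", ctypes.c_uint64)]


def describe(struct_type):
    """Return (offset of the 64-bit member, total struct size) in bytes."""
    return struct_type.stamp.offset, ctypes.sizeof(struct_type)


print("natural:", describe(Natural))
print("packed: ", describe(Packed))
```

In the packed version the 64-bit member starts at offset 1, so it is misaligned wherever the structure itself sits in memory. In the natural version padding moves it to a multiple of the platform's alignment for 64-bit integers, at the cost of a larger structure.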

multiple registers help a coder
multiple registers help assembly coders avoid push-pop hell

people need to think about those things too before wasting time religiously bragging about high end streaming speed of RAM.

ever timed REAL IO? Real IO pumped from card to card faster using good DMA back-to-back faster than could ever be moved using conventional single registers?

architecture is all about asking why?
Why use floppy disks in 2002?
Why use big hot parallel printer connectors in 2002 or ever ( IBM CHRP ref spec demanded it on hand helds!)
(IBM "PREP" spec required centronics connector on handhelds too!!!, MS Win 95 spec insisted on it strongly, but said SCSI was not highly important)

Why use ISA in 2002?
Why use hot hot steamy chips that do lots of speculative branching eating up power? Apple's fastest machines use microcontrollers. I kid you not. They are using MICROCONTROLLER cpus with very very short pipelines and very very little speculative branching and very low power requirements.

Why use PS2 keyboards?
Why insist on VGA at boot?
Why insist on legacy BIOS calls that have no relevance except for ancient OSes that are not even guaranteed to run by motherboard vendors?

I respect legacy too, but the legacy of Apple spurned all of these in 1984. Yup. macs never had any of that slop, though they do have open-boot style pci, and now use vga style connectors (though the connectors have detect diodes in them to see what size monitor you have), and have IDE now as default drive, though very fast performing vs pci bus contention. In fact apple's 14 drive server uses 14 IDE controller chips, one for each of the 14 IBM GXP120 gig drives. 14 chips! 14 masters! Each pumping 35 megabytes sustained or more, and for only 15,000 bucks with fiber channel. Unfortunately it's a 3U, but the drives are cold.

I think its funny that people try to write papers yapping about things that can change rapidly in one or two years, or have little bearing on true io speeds.

The sad truth is that right now... RIGHT NOW... in November 2002, not ONE motherboard on pricewatch or for sale that I know of supports PCI-X, except for rich-man XEON and rich-man itanium.

NONE.

No Pentium 4 with PCI-X, no mac (though apple X-Serve is 488 megabytes per second per slot), and no MP AMD and no AMD thunderbird class.

Just vapor-hardware and promises for 3 straight years.

Now AMD said they will give fast PCI only to Hammer chips and hammer chips are getting horrible benchmark speeds.

Does anyone realize how pathetic PCI slots are in 2002?

I have in my machine 3 different pci-X cards and i have to run all of them at slower speeds even though some are capable of 770 megabytes per second bidirectionally (in-out simultaneous), at 133Mhz.

This world sucks.

And RAM? Don't make me laugh! Try to find an AMD board that takes 4 gigabytes of RAM and USES it as fast as the fastest AMD can. every tweaker site says you can only use one 512MB part and have a max of 512MB.

That's insane. I have not one machine with less than 768 MB in this house, and my main mac from 1995 supported and allowed a single user process to hold and lock (physical real ram) 1.5 GB of memory.

In 2002 no linux with any normal tweak allows a user task to hold and lock 1.5GB of real ram, it's all virtual or fake.

Even most UNIX never allow more than 3 GB of physical REAL RAM in total usage ever... its all wasted for bad VM designs.

nobody cares. Everyone says "I know 7 different unix OS that support 4 GB of ram" and then you have to remind them that VM is not RAM and physical RAM can be easily proven to be there or not and that no intel unix allows tasks to utilize 1.5 gigabytes of real physical RAM normally. And even if netbsd is hacked it runs no shrinkwrapped software. all shrinkwrapped software is mac or windows.

thankfully apple is migrating to 40 bit address space physically soon in august with the new lightweight Power4.

does anyone think that this nightmare of no physical ram in OSes is a real problem or not?

sure NT has a /3GB switch and another version allows bank switching 16 GB of ram slowly, but no NT system allows a single process to utilize over 1 GB of real physical genuine RAM (critical for FDTD 3d energy simulations).

Arrrgh! I hate all this least common-denominator lowest-cost-component world.

Fake powersupplies that lie about ratings over 450 watts

cheap-ass capacitors that heat, blow and leak because tantalum costs too many extra cents

traces that corrode instantly in salt air near ANY coast, especially in florida

fans that silently die and expensive fans doing the same

drives that have 34% failure rates after 18 months of usage (Fujitsu lawsuit, IBM lawsuit)

And to think that people try to make themselves feel good that they can move memory from one area to another quickly using ram streaming commands. BIG DEAL! Try moving it to a disk drive or through a network connector or to another CPU. (many multi cpu designs cap inter cpu speed to 50% or 25%).

who cares about ram streaming! bus contention, pci latency, and cold ram jump reading are far more critical issues.

But no one cares. They just want to download mp3, porn, dvdrips, and console warez and you can do that on any 5 year old box.

What a terrible world when a hard drive from seagate in 1995 allowed 12 megabytes per second SUSTAINED and in Nov 2002 the fastest single spindle drives sustain only 39 MB per second or so.

What garbage.

And the PCI bus is not 50 times faster after all these years, or 40 times, or 20 times faster, or 10 times faster, it's so slow even at 64 bit, 66 MHz I want to just cry.
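For reference, the theoretical peak of a parallel bus is just its width times its clock. The sketch below uses the nominal PCI and PCI-X clock rates; these are peak figures that ignore arbitration and protocol overhead, so sustained numbers on real hardware are lower.

```python
def peak_mb_s(width_bits, clock_mhz):
    """Theoretical peak in MB/s: bytes per transfer times million transfers per second."""
    return width_bits / 8 * clock_mhz


BUSES = {
    "PCI 32-bit / 33 MHz": (32, 33.33),
    "PCI 64-bit / 66 MHz": (64, 66.66),
    "PCI-X 64-bit / 133 MHz": (64, 133.33),
}

baseline = peak_mb_s(*BUSES["PCI 32-bit / 33 MHz"])
for name, (width, clock) in BUSES.items():
    peak = peak_mb_s(width, clock)
    print(f"{name:<24} {peak:7.0f} MB/s  ({peak / baseline:.0f}x original PCI)")
```

Even the fastest variant listed comes out at about eight times the original 32-bit, 33 MHz bus on paper, which fits the complaint that the bus has not become ten times faster.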

Self-modifying code is bad... by Goonie · 2002-11-06 20:16 · Score: 4, Informative

This is obviously a troll, but seeing it's been moderated up I should warn the kiddies out there that this is a *bad idea*...

Self-modifying code makes pipelining, branch prediction, instruction caching (particularly on SMP systems) and a bunch of other things dangerous, and just slows down the processor as it checks for and deals with it. IIRC some architectures don't even explicitly check for it anymore and die horribly if you try it.

Aside from the fact that trying to debug self-modifying code is just asking for fscking trouble....

--

Any sufficiently advanced technology is indistinguishable from a rigged demo
--Andy Finkel (J. Klass?)

Re:It can be done better with self-modifying code by wik · 2002-11-06 20:20 · Score: 5, Informative

Self-modifying code is a horrible burden for the L1 caches. If you allow writes to code pages, the processor must treat the writes as data writes in the L1 D-cache. This means that there are now two different versions of the same cache line in the cache hierarchy, which means you need to keep them coherent. This means there has to be coherency between the L1 I-cache and L1 D-cache. Yuck.

It's going to take more than 1 cycle to keep those lines coherent, which is going to increase your average I-cache latency (and is exactly what you're trying to avoid). You really don't want to do this on modern processors. Besides, if your inner loop is big enough to thrash in your I-cache, you've got bigger problems (pun intended)... and if it's not big enough, you're not going through that slow memory bus, are you?

Bottom line: self-modifying code is a bad idea.

Second bottom line: Modern Java JITs end up doing this sort of thing, which gives computer architects a major headache!

--
/ \
\ / ASCII ribbon campaign for peace
x
/ \

But self modifying code is fun! by MichaelCrawford · 2002-11-06 20:23 · Score: 3, Informative

We couldn't have viruses without self-modifying code. What would all the teenagers do?

Used with care, self-modifying code is a powerful and useful tool. And yes there are caching issues - most processors have separate data and code caches, so writing into code using data instructions will put the code into the wrong cache, so you have to flush it.

We couldn't have program loaders without self-modifying code!

A number of the products I wrote for the Mac back at Working Software were self-modifying code, and they did very well.

You just have to know what you're doing, that's all.

Another use for them is dynamically relinking a running program as you edit its source code. Instead of relinking and relaunching the whole program, you can just reload the last subroutine that you edited. This is done by a number of development environments, and can greatly speed up the edit-compile-debug cycle.

--
Request your free CD of my piano music.

Calculating Latency by SailorBob · 2002-11-06 21:07 · Score: 5, Informative

From:
Ace's Guide to Memory Technology

Basically, the latency of the whole memory (From FSB to DRAM) system is equal to the sum of:

The latency between the FSB and the chipset (+/- 1 clockcycle)
The latency between the chipset and the DRAM (+/- 1 clockcycle)
The RAS to CAS latency (2-3 clocks, charging the right row)
The CAS latency (2-3 clocks, getting the right column)
1 cycle to transfer the data.
The latency to get this data back from the DRAM output buffer to the CPU (via the chipset) (+/- 2 clockcycles)

This gets you the first word (8 bytes). A good PC100 SDRAM CAS 2 will have a latency of about 9 cycles, and in the next 3 cycles another 24 bytes will be ready. The PC100 SDRAM will, in this case, be able to get 32 bytes in 12 cycles.

If you want to calculate the latency that CPU sees, you need to multiply the latency of the memory system with the multiplier of the CPU. So a 500 MHz (5 x 100 MHz) CPU will see 5 x 9 cycles latency. This CPU will have to wait at least 45 cycles before the information that could not be found in the L2-cache will be available in the cache.
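The sum above can be worked through in a few lines. This sketch takes the cycle counts from the list, using the low end wherever a range is given (an assumption that matches the CAS 2 case), and adds the effective burst rate, which the quoted text does not state explicitly.

```python
# Bus-cycle counts from the list above; low end of each range assumed.
STAGES = {
    "FSB to chipset": 1,
    "chipset to DRAM": 1,
    "RAS to CAS": 2,
    "CAS": 2,
    "data transfer": 1,
    "DRAM buffer back to CPU": 2,
}

BUS_MHZ = 100        # PC100 SDRAM
CPU_MULTIPLIER = 5   # 500 MHz CPU on a 100 MHz bus
WORD_BYTES = 8

first_word_cycles = sum(STAGES.values())
cpu_cycles_waiting = first_word_cycles * CPU_MULTIPLIER
first_word_ns = first_word_cycles * 1000 / BUS_MHZ

# A 32-byte burst: first word after 9 cycles, three more words in 3 cycles.
burst_cycles = first_word_cycles + 3
burst_bytes = 4 * WORD_BYTES
effective_mb_s = burst_bytes * BUS_MHZ / burst_cycles
peak_mb_s = WORD_BYTES * BUS_MHZ

print(f"first word: {first_word_cycles} bus cycles = {first_word_ns:.0f} ns "
      f"= {cpu_cycles_waiting} CPU cycles")
print(f"32-byte burst: {effective_mb_s:.0f} MB/s effective vs {peak_mb_s} MB/s peak")
```

With these numbers a cache-line fill reaches only about a third of the bus's peak rate, because most of the 12 cycles are spent waiting for the first word rather than transferring data.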

--

Woopty Doo Basil, what does it all mean?!

How Does Increasing FSB affect Performance? by SailorBob · 2002-11-06 21:26 · Score: 3, Informative

More From Ace's:

Athlon XP 2800+: 333 MHz FSB and nForce 2

First of all, we tested the Athlon XP 2800+ on the "normal" KT333 platform with a 17x multiplier, the FSB set at 133 MHz DDR (266 MHz) and the memory set at 166 MHz DDR (333 MHz), CAS at 2, RAS to CAS at 3, Precharge at 3. The second time, the KT333 platform (ASUS A7V333) was set at a FSB of 166 MHz (333 MHz) and the multiplier was set to 13.5x.

...

Where do I start? There is an enormous amount of info hidden in this table. Let us first start with the 266 MHz versus 333 MHz FSB discussion.

There have been many reports that show that the Athlon does not benefit much from an increase in FSB clockspeed, moving from 266 MHz to 333 MHz. But Membench tells us exactly why. First of all, compare the two KT333 latency numbers (64 byte strides). All BIOS settings were exactly the same, only the FSB speed, and thus the multiplier, are different. Normally one would expect, everything else being equal, that the Athlon with the 166 MHz FSB would see 25% lower latency, but the CPU with the 166 MHz FSB version actually sees a higher latency! This shows that the (ASUS) KT333 board, in order to guarantee proper stability, increases certain latencies of the memory controller. Memory bandwidth increases by 14%, which is also less than expected.
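The two configurations can be compared with simple arithmetic. The sketch below assumes the nominal 133 and 166 MHz figures are really 133.33 and 166.67 MHz, and it assumes a fixed number of bus cycles per memory access, which is exactly the assumption the measurement above shows does not hold on this board.

```python
def core_mhz(fsb_mhz, multiplier):
    """CPU core clock is the FSB base clock times the multiplier."""
    return fsb_mhz * multiplier


def cycle_ns(fsb_mhz):
    """Length of one bus cycle in nanoseconds."""
    return 1000 / fsb_mhz


SLOW_FSB = 400 / 3  # nominal "133 MHz", assumed exact value
FAST_FSB = 500 / 3  # nominal "166 MHz", assumed exact value

print(f"17 x 133:   {core_mhz(SLOW_FSB, 17):.0f} MHz core")
print(f"13.5 x 166: {core_mhz(FAST_FSB, 13.5):.0f} MHz core")

clock_gain = FAST_FSB / SLOW_FSB - 1
latency_cut = 1 - cycle_ns(FAST_FSB) / cycle_ns(SLOW_FSB)
print(f"bus clock +{clock_gain:.0%}, cycle time -{latency_cut:.0%}")
```

The core clocks are nearly identical, so any difference comes from the bus and memory side. The 25% in the quote matches the clock increase; in nanoseconds, a fixed cycle count would get 20% shorter, so a measured increase in latency means the board is spending more cycles per access.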

Now what does this mean for "real world" performance? It means that many applications will see either a very small performance increase or none at all, as it is latency and not bandwidth that is the most important performance factor. Let us explain this in more detail.

--

Woopty Doo Basil, what does it all mean?!

Re:Bandwidth - Useless Without Latency by SailorBob · 2002-11-06 22:05 · Score: 3, Informative

You're making the classic mistake of assuming that Bandwidth is what matters, when in fact it's the latency which kills applications. Having bandwidth numbers without the corresponding latency is worse than useless, it's misleading for the uninformed.

--

Woopty Doo Basil, what does it all mean?!

Read this by chthon · 2002-11-06 22:32 · Score: 2, Informative

Everyone who is interested in issues of bandwidth and latency should read this book :

Computer Architecture : A Quantitative Approach, by John L. Hennessy and David A. Patterson

Re:I can't wait... by virtual_mps · 2002-11-07 02:45 · Score: 2, Informative

When a site is Slashdotted it isn't due to a "lack of bandwidth", you think that someone's servers just "run out"? A site which is truly Slashdotted runs out of ram and processing power

Sure that's true, except when it isn't. I've seen a site get /.'d, and the machine was fine--but the entire organization where the machine was located ran out of bandwidth. Local users could access the web site but traffic to and from the internet was halted. It really depends on your mix of static/dynamic pages, and the average request size. For a static site it's fairly easy to max out a 100Mbit lan--which is more internet bandwidth than most people outside of hosting facilities can easily obtain.
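A back-of-the-envelope check shows how few requests it takes to fill a link. The 50 KB average response and the 1.5 Mbit uplink below are assumed values chosen for illustration, not figures from the site described.

```python
def requests_per_second(link_mbit, avg_response_bytes):
    """Requests per second that fully saturate a link of the given speed."""
    return link_mbit * 1_000_000 / 8 / avg_response_bytes


AVG_RESPONSE_BYTES = 50_000  # assumed average page plus images

for link_mbit in (1.5, 10, 100):
    rate = requests_per_second(link_mbit, AVG_RESPONSE_BYTES)
    print(f"{link_mbit:>5} Mbit/s link saturates at {rate:.2f} requests/s")
```

Under these assumptions a 100 Mbit LAN is full at 250 static requests per second, a load that a server handing out plain files can reach long before it runs out of RAM or CPU.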

latency measurement by Anonymous Coward · 2002-11-07 03:02 · Score: 1, Informative

Use lmbench: www.bitmover.com/lmbench
It will tell you what the real latency of your
system is, including the cycles inside the CPU
before the processor drives the request on the bus.

Re:There are too many issues, and it gets too comp by leonbrooks · 2002-11-07 03:34 · Score: 3, Informative

In 2002 no linux with any normal tweak allows a user task to hold and lock 1.5GB of real ram, it's all virtual or fake.

False for not-pretend 64-bit architectures (e.g. UltraSparc) and has been for years.

--
Got time? Spend some of it coding or testing

Re:I can't wait... by CyberBry · 2002-11-07 03:47 · Score: 2, Informative

Ars gets slashdotted all the time - I've never seen their server even flinch.

--

----
Bryan Samis
http://www.thesamis.net

Re:Performance tip for software on modern processo by sql*kitten · 2002-11-07 05:07 · Score: 3, Informative

Much software is not written to take advantage of the architecture of modern microprocessors. If you rewrite some of your software to take advantage of them, then it is not hard to double your speed.

It's worse than you think on PCs (whatever OS they're running). The article talks about "bus mastering" and "data tenure", but on real workstation-class hardware there is no bus (not even one with a "north bridge"); there's a proper switch, like Crossbow or GigaPlane. These give you point-to-point, non-blocking sustained peak I/O. On a switched system, if components A and B want to communicate they can do so at the switch's full speed, and so can components C and D, no contention at all. That means no wasted cycles for the bus to constantly change ownership.
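A toy model makes the difference visible. It is deliberately crude: the 800 MB/s link speed and the transfer sizes are assumed, arbitration overhead is ignored, and each port of the switch is treated as one link shared by its inbound and outbound traffic.

```python
def shared_bus_seconds(transfers, link_mb_s):
    """On a shared bus every transfer takes its turn, so the times add up."""
    return sum(size_mb for _, _, size_mb in transfers) / link_mb_s


def crossbar_seconds(transfers, link_mb_s):
    """On a non-blocking switch each port is limited only by its own traffic."""
    busy_mb = {}
    for src, dst, size_mb in transfers:
        busy_mb[src] = busy_mb.get(src, 0) + size_mb
        busy_mb[dst] = busy_mb.get(dst, 0) + size_mb
    return max(busy_mb.values()) / link_mb_s


LINK_MB_S = 800  # assumed per-link speed
transfers = [("A", "B", 400), ("C", "D", 400)]  # (source, destination, megabytes)

print("shared bus:", shared_bus_seconds(transfers, LINK_MB_S), "s")
print("crossbar:  ", crossbar_seconds(transfers, LINK_MB_S), "s")
```

Because A to B and C to D share no ports, the switch finishes both in the time of one transfer, while the shared bus needs the sum of the two.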

If you're doing a job that requires heavy use of the "bus" on an x86 system (lots of storage I/O, lots of random memory access hence lots of L2 misses), then optimizing code for cache locality is the least of your problems, you'll never get around the fact that the inefficient design of the hardware itself is the bottleneck. Fancy FSBs and the like are just workarounds and don't address the real problem.
