Understanding Bandwidth and Latency

Bandwidth by Patik · 2002-11-06 18:50 · Score: 5, Informative

Here's a handy bandwidth chart for common components to bookmark for easy reference.

Re:Bandwidth by Courageous · 2002-11-06 19:27 · Score: 3, Insightful

Actually, this was quite an interesting chart. Seeing that Ultra-320 is twice PCI, I wonder if A) anyone makes Ultra-320 for PCI, and B) anyone is stupid enough to buy it?

C//

it's easy ... by ZeekWatson · 2002-11-06 18:51 · Score: 4, Funny

latency causes there to already be 100 posts when you bring up the comments page ... and you thought you were first! :)

seeing is believing by Anonymous Coward · 2002-11-06 18:53 · Score: 0

"Ars has a very eye-opening article on the real causes of bandwidth latency

I for one think that an eye-opening article is going to increase my visual bandwidth latency ;)

I can't wait... by Grip3n · 2002-11-06 18:56 · Score: 1, Funny

until this site is Slashdotted...

--
To make a pun demonstrates the highest understanding of a language

Re:I can't wait... by Anonymous Coward · 2002-11-06 19:06 · Score: 3, Interesting

Sorry if I made this topic a little unclear for some moderators. When a site is Slashdotted, it isn't due to a "lack of bandwidth". Do you think that servers just "run out"? A site which is truly Slashdotted runs out of RAM and processing power to keep enough daemons alive to sustain the number of hits, and the daemon itself crashes under the load or gets so heavily bogged down that it never recovers unless it is restarted. Therefore an article on CPU latency which is Slashdotted is ironic.

Hope this helps.
Re:I can't wait... by hdparm · 2002-11-06 20:08 · Score: 2

This is not OT. Please moderate parent up. I guess +1, Insightful would be the best fit, although in this case I would prefer +1, Clever.
Re:I can't wait... by fraxas · 2002-11-07 00:49 · Score: 1

Except that in a lot of cases, sites do "run out" of bandwidth.
Many ISPs have bandwidth caps.
Re:I can't wait... by virtual_mps · 2002-11-07 02:45 · Score: 2, Informative

When a site is Slashdotted, it isn't due to a "lack of bandwidth". Do you think that servers just "run out"? A site which is truly Slashdotted runs out of RAM and processing power

Sure, that's true, except when it isn't. I've seen a site get /.'d, and the machine was fine--but the entire organization where the machine was located ran out of bandwidth. Local users could access the web site but traffic to and from the internet was halted. It really depends on your mix of static/dynamic pages, and the average request size. For a static site it's fairly easy to max out a 100Mbit LAN--which is more internet bandwidth than most people outside of hosting facilities can easily obtain.
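
A rough back-of-the-envelope sketch of the point about static sites, assuming a purely hypothetical average response size (the 50,000-byte figure is made up for illustration, not a measurement):

```python
# Estimate how many static responses per second fill a link.
# AVG_RESPONSE_BYTES is an assumed, illustrative figure.
LINK_BITS_PER_SEC = 100_000_000   # 100 Mbit/s, as in the comment above
AVG_RESPONSE_BYTES = 50_000       # hypothetical average static response


def requests_per_second_to_saturate(link_bits_per_sec, avg_response_bytes):
    """Return the request rate at which responses alone fill the link."""
    link_bytes_per_sec = link_bits_per_sec / 8
    return link_bytes_per_sec / avg_response_bytes


rate = requests_per_second_to_saturate(LINK_BITS_PER_SEC, AVG_RESPONSE_BYTES)
print(f"{rate:.0f} requests/s fill the link")
```

Under those assumptions a few hundred hits per second are enough, which a machine serving static files may well handle without running out of RAM or CPU.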
Re:I can't wait... by CyberBry · 2002-11-07 03:47 · Score: 2, Informative

Ars gets slashdotted all the time - I've never seen their server even flinch.

--

----
Bryan Samis
http://www.thesamis.net

Re:Who cares? by FuzzyDaddy · 2002-11-06 19:00 · Score: 5, Interesting

I like a faster computer for work. First, I do three dimensional finite element solves, which take lots of computing time. And the more computing power I have, the larger the mesh size I want to use.

Also, I've been doing a lot of numerical calculations in Python, because the time saved writing the code is much greater than the time spent waiting for it to execute. Nevertheless, knocking down a run time from 7 hours would still be really nice, even if I have it running on someone else's computer. Even the five minute solves that could be reduced to 1 minute would make a difference - because five minutes isn't enough time to do something else.

--
It's not wasting time, I'm educating myself.

Re:Who cares? by Grip3n · 2002-11-06 19:02 · Score: 3, Insightful

"64k ought to be enough for anyone." - You know who

Remember, we are living in a day when there are massive amounts of progress. Just because you cannot see any immediate use for the processing power doesn't mean there isn't any anywhere or that you won't need it in the future.

What about 3D animators? Compile times? People in the print field who deal with massive 300DPI images? What about actually being able to have that Microsoft Paperclip run without 100% CPU usage?

Games are certainly pushing the CPU mark in one area, but remember that computing isn't limited to home use.

Hope this helps.

--
To make a pun demonstrates the highest understanding of a language

never trust the back of the box. by Anonymous Coward · 2002-11-06 19:05 · Score: 0

I wonder if the marketing people who write about memory bandwidth know the marketing people who think a gigabyte is a billion bytes.

They're in cahoots I tell you; cahoots!

Re:never trust the back of the box. by ThaReetLad · 2002-11-06 20:45 · Score: 2, Insightful

A gigabyte IS a billion bytes. Read the SI definition of Gigabyte.
While we're on the subject Ars talks about 8 bytes as being called a "word". As a programmer I was under the impression that a "word" is 2 bytes, a Double Word (DWORD) was 4 bytes and a Quadword was 8 bytes or 64 bits. What's he on about?

--
You can't win Darth. If you mod me down, I shall become more powerful than you could possibly imagine
Re:never trust the back of the box. by Catskul · 2002-11-06 20:56 · Score: 2

I believe word size depends on the processor, i.e. a 32-bit processor has a 32-bit (4-byte) word and a 64-bit processor has a 64-bit (8-byte) word.

--

Im not here now... Im out KILLING pepperoni
Re:never trust the back of the box. by Anonymous Coward · 2002-11-06 21:51 · Score: 0

Note that NIST is not SI. The binary multiple prefixes are from an IEC standard, specifically IEC 60027-2. The standards organization responsible for the SI units is the Bureau International des Poids et Mesures.
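
For reference, the gap between the two readings of "giga" discussed above can be worked out directly. This is only a small sketch of the arithmetic; the constant and function names are chosen here for illustration:

```python
# Decimal (SI) prefix versus binary (IEC) prefix for "giga".
SI_GIGABYTE = 10**9       # 1 GB  = 1,000,000,000 bytes
BINARY_GIBIBYTE = 2**30   # 1 GiB = 1,073,741,824 bytes


def shortfall_percent(decimal_bytes, binary_bytes):
    """How much smaller the decimal unit is, as a percentage of the binary one."""
    return (binary_bytes - decimal_bytes) / binary_bytes * 100


difference = BINARY_GIBIBYTE - SI_GIGABYTE
print(f"difference: {difference} bytes")
print(f"shortfall: {shortfall_percent(SI_GIGABYTE, BINARY_GIBIBYTE):.2f}%")
```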
Re:never trust the back of the box. by joto · 2002-11-06 21:55 · Score: 3, Insightful

Well, obviously a "word" will be different from processor to processor, since different processors have different word sizes. A 16-bit processor will have a 16-bit word, a 32-bit processor a 32-bit word, and a 64-bit processor a 64-bit word.
The Pentium is a 32-bit processor, but for historical reasons, Intel still calls 16 bits a word and 32 bits a dword. This is only to confuse you, pay no attention to the marketing behind it...
As a side note, it could be said that when C was designed, an "int" really was intended to be a "word". For compatibility reasons, most 64-bit processors now have their words (64 bits) as either "long" or "long long", since everyone for the last few decades has assumed that an int is 32 bits.
Of course, that was for processor words, which are the most interesting for the programmer. But in this article, it was the memory bus that was discussed, and it is of course allowed to define what it thinks is a word (how many bits will be transferred at once when you read or write a memory address).
Having the memory bus be 64 bits wide on a 32-bit processor is perfectly sane and acceptable, as is having the memory bus 16 bits wide on a 32-bit processor (I believe the 386 did this).
In the end, because of marketing and other reasons, it's best not to use "word" at all. Personally, I see a "word" as something that, when put together with others, results in speech or text.
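
To make the x86 naming convention from this sub-thread concrete, here is a small sketch using Python's struct module with its fixed "standard" sizes. The dictionary name is invented for illustration; the sizes reflect the Intel WORD/DWORD/QWORD convention described above, not the native word of whatever machine runs the script:

```python
import struct

# "<" selects struct's standard sizes, which do not depend on the host CPU.
X86_NAMES = {
    "WORD": "<H",   # 16-bit unsigned integer
    "DWORD": "<I",  # 32-bit unsigned integer
    "QWORD": "<Q",  # 64-bit unsigned integer
}


def size_in_bits(fmt):
    """Return the size in bits of a single struct format code."""
    return struct.calcsize(fmt) * 8


for name, fmt in X86_NAMES.items():
    print(f"{name}: {size_in_bits(fmt)} bits")
```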

Re:Who cares? by Cipster · 2002-11-06 19:06 · Score: 1, Interesting

There are also some scientific applications where more power is always needed. Some of the data mining we do at work would not be possible without some serious computing muscle. It's even more of an issue for people working on protein structure and protein folding dynamics.

i thought you had to buy latency at fry's by Anonymous Coward · 2002-11-06 19:07 · Score: 0

although i hate slashdot because of its latency, i like the articles on slashdot. the way i see it, even if the speed of these faster computers goes up, if the latency doesn't go down then you will see only nominal performance gains. i think i heard once that the p4 had tons more latency than ddr memory. could someone clarify?

Re:i thought you had to buy latency at fry's by TracerJPN_USMC · 2002-11-07 00:53 · Score: 0

*blank stare in your direction*

--
magnanomous.

Re:Who cares? by Ironica · 2002-11-06 19:08 · Score: 3, Insightful

I know this is a bit off-topic, but...how many of you actually need a computer that's faster than the top of the line out there now?

Well, let's just say you decide to buy a new computer. And the sales guy starts telling you that you could get this one for $1300, but this over here is much "faster" and it's only $1600. (Yeah, I know, I build my own too. But this article is written to be accessible to those who may not be quite so handy.) Knowing what those numbers mean is very important in making that decision. It will help people realize that they don't need that speed, just as you mention.

--
Don't you wish your girlfriend was a geek like me?

Re:Who cares? by The+Optimizer · 2002-11-06 19:11 · Score: 4, Insightful

One thing I have found though, and it may apply to your work, is that when the process is computation intensive, particularly 80-bit precision FPU intensive, then the FPU processing time is large enough to mask variations in memory designs. That is, you're FPU speed bound, not memory bound.

Anyone remember this by Rooked_One · 2002-11-06 19:12 · Score: 3, Funny

There was a guy who demonstrated a way to transmit data over the electromagnetic field surrounding every powerline. All you do is plug your computer into a power outlet, basically. The throughput was incredible, and latency everywhere would be under 10ms, as they demonstrated.

Anyone hear from these guys lately, or at least know a url, if they haven't been bought out by the telecoms?

Re:Anyone remember this by AvitarX · 2002-11-06 19:20 · Score: 5, Interesting

I am probably being seriously trolled, but the guy was shown to be a total fraud.

Wired had an article about it around the beginning of the year.

All the sceptics were correct, and eventually the believers let the idea slip out of the collective consciousness, not wanting to have to admit they were totally duped.

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:Anyone remember this by mamba-mamba · 2002-11-06 19:24 · Score: 2, Insightful

Wired magazine wrote an article about a company (which was mostly just a front for one man) that fits your description.

It was pretty clear from the article that the guy was a crook and that there was nothing to his claims. But he got a lot of money from a lot of supposedly smart people.

By the way, what does the claim that "latency everywhere would be under 10ms" mean?

MM
--

--
By including this sig, the copyright holders of this work or collection unreservedly place it in the public domain.
Re:Anyone remember this by Anonymous Coward · 2002-11-07 05:34 · Score: 0

You're probably thinking of Media Fusion. As far as I know they're still busily separating VCs from their money, giving nothing but happy fiction in return.
They have now gone SEVEN years without producing anything whatsoever.
Re:Anyone remember this by crapulent · 2002-11-08 02:10 · Score: 1

You might be referring to the Magix Box hoax, where a Florida man got millions of dollars in investments (from folks like Blockbuster, Intel, and Ted Turner's son) for this mysterious device that supposedly transmitted data over regular phone lines at very fast rates. There was also a Slashdot story about the device.

Seems familiar... by josh+crawley · 2002-11-06 19:13 · Score: 4, Informative

This description of bandwidth and latency in CPUs and memory is almost the same as in network transmissions. Really easy to increase the bandwidth (10 Mbit to 100 Mbit to 1000 Mbit)... But try as hard as you can to make those electrons go faster along with the equipment...
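
A minimal sketch of that network analogy, assuming a 1500-byte frame, a 100 m cable, and a signal speed of roughly two thirds of the speed of light (all three figures are illustrative assumptions, not taken from the article):

```python
# Time to clock a frame onto the wire shrinks with bandwidth;
# time for the signal to cross the cable does not.
SIGNAL_SPEED_M_PER_S = 2.0e8   # assumed: about 2/3 of c in a cable
FRAME_BYTES = 1500             # assumed frame size
CABLE_METERS = 100             # assumed cable length


def serialization_us(frame_bytes, link_bits_per_sec):
    """Microseconds needed to put the whole frame on the wire."""
    return frame_bytes * 8 / link_bits_per_sec * 1e6


def propagation_us(distance_m, signal_speed=SIGNAL_SPEED_M_PER_S):
    """Microseconds for the signal to travel the cable, at any link speed."""
    return distance_m / signal_speed * 1e6


for mbit in (10, 100, 1000):
    ser = serialization_us(FRAME_BYTES, mbit * 1_000_000)
    prop = propagation_us(CABLE_METERS)
    print(f"{mbit:>4} Mbit: serialize {ser:8.1f} us, propagate {prop:.2f} us")
```

The serialization column drops tenfold at each step while the propagation column stays put, which is the part no amount of extra bandwidth buys back.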

Re:Seems familiar... by Anonymous Coward · 2002-11-07 03:57 · Score: 0

It's really easy to increase the memory bandwidth of modern computer systems, but the problem is that it is EXPENSIVE to do this. Anyone who has read the most basic undergraduate text in computer architecture (Hennessy and Patterson) knows how to increase the bandwidth of a computer system.

To get throughput, you can go narrow and fast or wide and slow. There are tradeoffs for each approach.
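
The "narrow and fast or wide and slow" tradeoff comes down to peak bandwidth being bus width times transfer rate. A small sketch, using a 16-bit bus at 800 million transfers per second and a 64-bit bus at 200 million as illustrative figures (roughly the shape of a PC800 RDRAM channel versus PC1600 DDR, but treat the numbers as assumptions):

```python
def peak_mb_per_s(bus_width_bits, mega_transfers_per_sec):
    """Theoretical peak bandwidth in MB/s: bytes per transfer times transfer rate."""
    return bus_width_bits / 8 * mega_transfers_per_sec


narrow_fast = peak_mb_per_s(16, 800)   # narrow bus, high transfer rate
wide_slow = peak_mb_per_s(64, 200)     # wide bus, low transfer rate

print(f"narrow and fast: {narrow_fast:.0f} MB/s")
print(f"wide and slow:   {wide_slow:.0f} MB/s")
```

Both routes reach the same theoretical peak; what differs is the cost of signaling speed on one side and of pins and traces on the other.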

People aren't willing to pay for premium bandwidth. Look at how RAMBUS Pentium 4 based systems have flopped in the marketplace.

The trick is to balance processor speed and memory bandwidth to provide the performance at a price that people are willing to pay.

Re:Bandwidth-questions? by Anonymous Coward · 2002-11-06 19:15 · Score: 2, Interesting

What about crossbar switches like the newer video cards use?

Don't some DRAMs have SRAM caches built in?

What about dual-ported RAM?

How about separate buses for C&C and data?

How about putting base instructions (zero out section of memory) into RAMs?

Is there any memory ordering by the OS to facilitate BUS filling?

Aren't you getting tired of these questions? :)

Re:Who cares? by tx_mgm · 2002-11-06 19:15 · Score: 1

what kind of car do you drive? is it an SUV? a sports car? or do you drive the car that gets the best fuel mileage in the world and don't care if it can't go any faster than 65mph (since you don't need to anyway)? good for you if you do....but, it's the same concept as that (IMHO).

but honestly, this kind of stuff needs to be looked at. right now, computers don't operate to their potential, and if it wasn't for people out there constantly looking for ways to improve then we would still be using computers the size of office buildings that take half of the city's power just to run...oh, and they would operate slow as hell. do you have a laptop? how about a PDA? cell phone? these things are only here now and have all of those sexy features because of people who are constantly trying to make things run as efficiently as possible. oh, and you're right. it does make the games better, too.

--
Gentlemen...BEHOLD!
-Dr. Weird

Re:Who cares? by Anonymous Coward · 2002-11-06 19:15 · Score: 0

With faster computers comes more functionality.
Sure, you can talk about bloated software but think of all the things it can do that weren't even possible before. Faster computers get you that. New stuff to do! There are things that we can't do now that would be real cool to do. Because the computers aren't fast enough. Don't think about the stuff you DID with a computer. Or it DOES now. Think about what it CAN do, what it MIGHT do if it just had more juice.

Performance tip for software on modern processors by MichaelCrawford · 2002-11-06 19:16 · Score: 5, Interesting

Here's a dead horse I've been beating for years.

Much software is not written to take advantage of the architecture of modern microprocessors. If you rewrite some of your software to take advantage of them, then it is not hard to double your speed.

The problem is that many, if not most programs are not very intelligent in how they access the CPU cache.

It is not uncommon for a CPU to be running at ten times the speed of the memory bus. To keep from starving the CPU, we have caches that run nearer or at the speed of the processor.

There are two problems. One is that the cache is limited in size. The other, less well understood, is that the cache comes in small blocks called "cache lines", that are typically 32 bytes.

So if you have a cache miss at all, or you fill up the cache and have to write a cache line back to memory, your memory bus is going to be occupied for the time it takes to write 32 bytes. The external data bus of the PowerPC is 64 bits (8 bytes) so there will be four memory cycles, during which the processor is essentially stopped.
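
The arithmetic in that paragraph can be written out as a tiny sketch. The function name is made up here, and the model ignores burst setup and wait states, so it is only the idealized count of bus transfers:

```python
def bus_cycles_per_line(cache_line_bytes, bus_width_bits):
    """Number of bus transfers needed to move one full cache line."""
    bus_bytes = bus_width_bits // 8
    # Ceiling division, in case the line is not a multiple of the bus width.
    return -(-cache_line_bytes // bus_bytes)


# 32-byte cache line over a 64-bit (8-byte) external data bus.
cycles = bus_cycles_per_line(32, 64)
print(f"{cycles} memory cycles per cache line")
```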

What can you do to maximize performance? Make better use of the cache. If you use some memory, use it again right away. Use other memory that's right next to it. Avoid placing data values near each other that won't be used near each other in time.

Simply rearranging the order of some items in a struct or class member list may make cache usage more effective.

Also be aware of how your data structures affect the cache. Be aware of data you don't see, like heap block headers and trailers.

Arrays are often more efficient than linked lists, especially if you are going to traverse them all at once, because each item in a linked list will likely be loaded in a different cache line, where an array may get several items together in a cache line.
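
A toy model of why that happens, counting how many distinct 32-byte cache lines a traversal touches. The addresses are invented: the array packs 4-byte items back to back, and the linked list is assumed (hypothetically) to have its nodes spaced 64 bytes apart by the allocator:

```python
CACHE_LINE_BYTES = 32


def cache_lines_touched(addresses, line_bytes=CACHE_LINE_BYTES):
    """Count the distinct cache lines covered by a sequence of byte addresses."""
    return len({addr // line_bytes for addr in addresses})


ITEMS = 1000
array_addrs = [i * 4 for i in range(ITEMS)]    # contiguous 4-byte items
list_addrs = [i * 64 for i in range(ITEMS)]    # assumed scattered list nodes

print("array:", cache_lines_touched(array_addrs), "lines")
print("list: ", cache_lines_touched(list_addrs), "lines")
```

In this model the array packs eight items into each line while every list node costs a line of its own; real allocators and real caches will differ, but the ratio is the point.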

Finally, if you really have a structure that's full of small items that is accessed in a highly random way, consider turning off caching for the memory the data structure occupies. You won't get the benefit of the cache after you've accessed an item, but on the other hand you won't have to wait to fill a 32-byte cache line each time you read a single item.

Imagine a lookup table of bytes that's several hundred k in size, accessed very randomly - you would benefit to not use the cache.

--
Request your free CD of my piano music.

The miracle of cache by Anonymous Coward · 2002-11-06 19:16 · Score: 5, Interesting

The article doesn't go into the miracles of modern cache architecture. It's impressive that memory that's about 50x too slow for its CPU can be made to work effectively at all.

Once upon a time, on mainframes of the 1960s, minicomputers of the 1970s, and desktop computers of the 1980s, there was no cache. Every time the CPU wanted something from memory, it went all the way out to the memory bus (which, in early minis and PCs, was also the peripheral bus.) This was OK, because memory latencies were about 1000ns, and that was reasonably well matched to CPU speeds in the 1 MHz range.

But today, we have 2GHz CPUs. We thus ought to have 0.5ns main memory to match, but what we have is about two orders of magnitude slower. The fact that modern systems are capable of papering over this issue is, when you think about it, a huge achievement. Of course, what really makes it go is that fast, but expensive, memory in the caches.
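
A simplified sketch of how the cache papers over the gap: the cycle time follows from the clock rate, and the average access time is a weighted mix of cache and main memory. The 98% hit rate and the 50 ns memory latency are assumed figures for illustration only, and the model ignores multiple cache levels:

```python
def cycle_time_ns(clock_hz):
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def average_access_ns(hit_rate, cache_ns, memory_ns):
    """Weighted average of cache hits and main-memory misses."""
    return hit_rate * cache_ns + (1 - hit_rate) * memory_ns


cpu_cycle = cycle_time_ns(2e9)   # 2 GHz CPU
avg = average_access_ns(hit_rate=0.98, cache_ns=cpu_cycle, memory_ns=50.0)

print(f"cycle time: {cpu_cycle} ns")
print(f"average access time: {avg:.2f} ns")
```

Even with only two misses in a hundred, the misses account for most of the average, which is why hit rate matters so much.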

Virtual memory hasn't done as well over the years. In the 1960s, the fastest drums for paging turned at around 10,000 RPM. Today, the fastest disks for paging turn at around 10,000 RPM. (Bandwidth is way up, but it's RPM that determines latency.) Meanwhile, real main memory has become about 20x faster, and main memory as seen by the CPU at the front of the cache is about 1000x faster. There's nothing cheaper than DRAM but faster than disk to use for a cache, so caching isn't an option. As a result, virtual memory buys you less and less as time goes on. With RAM at $100/GB, it's almost time to kill off paging to disk. Besides, it runs down the battery.
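
The remark that RPM determines latency can be checked with a one-line calculation: on average the platter has to turn half a revolution before the wanted sector arrives. This sketch covers rotational delay only and ignores seek time:

```python
def average_rotational_latency_ms(rpm):
    """Average wait for the sector to come around: half of one revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


latency = average_rotational_latency_ms(10_000)
print(f"10,000 RPM: {latency} ms average rotational latency")
```

That is milliseconds against the nanoseconds of main memory, and it has not moved since the drum days.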

Re:The miracle of cache by Cuthalion · 2002-11-06 20:27 · Score: 3, Insightful

With RAM at $100/GB, it's almost time to kill off paging to disk. Besides, it runs down the battery.

I agree with you except that having a gig or more of RAM won't exactly do wonders for your battery life either.

--
Trees can't go dancing
So do them a big favor
Pretend dancing stinks!

Re:1000th post for Bandwidth by Anonymous Coward · 2002-11-06 19:18 · Score: 0

Congrats on your 1000th post, but it's also an accomplishment to have Bruce Perens mark you as a foe. Klerck and you are his only ones.

Hear, hear! by aztektum · 2002-11-06 19:19 · Score: 4, Insightful

I work begrudgingly in a CompUSA store and the customers are suckered by our sales guys, yet I need the $ so I just suck it up and feel dirty.

I just wish this information were more widely distributed, and that people would actually research what they're getting into. They treat it like clothes shopping: they stop in and take home something that looks cool. If you're plunking down a big wad of money you should do your research. Sadly they don't, and then they're pissed when they realize a week later they were suckered. And since I work the customer service counter, I get to play whipping boy.

(On a truly sad note this one customer swore at me and said it was "Horse shit..." that we didn't carry Dell even though I said it wasn't our decision to make...)

--
:: aztek ::
No sig for you!!

Fairly Unimpressive by Kommet · 2002-11-06 19:22 · Score: 5, Interesting

First, a caveat: I've been a regular Ars reader for the last two years. That said, I did not care for this article for the following reasons:

It was too shallow for the truly technical and too contorted for the uninitiated to follow. The author mixed metaphors, then piled confusing illustration atop constant admonitions not to let the illustration mislead you.
It tried to cover theory and therefore didn't include any real-world examples drawn from either modern or historic system designs, with the exception of a short blurb about the Apple G3. It switched haphazardly from assuming a 3 cycle latency on memory reads to 9, then back to 3, then to 6, without explaining where those numbers came from. Graphs have large ranges with no explanation of whether one would ever see a situation that mimics the higher end of the graph.
It was not internally consistent. The choice of bus speeds in the bandwidth examples jumps back and forth between 100 MHz and 133 MHz, which means that the examples cannot be compared to each other. Also, the illustrations show what the bandwidth usage would be for a 4 word burst, then show a graph that goes into the low hundreds of words.

Summing up, the article doesn't inform the technical, will confuse the non-technical, doesn't follow any consistent set of example conditions, contains very arbitrary graphs, and is generally poorly written. It is possible that I couldn't do any better (before I get flamed), but I doubt any technical writer worth his/her salt would do much worse.

Re:Fairly Unimpressive by Huge+Pi+Removal · 2002-11-06 22:06 · Score: 2

Hmmm, maybe the article was aimed at people like me. It was interesting to have a peek at the guts of a computer, but luckily I'm technical enough not to get confused by his rather odd illustrations.

I think the pictures and graphs did their job (he chose those analogies for a reason), but you have to be on the ball.

All in all, a good read for a sysadmin who isn't an electronic engineer.

--
- Oliver

The right to bear arms is only slightly less stupid than the right to arm bears...
Re:Fairly Unimpressive by Duketape · 2002-11-07 00:16 · Score: 1

... And word burst graphs should be discrete. You can't burst 1.5 words.

Andrew S. Tanenbaum by Kj0n · 2002-11-06 19:24 · Score: 5, Funny

... once wrote:
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

The latency is terrible, though.
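
A sketch of the station wagon arithmetic, with every number invented for illustration (1,000 tapes, 100 GB per tape, a 10 hour drive):

```python
def station_wagon(tape_count, bytes_per_tape, trip_seconds):
    """Return (effective bits per second, latency in seconds) for one trip."""
    total_bits = tape_count * bytes_per_tape * 8
    bandwidth_bps = total_bits / trip_seconds
    latency_s = trip_seconds   # the first bit arrives when the car does
    return bandwidth_bps, latency_s


# All three inputs are hypothetical.
bps, latency = station_wagon(
    tape_count=1_000,
    bytes_per_tape=100 * 10**9,
    trip_seconds=10 * 3600,
)

print(f"bandwidth: {bps / 1e9:.1f} Gbit/s")
print(f"latency:   {latency / 3600:.0f} hours")
```

Under those assumptions the throughput is enormous, and so is the wait for the first byte.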

Re:Andrew S. Tanenbaum by ThaReetLad · 2002-11-06 20:54 · Score: 3, Funny

Offtopic I know, but for the ultimate in bad latency try this link. An implementation of RFC-1149

--
You can't win Darth. If you mod me down, I shall become more powerful than you could possibly imagine

Re:Who cares? by jerkychew · 2002-11-06 19:30 · Score: 1

2 words: Battlefield 1942. That game kicks the crap out of my Athlon 1700+.

What do you drive? I'm willing to bet it's not a Geo Metro. But why not? Why do you really need anything faster? I mean, a Metro does the speed limit, right? Do you really need better acceleration and a higher top speed than what a Metro provides?

For those of us that strive for the best, the fastest computer is NEVER fast enough.

Re:Performance tip for software on modern processo by Anonymous Coward · 2002-11-06 19:35 · Score: 2, Informative

Sounds like a job for the compiler to me, and btw, you never have to wait on the cache. The trick is to query the cache and memory at the same time for a data item. If it's in cache, then the memory request will be cancelled, if it's not in cache, then memory goes just as fast as it ever would. Cache is truly amazing in that if you are using a write-through scheme, it only provides a boost to performance...there's no speed-size tradeoff at all.

I wouldn't worry about it too much by g4dget · 2002-11-06 19:39 · Score: 2

The tradeoffs that system designers make change constantly, and there are many other factors besides SDR/DDR that affect latency and throughput. Compiler writers also keep changing their minds about how they optimize and what cases they handle and don't handle.

The rules of thumb are pretty much the same now as they ever were: preferentially, access memory sequentially, and for non-sequential accesses, keep the accesses local; there are a bunch of programming tricks for that that work as well now as they ever did. If you can, use a hand-optimized, architecture specific library like BLAS. As a last resort, rewrite tiny bits of performance critical code in a language like Fortran 77, where the compiler may be able to do a bit more optimization than C/C++.
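
The "access memory sequentially" rule can be illustrated with a toy model of a row-major 2D array, counting how often consecutive accesses land in a different cache line. The array shape, the 8-byte item size and the 32-byte line size are assumptions picked for the example, and no real cache is being measured:

```python
def line_changes(rows, cols, item_bytes, row_major_walk, line_bytes=32):
    """Count cache-line switches between consecutive accesses to a row-major array."""
    if row_major_walk:
        order = ((r, c) for r in range(rows) for c in range(cols))
    else:
        order = ((r, c) for c in range(cols) for r in range(rows))

    changes = 0
    previous_line = None
    for r, c in order:
        # Byte offset of element (r, c) in row-major layout, mapped to its line.
        line = ((r * cols + c) * item_bytes) // line_bytes
        if previous_line is not None and line != previous_line:
            changes += 1
        previous_line = line
    return changes


sequential = line_changes(64, 64, 8, row_major_walk=True)
strided = line_changes(64, 64, 8, row_major_walk=False)
print("row-by-row walk:      ", sequential, "line changes")
print("column-by-column walk:", strided, "line changes")
```

Same data, same amount of work, but the column walk moves to a new line on every access in this model.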

If a processor, compiler, or system architecture requires any more specific hacks to reach its stated performance, then for practical purposes, its performance is overstated. The only way to know is to run your code (or a set of benchmarks similar to your code) on it and see whether it runs fast enough.

This is more important to modern game optimization by The+Optimizer · 2002-11-06 19:39 · Score: 5, Interesting

I have worked on low-level systems for commercial PC games for over 6 years now.

When I started in the mid-1990s, the current thinking about optimization among those who cared was all about reducing cycle counts and pairing instructions for a Pentium. Memory system and bus behavior were mostly ignored or assumed to be rendered irrelevant by on-chip caches.

During this time, while I was working on the graphics core for Age of Empires, I had lunch with Michael Abrash, who was at id software working on Quake at the time. While eating Mexican food, he casually mentioned the results of some memory bandwidth testing he had done and how he was shaping the rasterizer to make use of the time spent waiting on memory writes. This interested me enough to perform similar tests on my own work, and the results were telling.

I wound up with core rendering code that, if you used the conventional cycle counting wisdom of the time, appeared to be slower than what it replaced... but in fact was faster, especially for various effects processing. Both games had very large hand-written assembly software rendering routines, on the order of 10K+ lines.

The reason for this of course was that memory bandwidth was being maxed out and with clever restructuring of code, it was possible to put the wait time to use on related processing, even if the code appeared to be more awkward and cumbersome that way. Though the exact memory behaviors would vary from system to system, one thing that was true and only got more so was that CPU speed was outstripping memory speed. Games like Quake and Age of Empires would have to process, in what usually amounts to a mutated memory copy, large amounts of textures or sprites each frame; so the data in question was pretty much guaranteed not to be in the CPU caches.

You would think that with the current generation of games using Hardware 3D only, this issue would be reduced to upload speed across the AGP Bus, but if Age of Mythology is any indication, that's not going to happen. In Age of Mythology we were able to make some significant performance gains by using the same techniques of coding to make the most of the slower speed and latency of main memory.

As long as the effort keeps paying off in increased FPS rates, we're going to be coding our games to account for and best deal with the realities of how the CPU relates to and waits on cache and system memories.

Ultra 320 SCSI by Bullseye_blam · 2002-11-06 19:42 · Score: 5, Insightful

Yes, while the theoretical rate is much faster than PCI (as you noted), I believe that these cards are designed for 64-bit PCI slots, which you can see by the chart (which only lists fast/wide PCI) are 4x faster. A standard 64-bit slot running at 33 MHz (the speed at which most 32-bit slots run) is twice as fast as standard PCI.

So actually, Ultra-320 SCSI is the shit. ;)
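
The bus arithmetic behind this can be sketched as width times clock. These are theoretical peaks only, using nominal 33.33 and 66.66 MHz clocks and 320 MB/s for Ultra-320; real-world throughput on a shared bus will be lower:

```python
ULTRA320_MB_PER_S = 320


def pci_peak_mb_per_s(width_bits, clock_mhz):
    """Theoretical peak PCI bandwidth: bytes per clock times clock rate in MHz."""
    return width_bits / 8 * clock_mhz


slots = {
    "32-bit / 33 MHz": pci_peak_mb_per_s(32, 33.33),
    "64-bit / 33 MHz": pci_peak_mb_per_s(64, 33.33),
    "64-bit / 66 MHz": pci_peak_mb_per_s(64, 66.66),
}

for name, peak in slots.items():
    verdict = "enough" if peak >= ULTRA320_MB_PER_S else "too slow"
    print(f"{name}: {peak:6.1f} MB/s ({verdict} for Ultra-320)")
```

By these figures only the 64-bit, 66 MHz slot has headroom for a fully loaded Ultra-320 channel.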

It can be done better with self-modifying code by SexyKellyOsbourne · 2002-11-06 19:44 · Score: 0, Troll

By keeping all the code in the cache, inside the processor using self-modifying code in the most vital, inner loop areas of your program.

Modifying the code prevents the processor from reading further CPU instructions during the innermost loops through the slow bus, therefore giving a gargantuan speed boost.

Self-modifying code is just another programming tactic that we sacrificed for "ease of use" long ago, and it's been all but forgotten.

No CS student should be able to graduate with a degree without being able to write self-modifying code, but since very few can nowadays, here's a link:

http://reguly.net/alvaro/cic/linux-asm/self.html

Re:It can be done better with self-modifying code by Alien+Being · 2002-11-06 19:58 · Score: 2

Self-modifying code is just another programming tactic that we sacrificed for "ease of use" ...

and for quiche.
Re:It can be done better with self-modifying code by NeuroKoan · 2002-11-06 20:13 · Score: 2

Hah, we didn't get rid of self-modifying code for "ease of use", we got rid of it because we were all scared of Skynet.

--

"However," replied the universe, "The fact has not created in me A sense of obligation."
Re:It can be done better with self-modifying code by wik · 2002-11-06 20:20 · Score: 5, Informative

Self-modifying code is a horrible burden for the L1 caches. If you allow writes to code pages, the processor must treat the writes as data writes in the L1 D-cache. This means that there are now two different versions of the same cache line in the cache hierarchy, which means you need to keep them coherent. This means there has to be coherency between the L1 I-cache and L1 D-cache. Yuck.

It's going to take more than 1 cycle to keep those lines coherent, which is going to increase your average I-cache latency (and is exactly what you're trying to avoid). You really don't want to do this on modern processors. Besides, if your inner loop is big enough to thrash in your I-cache, you've got bigger problems (pun intended)... and if it's not big enough, you're not going through that slow memory bus, are you?

Bottom line: self-modifying code is a bad idea.

Second bottom line: Modern Java JITs end up doing this sort of thing, which gives computer architects a major headache!

--
/ \
\ / ASCII ribbon campaign for peace
x
/ \
Re:It can be done better with self-modifying code by g4dget · 2002-11-06 21:11 · Score: 2

Modifying the code prevents the processor from reading further CPU instructions during the innermost loops through the slow bus
And how do you think the new code gets into the cache in the first place? Whether the CPU loads it from memory or your program copies it from memory, most of it has to come from memory somewhere. But if you copy it yourself, it just gets there so much more slowly. Also, on many architectures, if you modify code, you have to synchronize the cache, which is very, very expensive.
The closest to self-modifying code these days is runtime code generation, as in JIT compilers. They win not because of cache effects but because they can generate code based on runtime information.
Re:It can be done better with self-modifying code by joto · 2002-11-06 21:37 · Score: 4, Insightful

Nope. Most modern processors have separate data and code caches. So when you write self-modifying code, the data-cache must be flushed to memory, and the code-cache reloaded. In the meantime, the pipeline will be stalled, and instruction decoding (important for x86) must start all over again. Pentium Pro handles this automagically, other processors may need special tricks.
So self-modifying code is rarely important (and of course very hard to write/maintain). Code with dynamic compilation (e.g. JVM) is possible to write in a sane way, and can give potentially large speedups. Of course, this goes for C as well. Sometimes for an inner loop, it's better to write a C-program at runtime, compile it, and load it as a dynamic library instead of having lots of parameters to the function. Of course, that is much more heavyweight than what the JVM does. It would be nice to have a portable alternative. But actually modifying that code afterwards is really hard (and inherently non-portable).
Of course, there are some uses for self-modifying code that can be made quite safe, and simple to understand. E.g. Knuth's MIX uses self-modifying code to store the return address in procedure calls. (I believe that was quite a common thing to do when making FORTRAN compilers back then...).
On the x86, such tricks are relatively easy, because the x86 tends to almost always have instructions available where you can store a full 32-bit pointer/integer in the opcode (whereas most RISC architectures will not). But you will not get a speed benefit by using it, as explained above in the first paragraph.
Re:It can be done better with self-modifying code by joto · 2002-11-06 21:52 · Score: 2

I thought skynet appeared when they started to combine self-modifying code with persistence...
Re:It can be done better with self-modifying code by banana+fiend · 2002-11-06 22:13 · Score: 1

3 problems with self modifying code:

1: Notoriously hard to debug (I don't know anybody who can write a few hundred lines of asm code without making a mistake)

2: Slow :) .... remember there's a code cache, if you write to code, it'll flush it (unless you have a chipset that supports non-write-through cache such as PS2 (I've tried it))... and it'll have to flush it at some stage anyway. It's not a magic bullet.

3: We forgot it for a reason: Intel have branch prediction, which is a GODSEND - modify your code and you lose it for all those conditional jumps

I wrote some self modifying assembly code for a robot controller back in university CS - it was a bastard to debug because I hadn't meant to write self modifying code in the first place :)) - want to write to register 9 on an old Motorola chip?

--
Johns: Well, how does it look now? Riddick: Looks clear.
Re:It can be done better with self-modifying code by fitten · 2002-11-07 02:17 · Score: 1

...and throw lots of tricks that modern processors use to improve speed out the window. Many are already mentioned in replies to this. In effect, you will lose more performance than you will gain.

You also potentially throw away deterministic behavior, which can be an especially bad thing in certain application realms.
Re:It can be done better with self-modifying code by boots@work · 2002-11-07 06:32 · Score: 1

Of course, some versions of FORTRAN couldn't do recursive calls to a function, which would seem to be the problem with directly modifying the function's code to insert the return address.

I guess you could put a self-modifying trampoline on the stack containing the return address, but ... why not just store the address then?
Re:It can be done better with self-modifying code by joto · 2002-11-07 11:48 · Score: 2

some versions of FORTRAN couldn't do recursive calls to a function
Yes, that's why. But I really think it's still true, and not just true for some ancient pre-fortran.
I guess you could put a self-modifying trampoline on the stack containing the return address, but ... why not just store the address then?
I hope you are the only one who did pose that question. The rest of us would happily just store the return address :-)
Re:It can be done better with self-modifying code by boots@work · 2002-11-07 12:01 · Score: 1

I posed the question because I was trying to work out why the previous poster said
Knuth's MMX uses self-modifying code to store the return address in procedure calls.

So I suppose either they misunderstood/misremembered Knuth, or there is some positive aspect to it that Knuth can see and I cannot.
Or perhaps they were talking about MIX (the ~1970s processor) not MMIX (the modern RISC one)! Yes, now that I think about it that sounds like a more likely explanation -- it was probably just a typo. Self-modifying code for procedure returns might be much more representative of the old mainframe machines that Knuth was modelling in his original books. Given current programming languages and machine architectures (split I/D caches) it looks a bit perverse.
Re:It can be done better with self-modifying code by joto · 2002-11-09 00:35 · Score: 2

Yup, sorry, it was a typo.

A repeat duplicate article? by xintegerx · 2002-11-06 19:51 · Score: 1

That same content was previously a duplicate article... meaning that it's the third time here that I know of.

As in, on Slashdot at least 3 times in a short time.

--

Cover your eyes and click this link!

Why the compiler can't help you by MichaelCrawford · 2002-11-06 19:53 · Score: 5, Informative

I don't think you're going to be able to find a compiler that can reorder your struct or class members depending on how they are accessed. It may be possible to have one do that based on profiling, but I think that is beyond current compiler technology.

Also every compiler I have ever come across stores struct and class members in the order they are declared in the source file. I don't think that's guaranteed by either C or C++, but that's how it always is.

Also, the compiler is not going to make fundamental changes to your data structures and algorithms for you. If you write some code to manipulate a linked list, there's no way the compiler will change that to an array for you because it thinks it might be more efficient.
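
To make the linked list versus array point concrete, here is a minimal C sketch. The names `node`, `sum_list` and `sum_array` are made up for illustration and do not come from any library. Both functions compute the same total, but the list walk chases a pointer on every step and its nodes may be scattered across memory, while the array walk touches consecutive addresses that share cache lines. The choice between the two layouts is the programmer's; the compiler will not swap one for the other.

```c
#include <stddef.h>

/* Hypothetical singly linked list node, for illustration only. */
struct node {
    int value;
    struct node *next;
};

/* Walk a linked list: every step needs the previous node's `next`
   pointer, and nodes may live anywhere in memory. */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        total += n->value;
    return total;
}

/* Walk a contiguous array: consecutive elements sit next to each
   other, so one cache line fill serves several iterations. */
long sum_array(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Timing the two on a large data set (with list nodes allocated in shuffled order) is the usual way to see the cache effect; the sketch only shows the two access patterns.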

The one case I have seen tools able to affect cache access in a positive way is the use of code profilers that record the most common code paths in your program and then edit the executable binary so that all the less common code paths are towards the end of the file. Thus if you take an uncommon branch, you might jump back and forth a megabyte within a single subroutine.

Apple's MrPlus did that. It was based on an IBM RS-6000 tool whose name I don't recall.

This has the advantage not just of improving cache performance but of reducing paging - a greater percentage of the code pages that are resident in memory are used for something useful, rather than containing code that is mostly jumped over. Uncommonly used code will all be at the end of the file and may never be paged in.

One problem with a tool like this is that the results are only valid for a certain use of the program. If you have a program that can be used in many different ways, it may be difficult to find a test case that helps you.

--
Request your free CD of my piano music.

Re:Why the compiler can't help you by LordSah · 2002-11-06 21:01 · Score: 1

The compiler can't really reorder fields of a class/struct because the programmer could potentially address directly into the class without using the variable. There would be some trouble with that if the programmer couldn't predict where the data was going to be.
Re:Why the compiler can't help you by Anonymous Coward · 2002-11-06 21:33 · Score: 2, Informative

this is why the programmer should use the offsetof macro :-)
Re:Why the compiler can't help you by LordSah · 2002-11-06 22:58 · Score: 1

The preprocessor replaces the macro before the code is actually compiled. If your optimizer then reordered all the fields in the struct, offsetof screws up as well.

No re-ordering classes :)
Re:Why the compiler can't help you by fstanchina · 2002-11-06 23:15 · Score: 2, Informative

Also every compiler I have ever come across stores struct and class members in the order they are declared in the source file. I don't think that's guaranteed by either C or C++, [...]

Yes, it is.
Re:Why the compiler can't help you by Ben+Hutchings · 2002-11-07 00:40 · Score: 2

The preprocessor doesn't know what the layout of the structure is, and it doesn't have to. offsetof() is typically defined in <stddef.h> as something like:
#define offsetof(_T, _M) ((size_t)&((_T*)0)->_M)
which the compiler will evaluate based on the way it actually laid out the structure. But see my comment above.
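
A small sketch of that in practice, using the standard `offsetof` from `<stddef.h>`. The struct `sample` and the two helper functions are invented for this example. The offsets come from whatever layout the compiler actually chose, so the code never has to guess where a member lives.

```c
#include <stddef.h>   /* offsetof, size_t */

/* Example struct; the name and fields are made up for illustration. */
struct sample {
    char   tag;
    int    count;
    double weight;
};

/* offsetof is resolved by the compiler against the real layout,
   including any padding it inserted between members. */
size_t sample_count_offset(void)
{
    return offsetof(struct sample, count);
}

size_t sample_weight_offset(void)
{
    return offsetof(struct sample, weight);
}
```

The exact values depend on the platform's alignment rules, which is the reason to ask the compiler rather than hard-code them.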
Re:Why the compiler can't help you by Ben+Hutchings · 2002-11-07 00:45 · Score: 4, Informative

Also every compiler I have ever come across stores struct and class members in the order they are declared in the source file. I don't think that's guaranteed by either C or C++, but that's how it always is.

That's guaranteed to happen to a group of non-static member variables with no access specifiers among them. So for example in:

class foo { public: int bar; int baz; private: int quux; };

'baz' is guaranteed to be placed after 'bar' in an instance of class foo; but 'quux' might not be placed after 'baz'.
Re:Why the compiler can't help you by Anonymous Coward · 2002-11-07 01:50 · Score: 0

At least for C the only guarantee is that through a pointer to the struct you can access its first member, as if the pointer pointed at that member. So, given

struct foo { int bar; int baz; };

and

struct foo * ptr; /* pretend it's initialized */

you can use it in expressions like

*(int *) ptr = 2;

and it will assign to the struct's bar element. It says nothing about the remaining items. You have to access those either normally or through the offsetof macro.
Re:Why the compiler can't help you by Anonymous Coward · 2002-11-07 03:59 · Score: 0

This is the hope of managed runtime systems like Java or .NET. Hopefully the JIT can do some profiling and reorganize the data on the fly to be more cache friendly.

Here's an idea...During garbage collection, why not add a cache optimization step to rearrange memory in a cache friendly way.
Re:Why the compiler can't help you by fstanchina · 2002-11-07 11:55 · Score: 1

That's right, you can have arbitrarily sized holes between struct members, but I'm pretty sure they are guaranteed to be stored in the order of declaration.
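
Here is a rough C sketch of what those holes cost and why declaration order matters when you lay out a struct by hand. The struct names are invented for the example. Both structs hold the same three members; only the order differs. On a typical 64-bit ABI the first is usually 24 bytes and the second 16, but the exact sizes are platform dependent, so treat those numbers as an assumption.

```c
#include <stddef.h>

/* Same members, two declaration orders (illustrative names). */
struct padded    { char a; double b; char c; };
struct reordered { double b; char a; char c; };

size_t padded_size(void)
{
    return sizeof(struct padded);
}

size_t reordered_size(void)
{
    return sizeof(struct reordered);
}

/* Bytes in struct padded that hold no member data at all. */
size_t padded_waste(void)
{
    return sizeof(struct padded)
         - (sizeof(char) + sizeof(double) + sizeof(char));
}
```

Since the compiler keeps members in declaration order, putting the widest members first is a manual fix that packs more useful data into each cache line.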

A platform agnostic point of view.... by Anonymous Coward · 2002-11-06 19:59 · Score: 0, Funny

This is so completely insensitive. I am a platform atheist and I don't like having your "beliefs" forced on me.

I have no idea what this thing in front of me is, or how it got here, but I don't share your fucking beliefs. Just leave me alone, m'kay?

There are too many issues, and it gets too complex by Anonymous Coward · 2002-11-06 20:00 · Score: 5, Informative

There are too many issues, and it gets too complex quickly.

For example, a few synchronization commands and eieio paranoia in drivers where it isn't needed can slow down IO.

A good PCI-X capable Fibre Channel card on a mac can get 49 microseconds per complete genuine 512 byte IO (over 20,000 IOs per second), and that's per channel, but just a few mistakes in the hardware interrupt handler or misunderstood cache coherency paranoia can add many microseconds.
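
Taking the 49 microsecond figure at face value, the arithmetic behind "over 20,000 IOs per second" can be sketched in C as below. The function names are made up for illustration. At 49 microseconds per operation this works out to roughly 20,400 IOs per second, which at 512 bytes each is only about 10 MB/s, so the figure is a measure of latency rather than bandwidth.

```c
/* I/O operations per second, given the time one operation takes
   in microseconds (operations issued back to back). */
double ios_per_second(double microseconds_per_io)
{
    return 1e6 / microseconds_per_io;
}

/* Resulting throughput in bytes per second for a fixed transfer size. */
double bytes_per_second(double microseconds_per_io, double bytes_per_io)
{
    return ios_per_second(microseconds_per_io) * bytes_per_io;
}
```

Adding even 10 microseconds of driver overhead to each operation drops `ios_per_second(59.0)` to under 17,000, which is the point about a few mistakes in the interrupt handler.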

Even the fastest direct IDE cannot get speeds that fast (49 microseconds).

And SCSI 320 barely does.

But what about REAL WORLD? As we all know from the press releases of the RC5 competition, a standard mac g4 laptop was over twice as fast as Pentium 4 desktop units.

In fact, apple only sells dual cpu systems now, and the ones they sold in Feb 2002 got over 21,129,654 RC5 keyrate for dual 1.0 ghz macs.

The fastest AMD boards, dual cpu, no l3 cache available, get only 10,807,034 RC5 keyrate!

half for AMD

way less than that for Pentium 4.

Why? The Pentium 4 lacks a good 32 bit barrel shifter.(4 clock latency on left shift!)

Why is the AMD so slow? Perhaps because there is no L3 cache, but the object code and data set of the RC5 benchmark (get the source yourself) fit in the AMD L2 cache.

Cold memory random read and write is FASTER on macs than DDR machines as seen in benchmarks, but this author does hit upon that topic indirectly a little. Even if macs in Feb 2002 were faster than AMD for scattered random read and write, the current 3 desktop macs all use DDR ram now so probably lack the speed boost for that action, but do have write aggregation (combined writes) across the pci bus and other tricks.

Macs also have a lot of other little advantages to offset the penalty of huge RISC instructions... a great C-language way of programming the SIMD execution engine (called AltiVec by Moto) and its SIMD is very good. Its SIMD has a few very minor assists to the RC5, but as experts have shown, removing them competently does not cripple apple's speed much.

The fastest macs have always had the fastest GENUINE IO.

In fact, copying data in 1992 was twice as fast to do for real using RAID than copying to /dev/null (nothing transferred) on a high end SUN!

People complained that /dev/null was not optimized.

the truth is that commands that xfer data using cache controller tricks and not using cpu registers on macs help out enormously. Motorola 040 machines xfer 128 bit aligned data 16 bytes per cycle using the strange and special cache controller command (trick) called MOVE16.

MOVE16 made the sun servers look slow and silly, not the badly written /dev/null.

in 1995 I saw with my own eyes 6 Seagate ST 12450W drives (each had two heads per surface very very rare drives) transfer almost 65 megabytes per second sustained on a high end mac.

that was 7 years ago, and the fastest PC for all the money you had with the fastest adaptec controller you could find and the best raid was : LESS THAN ONE FIFTH AS FAST.

And now in 2002 you have people endlessly worrying about AGP and PCI-X without understanding those are OUTPUT tweaks not INPUT speedup tweaks, and people trying to speed up streaming speed of ram faster and faster without realizing that the speeds of L1 and L2 cache are key.

Or ability to SHARE the L2 cache amongst multiple cpus.

The hidden "backside only" cache of Pentium 4, and older macs, is the reason you could only have one cpu.

having two fast, low voltage, high speed cpus or more is key to performance in 2003.

you cannot do this with Pentium 4, you need to use expensive xeons if you want 2 intel chips on one board, else use pentium 3.

And pricewatch this week shows a 800 Mhz itanium from intel (base model now) at over 7 thousand dollars.

7 thousand! no wonder 6 or 8 box vendors dropped plans to use itanium this year. Geeeez.

FAST L2 and L3 cache is where its at.

The latest mac cpus to come out in a couple months (not the Power4 based ones in august), the moto ones, will allow 4 megabytes of L3 cache instead of 2, and have a staggering 512K of L2 cache running at 1 ghz, instead of 500 Mhz.

I did not even think that was possible in todays world.

feeding a RISC chip is harder than intel, because the data code cache only holds half as much logic with the wasteful 32 bit opcodes, but the ALIGNED data, the sweet wonderful mac world ALIGNED DATA helps the mac enormously.

There is no "PACK(1)" pragma for C structures on a mac.

I am not kidding.

It's not part of the mac experience.

True, many fields are 2 byte aligned instead of 4 byte aligned at times, but since 1995 apple has stressed 32 bit aligned integers and 64 bit aligned quads religiously.

Macs perform well because of ALIGNMENT of structures.

Do architecture people understand how many obscene PACK(1) (8 bit aligned) structures there are in Win32?

do they even code on multiple systems?

I do. If you use a 64 bit integer that is 2 byte aligned on a Pentium and pass it as argument to MS Win32 it will silently fail in some of its timer routines. That never happens on a Mac, plus mac routines tend to paranoia check a little more often on input, but not always.
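
For anyone who has not met 1-byte packing, here is a minimal C sketch of what it does to a 64 bit field. It assumes a compiler that accepts `#pragma pack(push, 1)` (GCC, Clang and MSVC do; it is not standard C), and the struct and function names are invented for the example. It makes no claim about any particular Win32 or Mac structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler pads after `tag` so that `value`
   starts on a boundary suited to a 64 bit integer. */
struct natural_record {
    char     tag;
    uint64_t value;
};

/* 1-byte packing: no padding, so `value` starts at offset 1 and
   is misaligned for 64 bit loads and stores. */
#pragma pack(push, 1)
struct packed_record {
    char     tag;
    uint64_t value;
};
#pragma pack(pop)

size_t natural_value_offset(void)
{
    return offsetof(struct natural_record, value);
}

size_t packed_value_offset(void)
{
    return offsetof(struct packed_record, value);
}
```

The packed version saves space, but every access to `value` is a misaligned access, which some processors handle slowly and others refuse outright.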

multiple registers helps a coder
multiple registers helps assembly coders avoid push-pop hell

people need to think about those things too before wasting time religiously bragging about high end streaming speed of RAM.

ever timed REAL IO? Real IO pumped from card to card faster using good DMA back-to-back faster than could ever be moved using conventional single registers?

architecture is all about asking why?
Why use floppy disks in 2002?
Why use big hot parallel printer connectors in 2002 or ever ( IBM CHRP ref spec demanded it on hand helds!)
(IBM "PREP" spec required centronics connector on handhelds too!!!, MS Win 95 spec insisted on it strongly, but said SCSI was not highly important)

Why use ISA in 2002?
Why use hot hot steamy chips that do lots of speculative branching eating up power? Apple's fastest machines use microcontrollers. I kid you not. They are using MICROCONTROLLER cpus with very very short pipelines and very very little speculative branching and very low power requirements

Why use PS2 keyboards?
Why insist on VGA at boot?
Why insist on legacy BIOS calls that have no relevance except for ancient OSes that are not even guaranteed to run by motherboard vendors?

I respect legacy too, but the legacy of Apple spurned all of these in 1984. Yup. macs never had any of that slop, though they do have open-boot style pci, and now use vga style connectors (though the connectors have detect diodes in them to see what size monitor you have), and have IDE now as default drive, though very fast performing vs pci bus contention. In fact apple's 14 drive server uses 14 IDE controller chips, one for each of the 14 IBM GXP 120 gig drives. 14 chips! 14 masters! Each pumping 35 megabytes sustained or more, and for only 15,000 bucks with fiber channel. Unfortunately it's a 3U, but the drives are cold.

I think its funny that people try to write papers yapping about things that can change rapidly in one or two years, or have little bearing on true io speeds.

The sad truth is that right now... RIGHT NOW... in November 2002 not ONE motherboard on pricewatch or for sale that I know of supports PCI-X, except for rich-man XEON and rich-man itanium.

NONE.

No Pentium 4 with PCI-X, no mac (though apple X-Serve is 488 megabytes per second per slot), and no MP AMD and no AMD thunderbird class.

Just vapor-hardware and promises for 3 straight years.

Now AMD said they will give fast PCI only to Hammer chips and hammer chips are getting horrible benchmark speeds.

Does anyone realize how pathetic PCI slots are in 2002?

I have in my machine 3 different pci-X cards and i have to run all of them at slower speeds even though some are capable of 770 megabytes per second bidirectionally (in-out simultaneous), at 133Mhz.

This world sucks.

And RAM? Don't make me laugh! Try to find an AMD board that takes 4 gigabytes of RAM and USES it as fast as the fastest AMD can. every tweaker site says you can only use one 512MB part and have a max of 512MB.

That's insane. i have not one machine with less than 768 MB in this house and my main mac from 1995 supported and allowed a single user process to hold and lock (physical real ram) 1.5 GB of memory.

In 2002 no linux with any normal tweak allows a user task to hold and lock 1.5GB of real ram, it's all virtual or fake.

Even most UNIX never allow more than 3 GB of physical REAL RAM in total usage ever... it's all wasted for bad VM designs.

nobody cares. Everyone says "I know 7 different unix OS that support 4 GB of ram" and then you have to remind them that VM is not RAM and physical RAM can be easily proven to be there or not and that no intel unix allows tasks to utilize 1.5 gigabytes of real physical RAM normally. And even if netbsd is hacked it runs no shrinkwrapped software. all shrinkwrapped software is mac or windows.

thankfully apple is migrating to 40 bit address space physically soon in august with the new lightweight Power4.

does anyone think that this nightmare of not physical ram in osses is real problem or not?

sure NT has a /3GB switch and another version allows bank switching 16 GB of ram slowly, but no NT system allows a single process to utilize over 1 GB of real physical genuine RAM (critical for FDTD 3d energy simulations).

Arrrgh! I hate all this least common-denominator lowest-cost-component world.

Fake powersupplies that lie about ratings over 450 watts

cheap-ass capacitors that heat, blow and leak because tantalum costs too many extra cents

traces that corrode instantly in salt air near ANY coast, especially in florida

fans that silently die and expensive fans doing the same

drives that have 34% failure rates after 18 months of usage (Fujitsu lawsuit, IBM lawsuit)

And to think that people try to make themselves feel good that they can move memory from one area to another quickly using ram streaming commands. BIG DEAL! Try moving it to a disk drive or through a network connector or to another CPU. (many multi cpu designs cap inter cpu speed to 50% or 25%).

who cares about ram streaming! bus contention, pci latency, and cold ram jump reading are far more critical issues.

But no one cares. They just want to download mp3, porn, dvdrips, and console warez and you can do that on any 5 year old box.

What a terrible world when a hard drive from seagate in 1995 allowed 12 megabytes per second SUSTAINED and in Nov 2002 the fastest single spindle drives sustain only 39 MB per second or so.

What garbage.

And the PCI bus is not 50 times faster after all these years, or 40 times, or 20 times faster, or 10 times faster, it's so slow even at 64 bit, 66 MHz I want to just cry.
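
For scale, the peak rate of a parallel bus such as PCI is just its width in bytes times its clock. A minimal C sketch follows; the function name is made up, and the result is a theoretical ceiling that ignores arbitration, latency and protocol overhead. With nominal clocks, 32 bit PCI at 33 MHz comes to about 132 MB/s, 64 bit at 66 MHz to about 528 MB/s, and 64 bit PCI-X at 133 MHz to about 1064 MB/s.

```c
/* Peak theoretical transfer rate of a parallel bus in megabytes per
   second: bytes moved per clock times millions of clocks per second.
   Assumes one transfer per clock and no protocol overhead. */
double bus_peak_mb_per_s(unsigned width_bits, double clock_mhz)
{
    return (width_bits / 8.0) * clock_mhz;
}
```

So the widest, fastest conventional PCI slot is a 4x step over the original bus, and PCI-X doubles that again.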

I'm no democrat. The demos are corporate whores by MichaelCrawford · 2002-11-06 20:02 · Score: 0, Offtopic

I voted Green, thank you very much.

Re: your sig by Myco · 2002-11-06 20:02 · Score: 1, Offtopic

Never put salt in your eyes.
Never put salt in your eyes.
Never put salt in your eyes.
Never put salt in your eyes.
Always put salt in your eyes.

AAAAAAUUUGH!!!!

God I miss kids in the hall. Thanks for the reminder.

--
My deviantArt site

One of the cause of the latency by jsse · 2002-11-06 20:07 · Score: 0, Offtopic

are the Netbios requests from those Microsoft Windows servers.

I got about a hundred netbios requests from all over the world to each of our servers attached to the Internet, every day. No one seems to take this problem seriously. Oh well...

Re:One of the cause of the latency by Anonymous Coward · 2002-11-06 20:32 · Score: 0

1. What the hell are you talking about?
2. What idiot modded this as "informative"?

Self-modifying code is bad... by Goonie · 2002-11-06 20:16 · Score: 4, Informative

This is obviously a troll, but seeing it's been moderated up I should warn the kiddies out there that this is a *bad idea*...

Self-modifying code makes pipelining, branch prediction, instruction caching (particularly on SMP systems) and a bunch of other things dangerous, and just slows down the processor as it checks for and deals with it. IIRC some architectures don't even explicitly check for it anymore and die horribly if you try it.

Aside from the fact that trying to debug self-modifying code is just asking for fscking trouble....

--

Any sufficiently advanced technology is indistinguishable from a rigged demo
--Andy Finkel (J. Klass?)

Caches old tech by Goonie · 2002-11-06 20:21 · Score: 3, Interesting

Caches have been used in mainframes and minis since 1969, when the IBM 360/85 used one for exactly the same reason that modern CPUs need cache - the low-cost memory technology of the time (magnetic cores, IIRC) was much slower than the CPU, and memory that was fast enough was expensive.

But self modifying code is fun! by MichaelCrawford · 2002-11-06 20:23 · Score: 3, Informative

We couldn't have viruses without self-modifying code. What would all the teenagers do?

Used with care, self-modifying code is a powerful and useful tool. And yes there are caching issues - most processors have separate data and code caches, so writing into code using data instructions will put the code into the wrong cache, so you have to flush it.

We couldn't have program loaders without self-modifying code!

A number of the products I wrote for the Mac back at Working Software were self-modifying code, and they did very well.

You just have to know what you're doing, that's all.

Another use for them is dynamically relinking a running program as you edit its source code. Instead of relinking and relaunching the whole program, you can just reload the last subroutine that you edited. This is done by a number of development environments, and can greatly speed up the edit-compile-debug cycle.

Re:I'm no democrat. The demos are corporate whores by Anonymous Coward · 2002-11-06 20:33 · Score: 0

Actually, the Dems are union and trial lawyer whores.

The Greens are just pushing Euro-socialism. It's a pity the left hasn't had any new ideas in the last 50 years. Hell, if it weren't for the Republicans they wouldn't have even been able to take credit for the civil rights movement.

Re:Performance tip for software on modern processo by _ph1ux_ · 2002-11-06 20:34 · Score: 3, Insightful

This is why companies like intel have whole departments dedicated to getting people to write software that is optimized for whatever new features are available on a new processor.

When I worked there - we ran the DRG Game lab - which was for getting game developers to optimize their code to take advantage of new instructions etc on the latest processors.

This made the processors look better, any game that we tested that ran better on the processors after having the code optimized was pushed out with a big marketing hoopla and Intel would say "HEY! come look at our new machines - look how great X software title runs on the latest and greatest"

But the truth is that this was pretty much all fake - rather than testing the software on the exact same boxes that differed only in the processor, the tests were done on boxes that had totally different configurations - although we never told anyone about that little detail.

Re:I'm no democrat. The demos are corporate whores by MichaelCrawford · 2002-11-06 20:42 · Score: 0, Offtopic

The main reason I don't vote democratic is that they are cowardly and do not present any kind of real opposition to the republicans.

Wasn't the vote for the DMCA unanimous? Lots of democrats voting for corporate interests. Look at the large number of democrats who voted for Bush's resolution on war in Iraq.

I feel that the democratic party has outlived its usefulness and needs to be replaced by a party that will live true to its ideals.

I see some chance that the democratic party will survive and I might vote for them again - if they lose enough membership to the greens that they decide they must change, develop some spine and put up a real opposition to the republicans.

Re:There are too many issues, and it gets too comp by jericho4.0 · 2002-11-06 20:43 · Score: 2

Very interesting AC. Get an account, please, we need more like that.

--
"A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis

Agreed by Anonymous Coward · 2002-11-06 20:51 · Score: 0

The main reason I don't vote democratic is that they are cowardly and do not present any kind of real oppositiion to the republicans.

I agree. Liberals are a bunch of pansies.

Marketing & Intel VTune Performance Analyzer by MichaelCrawford · 2002-11-06 20:53 · Score: 1

Intel VTune Performance Analyzer is an impressive code profiler, and can even profile Linux code (over the net, with the UI hosted on Windows), but Intel's marketing shows through clearly in the advice it gives you on how to optimize your program - by making use of assembly opcodes that are only available on Intel processors, and only the very latest ones at that.

I haven't tried, but I would be surprised if VTune ran on an AMD processor.

For the very fastest code, you can take advantage of special instructions, write stuff in assembly with the clever use of registers, etc. But the performance gains won't be portable.

Optimizing cache use could be considered a non-portable optimization, but it can be done directly in C or C++, and any processor most people are likely to use will use a cache. There will just be some variations in its size, the size of a cache line and stuff like that.
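
One common example of a cache optimization done directly in C is choosing the loop order when walking a 2D array. The sketch below is only an illustration: `ROWS`, `COLS` and the function names are arbitrary, and the size of any speed difference depends on the cache and line sizes of the machine. C stores arrays row by row, so the first function touches consecutive addresses while the second jumps `COLS` elements between accesses.

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Row-major walk: the inner loop moves through adjacent elements,
   so each cache line fetched is fully used before moving on. */
long sum_row_major(int grid[ROWS][COLS])
{
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Column-major walk: the inner loop strides COLS elements at a time,
   so successive accesses tend to land on different cache lines. */
long sum_col_major(int grid[ROWS][COLS])
{
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```

Both functions return the same total; only the memory access pattern differs, and the code stays portable C whichever one is faster on a given processor.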

Re:Performance tip for software on modern processo by Mike+McTernan · 2002-11-06 21:01 · Score: 1

If you use some memory, use it again right away

That type of memory is called a 'register' in the CPU. The compiler will perform the optimisation you describe using these 'registers' for you.

--
-- Mike

Calculating Latency by SailorBob · 2002-11-06 21:07 · Score: 5, Informative

From:
Ace's Guide to Memory Technology

Basically, the latency of the whole memory (From FSB to DRAM) system is equal to the sum of:

The latency between the FSB and the chipset (+/- 1 clockcycle)
The latency between the chipset and the DRAM (+/- 1 clockcycle)
The RAS to CAS latency (2-3 clocks, charging the right row)
The CAS latency (2-3 clocks, getting the right column)
1 cycle to transfer the data.
The latency to get this data back from the DRAM output buffer to the CPU (via the chipset) (+/- 2 clockcycles)

This gets you the first word (8 bytes). A good PC100 SDRAM CAS 2 will have a latency of about 9 cycles, and the next 3 cycles another 24 bytes will be ready. The PC100 SDRAM will, in this case be able to get 32 bytes in 12 cycles.

If you want to calculate the latency that CPU sees, you need to multiply the latency of the memory system with the multiplier of the CPU. So a 500 MHz (5 x 100 MHz) CPU will see 5 x 9 cycles latency. This CPU will have to wait at least 45 cycles before the information that could not be found in the L2-cache will be available in the cache.
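
The sum above can be written out as a small C sketch. The struct and function names are made up for illustration, and the 9-cycle case assumes the low end of the quoted ranges (RAS to CAS of 2 and CAS of 2): 1 + 1 + 2 + 2 + 1 + 2 = 9 bus cycles to the first word, 3 more cycles for the rest of a 4-word burst, and 5 x 9 = 45 CPU cycles at a 5x multiplier.

```c
/* Bus-clock cycles spent in each stage of one memory read, following
   the breakdown quoted above. Field names are illustrative. */
struct mem_timing {
    unsigned fsb_to_chipset;
    unsigned chipset_to_dram;
    unsigned ras_to_cas;
    unsigned cas;
    unsigned transfer;
    unsigned return_path;
};

/* Bus cycles until the first 8-byte word reaches the CPU. */
unsigned first_word_bus_cycles(const struct mem_timing *t)
{
    return t->fsb_to_chipset + t->chipset_to_dram + t->ras_to_cas
         + t->cas + t->transfer + t->return_path;
}

/* Bus cycles for a whole burst: one extra cycle per additional word.
   Assumes words >= 1. */
unsigned burst_bus_cycles(const struct mem_timing *t, unsigned words)
{
    return first_word_bus_cycles(t) + (words - 1);
}

/* The same wait expressed in CPU cycles. */
unsigned cpu_cycles(unsigned bus_cycles, unsigned multiplier)
{
    return bus_cycles * multiplier;
}
```

Plugging in CAS 3 or a higher multiplier shows how quickly the wait seen by the CPU grows even when the memory itself is unchanged.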

--

Woopty Doo Basil, what does it all mean?!

Re:Calculating Latency by vadim_t · 2002-11-07 02:28 · Score: 1

Wouldn't light speed be an issue too? I did an estimation a while ago, and at 4 GHz a clock cycle should be lost while waiting for the signal to travel over the wire. Sure it's not a huge difference if you add it to the 45 cycles above, but as CPUs get faster it'll grow.
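
As a rough check on that estimate, this C sketch computes how far light travels in a vacuum during one clock cycle; the function name is made up for the example. At 4 GHz it comes to about 7.5 cm, and electrical signals on a circuit board travel noticeably slower than light in a vacuum, so the usable distance per cycle is shorter still.

```c
/* Distance in centimetres that light travels in a vacuum during one
   clock cycle at the given frequency in hertz. */
double light_cm_per_cycle(double clock_hz)
{
    const double c_m_per_s = 299792458.0;  /* speed of light */
    return c_m_per_s / clock_hz * 100.0;
}
```

Compare that distance with the length of the traces between CPU, chipset and DIMM slots to see when a round trip starts to cost a full cycle.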

What if you don't have that many registers? by MichaelCrawford · 2002-11-06 21:08 · Score: 1

Yes, it's more efficient to keep reusing registers for the same data, but the cache can store considerably more data than the registers can.

Lame article, by Ars standards by RockyMountain · 2002-11-06 21:18 · Score: 3, Insightful

How can an article about frontside bus and memory latency entirely ignore the concept of request pipelining? Huh?

And why all that complex hand-waving about practical upper limits to burst length? He gave all kinds of secondary limiting factors, but missed the obvious one: How about the simple argument that long bursts are useless unless you have a reasonable expectation that the speculatively fetched portion of the data will be consumed. Moving lots of data fast is only useful if a substantial fraction of it is data that you care about.

(It's the same reason that there's an upper bound on the useful cache line size.)

hey! look at the bright side by tanveer1979 · 2002-11-06 21:20 · Score: 2

Latency apart, you can be assured that it will never be slashdotted.

--
My Aurora : http://www.youtube.com/watch?v=o91ZsGwJYyg
FB : https://www.facebook.com/TanveersPhotography

Re:hey! look at the bright side by commodoresloat · 2002-11-06 22:13 · Score: 2

You could slashdot it with a truck headed in the opposite direction.
Re:hey! look at the bright side by ceswiedler · 2002-11-07 05:49 · Score: 2

I think a more appropriate visualization would be a million monkeys jumping onto the car and begging for the data within.

Re:Performance tip for software on modern processo by kylef · 2002-11-06 21:24 · Score: 1

but on the other hand you won't have to wait to fill a 32-byte cache line each time you read a single item.

Please forgive my failing memory, but isn't the functional unit requesting the load notified immediately when the requested word is available from the load/store unit? Unless I am imagining things, I seem to remember this procedure:

word load is encountered in the code
all functional units requiring the result of this load wait for load/store unit to succeed
load/store unit fires off request to cache controller, which continues the request up the memory hierarchy until word is found
requested word is returned immediately down the hierarchy to the load/store unit where it is directed to the appropriate register in the register file (or the common data bus a la Tomasulo)
functional units waiting for load to finish proceed, while simultaneously the cache hierarchy loads the rest of the cache line for that word into the cache

Wouldn't it be a silly implementation that forces the load/store unit to wait for the entire cache line to be read before returning the requested word?? In other words, doesn't the memory hierarchy bring the cache lines in "in the background" while the requested data is returned to the load/store unit? And wouldn't this mean that turning the cache off doesn't solve "cache line latency" since it doesn't really exist to begin with?
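
The three possible behaviours discussed above can be compared with a toy model. The policy names follow the textbook terms "early restart" and "critical word first"; the timing model (first word after a fixed latency, one further word per cycle) and the sample numbers are assumptions for illustration, not measurements of any real CPU.

```python
def load_latency(policy, line_words, requested_index, first_word_latency):
    """Cycles until the requested word reaches the load/store unit.

    Toy model: the first word of a line fill arrives after
    first_word_latency cycles, and one more word arrives each cycle.
    """
    if not 0 <= requested_index < line_words:
        raise ValueError("requested_index must lie inside the line")
    if policy == "wait_for_line":
        # The load is released only once the whole line has arrived.
        return first_word_latency + line_words - 1
    if policy == "early_restart":
        # The line is fetched in order; the word is forwarded as it arrives.
        return first_word_latency + requested_index
    if policy == "critical_word_first":
        # The fill starts at the requested word and wraps around.
        return first_word_latency
    raise ValueError("unknown policy: " + policy)


# Hypothetical 8-word line, word 5 requested, 10 cycles to the first word.
for policy in ("wait_for_line", "early_restart", "critical_word_first"):
    print(policy, load_latency(policy, 8, 5, 10))
```

Under these assumptions the "silly implementation" costs 17 cycles, in-order forwarding 15, and critical-word-first 10, with the rest of the line arriving in the background.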

How Does Increasing FSB affect Performance? by SailorBob · 2002-11-06 21:26 · Score: 3, Informative

More From Ace's:

Athlon XP 2800+: 333 MHz FSB and nForce 2

First of all, we tested the Athlon XP 2800+ on the "normal" KT333 platform with a 17x multiplier, the FSB set at 133 MHz DDR (266 MHz) and the memory set at 166 MHz DDR (333 MHz), CAS at 2, RAS to CAS at 3, Precharge at 3. The second time, the KT333 platform (ASUS A7V333) was set at a FSB of 166 MHz (333 MHz) and the multiplier was set to 13.5x.

...

Where do I start? There is an enormous amount of info hidden in this table. Let us first start with the 266 MHz versus 333 MHz FSB discussion.

There have been many reports that show that the Athlon does not benefit much from an increase in FSB clockspeed, moving from 266 MHz to 333 MHz. But Membench tells us exactly why. First of all, compare the two KT333 latency numbers (64 byte strides). All BIOS settings were exactly the same, only the FSB speed, and thus the multiplier, are different. Normally one would expect, everything else being equal, that the Athlon with the 166 MHz FSB would see 25% lower latency, but the CPU with the 166 MHz FSB version actually sees a higher latency! This shows that the (ASUS) KT333 board, in order to guarantee proper stability, increases certain latencies of the memory controller. Memory bandwidth increases by 14%, which is also less than expected.

Now what does this mean for "real world" performance? It means that many applications will see either a very small performance increase or none at all, as it is latency and not bandwidth that is the most important performance factor. Let us explain this in more detail.
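
The expectation quoted above can be sanity-checked with simple arithmetic. Note that raising the bus clock by 25% shortens each bus cycle by 20%, so the ideal saving for a fixed number of bus cycles is a little smaller than the clock increase. The sketch assumes the nominal 133.33 MHz and 166.67 MHz clocks behind the rounded figures in the text.

```python
def cycle_time_ns(clock_mhz):
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


slow_fsb_mhz = 400.0 / 3   # nominal "133 MHz" (assumed exact value)
fast_fsb_mhz = 500.0 / 3   # nominal "166 MHz" (assumed exact value)

clock_gain = fast_fsb_mhz / slow_fsb_mhz - 1.0
cycle_drop = 1.0 - cycle_time_ns(fast_fsb_mhz) / cycle_time_ns(slow_fsb_mhz)

print("bus clock increase: %.0f%%" % (clock_gain * 100))
print("bus cycle time decrease: %.0f%%" % (cycle_drop * 100))

# Core clocks for the two multiplier settings mentioned in the text.
print("17 x 133 MHz core clock: %.0f MHz" % (17 * slow_fsb_mhz))
print("13.5 x 166 MHz core clock: %.0f MHz" % (13.5 * fast_fsb_mhz))
```

The two multiplier settings land within about 1% of each other in core clock, which is why the comparison isolates the FSB change reasonably well.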

--

Woopty Doo Basil, what does it all mean?!

Re:How Does Increasing FSB affect Performance? by EmagGeek · 2002-11-06 22:56 · Score: 3, Interesting

Now what does this mean for "real world" performance? It means that many applications will see either a very small performance increase or none at all, as it is latency and not bandwidth that is the most important performance factor. Let us explain this in more detail.
The real world scoop on this is that someone typing a document in OpenOffice or surfing the internet won't see any performance increase over a Pentium-II 233 MHz machine with 64MB of 60ns RAM. Gamers might get 46 gajillion frames per second instead of 42 gajillion frames per second, which is completely indistinguishable to humans, so they won't notice either
I, on the other hand, might see a simulation of a 5250-node electromagnetic scattering problem take 36 hours instead of 39 hours, which is quite significant. But, I would probably get the same increase in performance by going through and cleaning up my code a little. FORTRAN is funny that way...
Making computers faster to the nth power only makes code that's worse to the n+1th power :)
Re:How Does Increasing FSB affect Performance? by mcrbids · 2002-11-07 07:32 · Score: 2

Eh, your sigline has a bug in it!

It should read: Making computers faster to the nth power only makes code that's worse to the (n+1)th power :)

Otherwise, what you have is n+(1th) power...

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:How Does Increasing FSB affect Performance? by EmagGeek · 2002-11-07 08:21 · Score: 1

Oh Jesus Christ... you get the point, don't you?! :)
implicit none
integer i*4
real computer_speed
real speed_base
external adequate

do while (.true.)
call adequate(speed_base)
if (computer_speed .gt. speed_base) then
! burn off whatever speed is left over
do i = 1, 1000000000
enddo
endif
enddo

Re:Yes, you're right by MichaelCrawford · 2002-11-06 21:40 · Score: 1

All spout and no water.

Really? Have a look at my resume.

--
Request your free CD of my piano music.

Re:Performance tip for software on modern processo by AlecC · 2002-11-06 21:51 · Score: 2

Sometimes registers don't cut it. For example, you cannot pass a pointer to a register. If you have a function with several disparate return parameters, you might well pass pointers to the places to put the returns. And then, obviously, you are going to use the returned data straight away; else why did you ask for it? So you group the ints into which the values will be returned in order to take advantage of the cache if you can.

--
Consciousness is an illusion caused by an excess of self consciousness.

The bliss of SDRAM banking by gwappo · 2002-11-06 21:54 · Score: 3, Interesting

What I couldn't find in the article is that it is possible to reach the maximum rate to SDRAM, as an SDRAM chip has multiple banks - e.g. one can issue a command to one bank while still receiving data from another.

This hides the latency, since the next read command can be issued while the CAS latency of the previous one is still elapsing.

There are more details on this in the SDRAM specification (lost the URL but it's out there - I think it's Intel who wrote it though).
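
A toy cycle count shows the effect described in this post. It is only a sketch: it ignores row activation, precharge and refresh, and the CAS latency and burst length used are assumed values rather than figures from any datasheet.

```python
def cycles_without_overlap(reads, cas_latency, burst_length):
    """Each read waits out its own CAS latency before the next is issued."""
    return reads * (cas_latency + burst_length)


def cycles_with_overlap(reads, cas_latency, burst_length):
    """The next read is issued while earlier data is still streaming out,
    so only the first CAS latency is exposed."""
    if reads == 0:
        return 0
    return cas_latency + reads * burst_length


# Hypothetical: 4 reads, CAS latency 2, burst length 4.
serial = cycles_without_overlap(4, 2, 4)
overlapped = cycles_with_overlap(4, 2, 4)
print("serial:", serial, "cycles, bus busy", 16 / serial)
print("overlapped:", overlapped, "cycles, bus busy", 16 / overlapped)
```

The longer the run of overlapped reads, the closer the data bus gets to being busy every cycle, which is the "maximum rate" the post refers to.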

Re:Bandwidth - Useless Without Latency by SailorBob · 2002-11-06 22:05 · Score: 3, Informative

You're making the classic mistake of assuming that Bandwidth is what matters, when in fact it's the latency which kills applications. Having bandwidth numbers without the corresponding latency is worse than useless, it's misleading for the uninformed.

--

Woopty Doo Basil, what does it all mean?!

Read this by chthon · 2002-11-06 22:32 · Score: 2, Informative

Everyone who is interested in issues of bandwidth and latency should read this book :

Computer Architecture : A Quantitative Approach, by John L. Hennessy and David A. Patterson

Re:Who cares? by Tim+C · 2002-11-06 22:51 · Score: 4, Insightful

How many of you have found computer speed to be a major impediment to your productivity?

I have.

I write server-side Java code for a living at a web agency. Until recently, I had a P3 450 with 384meg of RAM, and it was too damn slow. I develop in JBuilder, and deploy my code using Resin (see caucho.com), and for complex sites it could take literally minutes for the server to start, then a couple more minutes per page in debug mode for the pages to be parsed and compiled. You're looking at 10-15 minutes to check to see if a one-line bugfix has worked, and hasn't had any unexpected side-effects, etc. That's 10-15 minutes of waiting, waiting, waiting, clicking a link, waiting, waiting, waiting, entering some details and hitting submit, waiting, waiting....

Extremely frustrating, especially if you're working late, especially when you know that 30-45 minutes spent going to the nearest high-street electrical retailer will buy you a machine 4 times as fast.

That's all before we even get on to the responsiveness of JBuilder...

I now have a P4 1.9GHz with 3/4gig of RAM, and the difference is incredible. JBuilder is much more responsive (feels like native code rather than Java most of the time), compile-run-test-debug cycles are much reduced, etc. This has a knock-on effect - people are generally happier and less frustrated, stress levels are lower, there's less swearing (one or two of my colleagues regularly vented steam by hitting their PCs and swearing loudly) - work is all-round more enjoyable.

The bottom line is that anything that means that I spend less time waiting for things to compile, or start up, or whatever, is a good thing. Do not make the mistake of thinking that you'll never have a use for all that power; you'll find a use. The laser, for example, sat around in research labs for years before anyone thought of anything practical to do with it. Now, practically every PC has at least one, not to mention hi-fis, DVD players, etc.

--
It's official. Most of you are morons.

Re:never trust the back of the box-confusion by mikewhittaker · 2002-11-06 22:54 · Score: 1

This must have left CDC6xxx programmers confused,
what with 18-bit address registers and 60-bit operand registers, and 6-bit characters ...

Re:There are too many issues, and it gets too comp by brunes69 · 2002-11-06 23:41 · Score: 3, Funny

Wow it sure is nice,

To be able to read a 2 page long comment.

Especially when it would only normally be a small paragraph.

Except that the author thought that it wasn't long enough.

So they typed it like this,

And made everyone hate them.

What kind of bandwidth and latency? by hellstorm · 2002-11-06 23:55 · Score: 1

It could be useful for the people who can't dedicate much time to read /. to state clearly at the headline what the article is really talking about. opcode latency? memory latency? network latency?

--
--------------------------------------------------
Programming is good for health

Re:There are too many issues, and it gets too comp by Anonymous Coward · 2002-11-07 00:15 · Score: 2, Funny

Considering each line is a different idea you fool, you would have a 10 page article if each line was expanded into an eloquent paragraph. Additionally, with requisite sentences crafted at the beginning and end of each paragraph over 30% would become filler. If you understand the principles of reading, you will know based on psychological testing of comprehension and legibility speed that horizontal sentences with whitespace above and below are rapidly read in 8 word clusters by people with high IQs. I bet you are one of those anal fools that used text mode white-on-black background fixed point MS-DOS text through the 1980s and 1990s while every one else went macintosh style modern fontography and legibility. In reality you are a closet mac bigot and hate yourself for not knowing anything concrete to criticize except poking fun at the extra linefeed characters to separate the countless separate topics in the post. Did you ever think for a moment that perhaps the extra line feeds were placed there DELIBERATELY just to provoke people similar to yourself? Well, it's probably the case, so the intention and effort reached their mark well. Long live AGIT-PROP!

Re:Who cares? by dincubus · 2002-11-07 00:20 · Score: 1

well i for one would like to be able to keep up on the newest and fastest. this is because of the types of courses and types of work i will have to do for my homework. i am going to be doing a lot of CAD work among other things. and we all know that is pretty memory and cpu intensive. so a faster bus is good... to quote campchaos.com (infamous metallica vs napster shockwave animations) "fast good.. sloow bad"

--
a wise man once said "two wrongs dont make a right, but three rights do make a left" and that wise man was gallagher

Old sayings... by fitten · 2002-11-07 02:02 · Score: 1

Old sayings about bandwidth:

- Never underestimate the bandwidth of a station wagon filled with magnetic tape.
- What's the fastest way to get 1 TB of data from LA to NYC? FedEx.

We can translate that to modern terms but the idea is the same. Just because bandwidth is high, doesn't mean that the latency is low.
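
To put a number on the FedEx saying, here is a small sketch. The 24-hour transit time is an assumption for illustration, and 1 TB is taken as 10^12 bytes.

```python
def shipment_bandwidth_mbit_per_s(payload_bytes, transit_hours):
    """Average bandwidth of physically shipping data, in Mbit/s."""
    seconds = transit_hours * 3600
    return payload_bytes * 8 / seconds / 1e6


payload_bytes = 10**12      # 1 TB, decimal
transit_hours = 24          # assumed overnight delivery

rate = shipment_bandwidth_mbit_per_s(payload_bytes, transit_hours)
print("bandwidth: %.1f Mbit/s" % rate)
print("latency to the first byte: %d hours" % transit_hours)
```

That works out to roughly 92.6 Mbit/s sustained, yet not a single byte arrives until a full day has passed: high bandwidth, terrible latency.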

Re:never trust the back of the box-confusion by fitten · 2002-11-07 02:08 · Score: 1

...or PDP-8 =)

IIRC... The i386SX had a 16-bit external data bus. The Motorola 68000 had a 16-bit external data bus as well (and 32-bit registers). The 68008 was a 68000 with an 8-bit external data bus. IIRC, the HP48 family of processors were 64-bit internally but 4-bit external data bus (to save power and space).

So old it clunks by panurge · 2002-11-07 02:15 · Score: 1

This article could have been summarised in about 400 words, in fact I would do it myself if I hadn't got a deadline to meet. This is old, old stuff. So here comes a boring oldtimer bit of information.

In the distant past, embedded systems used eproms that were rather slow, so memory access needed several wait states - the author doesn't seem to know this ancient term - while the eprom went "duh, that's address #F0F0, better go back in the stores and find the data". So as soon as fast RAM was cheap enough we would load the eprom contents into RAM at power up (or at least the frequently accessed bits) and then run from RAM where no wait states were needed. This was usually a 50% performance boost without changing the processor

And there you have it. Substitute L1 cache for fast RAM and dram for eprom, and despite the fanciness of the modern technology, and the enormously bigger memory space, nothing has really changed.
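
The same trade-off is usually written as an average access time: the fast memory's access time plus the miss rate times the penalty for going to the slow memory. The sketch below uses that standard textbook formula; the cycle counts and miss rate are invented for illustration.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average cycles per access for fast memory backed by slow memory."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical: 1-cycle fast memory, 100 extra cycles to reach slow memory.
print("2% of accesses miss:", average_access_time(1, 0.02, 100))
print("every access misses:", average_access_time(1, 1.0, 100))
```

Copying the EPROM into RAM at power-up is the limiting case where the miss rate for the copied region drops to zero.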

--
Panurge has posted for the last time. Thanks for the positive moderations.

My favorite old saying by aridg · 2002-11-07 02:25 · Score: 1

There are a few different versions of this (google comp.arch to see some examples):

"Money can buy bandwidth, but latency is given by God"

(You can always increase bandwidth by adding more bits, but the speed of light is fixed...)
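
The speed-of-light floor is easy to quantify. The sketch below uses the vacuum speed of light, so real signals in copper or fibre are slower than this bound; the example distance and clock rate are arbitrary.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def min_one_way_latency_ns(distance_m):
    """Lower bound on signal travel time over distance_m, in nanoseconds."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S * 1e9


def distance_per_cycle_cm(clock_hz):
    """How far light travels in vacuum during one clock cycle, in cm."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz * 100


print("10 cm trace: at least %.2f ns one way" % min_one_way_latency_ns(0.10))
print("1 GHz clock: light covers %.1f cm per cycle" % distance_per_cycle_cm(1e9))
```

At 1 GHz a signal cannot cross more than about 30 cm per cycle even in the ideal case, and no amount of extra bus width changes that.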

PCI-X by alder · 2002-11-07 02:33 · Score: 1

The 4 Ultra-320 SCSI products Adaptec listed:

All are PCI-X, i.e. 64-bit 133 MHz PCI, which, according to the chart, sustains 1.06 GB/s. IMHO it (the PCI-X) still has some bandwidth to spare after an Ultra-320 board is in...
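
The chart figure follows from width times clock. A quick check, assuming one transfer per clock, decimal megabytes, and peak rates with no protocol overhead:

```python
def bus_peak_mb_per_s(width_bits, clock_mhz):
    """Peak transfer rate of a parallel bus moving one word per clock."""
    return width_bits / 8 * clock_mhz


pci_x = bus_peak_mb_per_s(64, 133)   # 64-bit, 133 MHz PCI-X
plain_pci = bus_peak_mb_per_s(32, 33)  # 32-bit, 33 MHz PCI
ultra320 = 320.0                      # Ultra-320 SCSI, MB/s

print("PCI-X peak:", pci_x, "MB/s")
print("plain PCI peak:", plain_pci, "MB/s")
print("PCI-X headroom after one Ultra-320 channel:", pci_x - ultra320, "MB/s")
```

So a single Ultra-320 channel uses under a third of the PCI-X peak, while it would saturate a plain 32-bit 33 MHz slot more than twice over.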

Re:Who cares? by Anonymous Coward · 2002-11-07 02:45 · Score: 0

Is that true on modern pipelined processors? Sustained floating point performance for simple operations (addition, multiplication etc) should be pretty close to one operation per cycle per functional unit. Transcendental functions are slower of course. But still, I would expect memory to be the bottleneck.

latency measurement by Anonymous Coward · 2002-11-07 03:02 · Score: 1, Informative

Use lmbench: www.bitmover.com/lmbench
It will tell you what the real latency of your
system is, including the cycles inside the CPU
before the processor drives the request on the bus.

split transaction busses by Anonymous Coward · 2002-11-07 03:05 · Score: 0

It really is too bad he didn't explain split transaction busses: processors can have multiple outstanding requests on the bus, which allows the memory subsystem to return requests more efficiently.

Processors also generally have split address/data busses, so the processor can start a new request while the data is being returned.

Re:Performance tip for software on modern processo by photon317 · 2002-11-07 03:05 · Score: 3, Funny

The linux kernel guys pay attention to these things and code for them by hand. Hence their badass performance :)

--
11*43+456^2

Wouldn't that be a `smashdotting'? by leonbrooks · 2002-11-07 03:24 · Score: 2

Sorry, couldn't resist that pun.

--
Got time? Spend some of it coding or testing

B.G.A.T. ****TROLL ALERT**** by Anonymous Coward · 2002-11-07 03:27 · Score: 0

B.G.A.T. (Billy Goats Against Trolls) is proud to announce that SexyKellyOsbourne has made our most wanted list. Normally it is pretty hard for us to prove our case against such people. But Ms. Osbourne has taken special care to ensure that the world knows she is a troll. Example #1: right from her own journal. As much as B.G.A.T. would like to take credit for this, it does all come right from the troll's mouth! That one wasn't enough to convince you? How about this one? And then there is this one. She has also taken a moment to tell us something about herself. A quick glance at her posting history tells it all. Here is one of my favorites. So please take this time to spend just one mod point to keep this genital wart on society out of sight. MOD HER DOWN AS A TROLL!!!! Not because I said so, but remember she is a self-confessed troll.

Re:Who cares? by YetAnotherDave · 2002-11-07 03:29 · Score: 1

me too

the guys with the newest computers always kick my ass at counterstrike over lunch break,
and that just ruins my productivity for the afternoon.

Re:There are too many issues, and it gets too comp by leonbrooks · 2002-11-07 03:34 · Score: 3, Informative

In 2002 no linux with any normal tweak allows a user task to hold and lock 1.5GB of real ram, it's all virtual or fake.

False for not-pretend 64-bit architectures (e.g. UltraSparc) and has been for years.

--
Got time? Spend some of it coding or testing

Re:There are too many issues, and it gets too comp by leonbrooks · 2002-11-07 03:43 · Score: 2

fans that silently die and expensive fans doing the same

Really sad thing is: we could get by without those fans at all. Run the CPUs a few percent slower and use non-power-hog architectures, put a real heatsink on the PSU instead of toys and a blower, put multiple heads in the drives instead of spinning them faster (or better still, install more RAM so the disk gets hit less often).

And seal the case up completely. No corrosion problems - with optical connections and batteries (machine consuming a fraction of the power that your desktop P4 does) you could in theory take your computer swimming. What would you call a mouse that operates in water? An eel?

And how about `level 5' cache: buckets of slower, low-power, low-cost RAM for swapping, temp files, disk cache etc?

--
Got time? Spend some of it coding or testing

Re:There are too many issues, and it gets too comp by Anonymous Coward · 2002-11-07 03:52 · Score: 0

Macs are VERY slow compared to x86 processors for general purpose code. Get over it. Look at the SPEC 2k scores for the AMD and Intel processors. Hmm...Apple doesn't submit SPEC scores...gee I wonder why? In reality, PPC processors are about 2 years behind AMD and Intel in general purpose processor performance.

The only reason the PPC does well on RC5 is that RC5 is completely dependent on dependent chains of Rotate instructions. PPC happens to include a SIMD version of this instruction and is able to do a bunch in parallel as this algorithm is embarassingly parallel. This instruction in completely useless in most other data parallel applications, so AMD and Intel decided it wasn't worth it to include a SIMD version of this instruction (Parallel shifters cost power and area)

All of the data and code for this algorithm fit in even the smallest caches (like P4's 8k L1).

Re:This is more important to modern game optimizat by ivrcti · 2002-11-07 04:45 · Score: 1

I play Conquerors nearly every day. I love it. Thanks for your hard work. Keep it up !!

Re:There are too many issues, and it gets too comp by Jay+Carlson · 2002-11-07 04:47 · Score: 5, Interesting

Here we go again. I really don't have all day to poke holes in this, and because I'm actually trying to cite and verify I'm going to completely miss the moderation window, and lose readership. While some of the claims are correct, don't assume I agree with any of them just because I didn't refute.

A good PCI-X capable Fiber Channel card on a mac [...]

There are no Macs that support PCI-X. I am therefore suspicious of the numbers you claim for this configuration.

Next, RC5. The rant here seems similar to another Anonymous Coward post back here; I'm not going to copy in my response again; quick summary: I didn't buy my computer to run RC5 really fast, and neither did you.

Cold memory random read and write is FASTER on macs than DDR machines as seen in benchmarks but this author does hit upon that topic indirectly a little. Even if macs in Feb 2002 were faster than AMD for scattered random read and write, the current 3 desktop macs all use DDR ram now so probably lack speed boost for that action, but do have write aggregate (combined writes) across pci bus and other tricks.

This paragraph is confused. Yes, "cold start" memory latency is very important for many tasks, and is often overlooked. But how can the first sentence be true when many Macs are DDR machines? And where are these benchmarks? I just went looking for DDR Mac latency scores and couldn't find anything. Does anyone have lmbench memory latency numbers for the Xserve or the current PowerMacs? Oh, and write combining is hardly a Mac trick.

The hidden "backside only" cache of Pentium 4, and older macs, is the reason you could only have one cpu.

Incorrect. You just need a cache coherency protocol between your processors. "Backside" has nothing to do with it. For example, the dual-processor Pentium III box I'm typing this on has "backside" cache on each processor; it's just hidden inside the CPU packaging rather than brought out to extra pins to connect to an external cache.

There is no "PACK(1)" pragma for c structures on a mac.

#include <stdio.h>
#include <stddef.h>

struct foo { char c; int i; } __attribute__ ((packed));

int main(void)
{
    /* byte offset of i inside the packed struct */
    printf("%d\n", (int)offsetof(struct foo, i));
    return 0;
}

happily returns "1" on 10.2. In fact, if i doesn't cross a double-word boundary, there is no penalty for use on later CPUs. Yes, I just verified this on the G4 downstairs.

And RAM? Don't make me laugh! Try to find an AMD board that takes 4 gigabytes of RAM and USES it as fast as the fastest AMD can. every tweaker site says you can only use one 512MB part and have a max of 512MB.

Although you can't get the absolute, topped out single-CPU performance with it, dual-CPU boards like the Tyan ThunderK7Xpro support up to 4G of registered PC2100 RAM now; these boxes still comfortably beat current top-end G4s at tasks like SPEC CPU2000. If you really want a lot of memory you'll have to get a box from a major vendor; the Dell PowerEdge 6650 comes to mind as a 16G machine. Unfortunately, there aren't any AMD boxes out there like this that I know of, but Hammer will change that.

In 2002 no linux with any normal tweak allows a user task to hold and lock 1.5GB of real ram, it's all virtual or fake.

Get an Alpha. Although I have no direct experience with this, reliable sources claim you've been able to go past the 32-bit 4G address space limit for several years.

thankfully apple is migrating to 40 bit address space physically soon in august with the new lightweight Power4.

Why wait? Apple isn't the only vendor out there.

Re:Performance tip for software on modern processo by sql*kitten · 2002-11-07 05:07 · Score: 3, Informative

Much software is not written to take advantage of the architecture of modern microprocessors. If you rewrite some of your software to take advantage of them, then it is not hard to double your speed.

It's worse than you think on PCs (whatever OS they're running). The article talks about "bus mastering" and "data tenure", but on real workstation-class hardware there is no bus (not even one with a "north bridge"); there's a proper switch, like Crossbow or GigaPlane. These give you point-to-point, non-blocking sustained peak I/O. On a switched system, if components A and B want to communicate they can do so at the switch's full speed, and so can components C and D, no contention at all. That means no wasted cycles for the bus to constantly change ownership.

If you're doing a job that requires heavy use of the "bus" on an x86 system (lots of storage I/O, lots of random memory access hence lots of L2 misses), then optimizing code for cache locality is the least of your problems, you'll never get around the fact that the inefficient design of the hardware itself is the bottleneck. Fancy FSBs and the like are just workarounds and don't address the real problem.

Re:There are too many issues, and it gets too comp by greed · 2002-11-07 05:24 · Score: 1

There is no "PACK(1)" pragma for c structures on a mac. [...] there is no penalty for use on later CPUs [...]

But on the G3, misaligned loads are corrected by software after taking a SIGBUS. So you really, really, really, really, really, really, really, really don't want to pack structures; if you want to save space, rearrange them instead.

(I don't know how the G4 feels about misaligned loads--the only other PowerPCs I can get to are IBM POWER4.)
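
The "rearrange them instead" advice can be seen without a C compiler by letting Python's ctypes apply the platform's native layout rules. This is only a sketch: the struct names and fields are made up, and the exact sizes depend on the platform's int size and alignment.

```python
import ctypes


class Scattered(ctypes.Structure):
    # char, int, char: padding is inserted before and after the int
    _fields_ = [("a", ctypes.c_char),
                ("i", ctypes.c_int),
                ("b", ctypes.c_char)]


class Rearranged(ctypes.Structure):
    # int first, then the two chars side by side: less padding,
    # and the int stays naturally aligned
    _fields_ = [("i", ctypes.c_int),
                ("a", ctypes.c_char),
                ("b", ctypes.c_char)]


int_alignment = ctypes.alignment(ctypes.c_int)
for layout in (Scattered, Rearranged):
    print(layout.__name__,
          "size:", ctypes.sizeof(layout),
          "offset of i:", layout.i.offset,
          "aligned:", layout.i.offset % int_alignment == 0)
```

On a typical platform with 4-byte ints the rearranged struct is smaller than the original while every member remains aligned, so no misaligned-load penalty is ever paid.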

Re:There are too many issues, and it gets too comp by Anonymous Coward · 2002-11-07 05:39 · Score: 0

Thanks man, that guy is a raving looney with just enough knowledge to fool people who are a step beyond buzzwords in their understanding of the domain. Your post needs to be modded up.

Re:Who cares? by Archangel+Michael · 2002-11-07 05:51 · Score: 2

The real difference between a faster computer and a slower one isn't really speed for the things people do, it is whether or not you actually do it.

Say something takes 7 hours on one computer (like a complex database report) but 2 minutes on another: you are much less likely to run it on the slower one, so you just don't do it.

On the other hand, if the difference is small, 5 seconds vs 10 seconds, the chances of the task getting done are not much greater on the faster computer.

The difference a fast computer makes is not really speed, but rather whether or not you do that task.

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.

Re:I'm no democrat. The demos are corporate whores by Lars+T. · 2002-11-07 06:46 · Score: 2

While Conservatives haven't had a new idea for as long as they can remember. That's the whole point ;-)

--

Lars T.

To the guy who modded me down from perfect to terrible Karma - Apple haters still suck

Very True by Kommet · 2002-11-07 07:05 · Score: 1

The graphs should have been discrete! Then again, maybe some of them WERE discrete, and the discrete data points blurred together.

Yeah, I'm just hammering on my "they used extremely large graph ranges" point again. Seriously, though, there is a hell of a difference between the 0- to 20-word graph on page 3 and the 0- to 400-word graphs on pages 4 and 5.

The 0- to 400-word graph on the last page doesn't even properly illustrate the difference between DDR and SDR because it uses such a huge range and totally ignores the smaller end of the spectrum (the more realistic situation) where DDR beats SDR by a wide margin.

Re:There are too many issues, and it gets too comp by Anonymous Coward · 2002-11-07 08:32 · Score: 0

You are the looney, by mistakenly thinking Jay was correct on any of his criticisms. Read the post below. Unless you are actually "Jay Carlson" stroking himself with praise, which based on the lack of facts in Jay's weak rebuttal would no doubt not be too hard a stretch to imagine.

counter-criticism to keep in mind (duplicate):
============

You (Jay Carlson) are wrong about your criticisms. A standard normal mac gets over 20,000 512 byte read IOs per second through SCSI Manager 4.3 and a decent Fibre Channel card from a variety of vendors, including Astera Technologies, and even probably ATTO. That's only 10 MEGABYTES per second.

10 MEGABYTES!

What idiot mac-bigot thinks you need PCI-X for merely 10 megabytes per second to function and doubts the figure? Why do you say "There are no Macs that support PCI-X. I am therefore suspicious of the numbers you claim for this configuration."??? Is it because you know NOTHING about PCI-X cards? PCI-X cards are backwards compatible with PCI slots and work in 32 and 64 bit slots and at speeds from 33 to 133 MHz inclusive. They would sell almost none if this was not the design goal.

I find it hilarious that you tried to find anything wrong with the post. In fact that speed is for ONE CHANNEL on a card. Most of the expensive 2,000 dollar cards have TWO channels on them. I think you really know very little about computers or at least about high end laser IO technologies.

Let's look at your other criticisms of the post. You stated "And where are these benchmarks?" The random read and write benchmarks are very common, and you can write your own; it's quite easy. You merely write a loop with IO data dependencies at various points and touch no more than 4 bytes from each cacheline and try to avoid hitting the same 4K page area for a while during the test (hop through memory rapidly adding values). All the standard portable memory test programs that measure mobo latency for SIMULTANEOUS read and writes show the non-DDR macs as faster than AMD. Why do you feel the need to complain about it? Wasn't there a HUGE article at the top of this thread that tried to teach you about latency issues when biasing a system for stream speed access?
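
The kind of loop described here is usually called a pointer chase: each load's address depends on the previous load's result, so the accesses cannot be overlapped. The sketch below only shows the structure; in Python the interpreter overhead swamps any memory latency, so a real measurement would need a compiled language with each slot padded out to its own cache line. All names are hypothetical.

```python
import random


def build_chase(slots, seed=0):
    """Return a table where following next_index from slot 0 visits every
    slot exactly once before returning to 0 (Sattolo's algorithm, which
    always produces a single cycle)."""
    rng = random.Random(seed)
    next_index = list(range(slots))
    for i in range(slots - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, steps):
    """Follow the chain; every step depends on the one before it."""
    position = 0
    for _ in range(steps):
        position = next_index[position]
    return position


table = build_chase(1024)
print("back at start after one full lap:", chase(table, len(table)) == 0)
```

Timing `chase` over a table much larger than the cache, then dividing by the step count, is the basic idea behind latency tools such as lmbench.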

True backside cache originally meant "no hope of cache coherency" by the way. It's what is meant by that phrase. If you want to pretend that there is no difference between all categories of L2 on-chip cache, then why even define L1? Sometimes they both run at the same clock speed. You need a new term for non-MP L2 cache in your view. The world already created a term.... it's called BACKSIDE CACHE. Deal with it.

That's why you have to have a PIII and not a P4.... its L2 cache is hidden and cannot be retrofitted with any technology aftermarket to "just need a cache coherency protocol between your processors". Your words make me laugh.

At this point I hardly think you are qualified to criticize the post. I think you are an envious Mac-Bigot, but I will address the rest of your pathetic statements as well.

You said "there is no penalty for use on later CPUs." Bwaaa ha haaaaa!!!! You must be crazy! Misaligned integer accesses are very bad on PowerPC. Worse yet... the article is talking about MACINTOSH, not unix OS on a mac such as Mac OS X. Google clicks show millions of accesses per day by macs and 90% run OS 9 and older "classic mac". Only 10% run FreeBSD-based OS X. OS X is NOT the MAC, and has nothing to do with the mac other than that nowadays it runs on some of its models, but so does OS 9. Compilers such as Metrowerks and from Symantec lacked the ability to type and use Pack(1) from the beginning of the mac to just last year. Just because GNU supports crap like that has nothing to do with the mac programming world. Why you tried to find another mistake and failed is hilarious. Go ahead, misalign a few integers in some benchmark code and watch the speed plummet. IT'S SOMETIMES handled in software as an exception! And when it's in hardware it's still much slower.

You tell people to buy Alpha and Hammer to get more fast RAM and don't realize that the article explains that solutions like Itanium and XEON are rich man solutions, and that running shrink wrapped software is desirable too. How much shrinkwrapped software runs on Alpha boards?

Why do you even try to comment negatively.?

You type "Why wait? Apple isn't the only vendor out there." regarding 40 bit physical addressing. You cannot use a Pentium 4, or a AMD "non-hammer" for 40 bits, so your suggestion is to run an expensive server box that is unsuited for running most software. At least with Apple, one day machines will have >32 bit physical addresses and still run countless software packages. Apple in its version of UNIX is already 64 bit clean for all its IO all the way up through the kernel, as well as prepared on the PCI DMA side, and the user task scheduler side.

I can't believe your pathetic post got +3 with not one factual statement whereas the original post you tried to criticize had over 300 factual statements.

Obviously anti-mac bigotry runs strongly on slashdot today. Unless all your "fans" are yourself Jay Carlson.

I think you tried to run RC5 on non-PowerPC systems and got grumpy or envious at the mac performance and think that there is no reason to ever shift bits in a register!?!? Even the 32 bit shift (not 64) on a pentium 4 is horrible, but I suspect you are a fan of AMD instead... or a disgruntled stockholder.

As it stands you found NOT ONE mistake in the post you are trying to slam. Though you do point out that Alpha based computers might be available and might be affordable that accept more than 4 GB of RAM.... but I doubt they are as affordable as you think they are.

Re: your sig by Anonymous Coward · 2002-11-07 08:33 · Score: 0

Danny I'm killing my eyes Killing My Danny KILL DANNY

Re:There are too many issues, and it gets too comp by Anonymous Coward · 2002-11-07 08:40 · Score: 0

Mod this rebuttal down, +4 is ridiculous for a post with not one factual statement in it. What happened to moderation here? Just because someone puts their foot in their mouth you mod it UPWARDS? I understand the irony of it, it does teach them humility and shame, but at the same time it inverts the entire purpose of moderation. You don't mod a post full of 10 misstatements like Jay's UPWARDS, just to remind him of his mistakes, you mod it downwards.

Sheeesh.

20,000 IOs a second at 512 bytes each is only about 9.8 MEGABYTES per second (under 10 MB), and is therefore achievable in any PCI slot and does not need PCI-X speeds. (PCI-X is backward compatible.) Jay was wrong about all his other 10 points as well, so modding it to +4 (3:30 pm E.S.T.) is foolish, though the irony is not lost on me.

Re:JAY IS WRONG (why did his mistatements get +5? by Anonymous Coward · 2002-11-07 08:55 · Score: 0

JAY IS INSANE (why did his misstatements get +5? 4pm Eastern time).

For example, here is a copy of a rebuttal posted well before he was modded to 4 or 5:

=====

You (Jay Carlson) are wrong about your criticisms. A standard normal Mac gets over 20,000 512 byte read IOs per second through SCSI Manager 4.3 and a decent Fibre Channel card from a variety of vendors, including Astera Technologies, and even probably ATTO. That's only 10 MEGABYTES per second.

10 MEGABYTES!
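
The arithmetic behind that figure can be checked in a few lines. This is only a sketch: the function names are made up for illustration, and the 133 MB/s bus figure used in the usage note is the theoretical peak of classic 32-bit/33 MHz PCI, which is an assumption brought in for comparison and not a number from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second moved by `ios_per_sec` transfers of `bytes_per_io` each. */
uint64_t throughput_bytes_per_sec(uint64_t ios_per_sec, uint64_t bytes_per_io)
{
    return ios_per_sec * bytes_per_io;
}

/* Fraction (0.0 to 1.0) of a bus's peak rate that a given throughput uses. */
double bus_fraction(uint64_t bytes_per_sec, uint64_t bus_peak_bytes_per_sec)
{
    assert(bus_peak_bytes_per_sec != 0);
    return (double)bytes_per_sec / (double)bus_peak_bytes_per_sec;
}
```

With 20,000 IOs of 512 bytes, `throughput_bytes_per_sec` gives 10,240,000 bytes per second, and `bus_fraction(10240000, 133000000)` comes out below 8% of that assumed PCI peak.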

What idiot mac-bigot thinks you need PCI-X for merely 10 megabytes per second to function and doubts the figure? Why do you say "There are no Macs that support PCI-X. I am therefore suspicious of the numbers you claim for this configuration."??? Is it because you know NOTHING about PCI-X cards? PCI-X cards are backwards compatible with PCI slots and work in 32 and 64 bit slots and at speeds from 33 to 133 MHz inclusive. They would sell almost none if this was not the design goal.

I find it hilarious that you tried to find anything wrong with the post. In fact that speed is for ONE CHANNEL on a card. Most of the expensive 2,000 dollar cards have TWO channels on them. I think you really know very little about computers, or at least about high end laser IO technologies.

Let's look at your other criticisms of the post. You stated "And where are these benchmarks?" The random read and write benchmarks are very common, and you can write your own; it's quite easy. You merely write a loop with IO data dependencies at various points, touch no more than 4 bytes from each cacheline, and try to avoid hitting the same 4K page area for a while during the test (hop through memory rapidly, adding values). All the standard portable memory test programs that measure mobo latency for SIMULTANEOUS reads and writes show the non-DDR Macs as faster than AMD. Why do you feel the need to complain about it? Wasn't there a HUGE article at the top of this thread that tried to teach you about latency issues when biasing a system for stream speed access?
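
A loop of the kind described might look roughly like the sketch below. It is an illustration, not the benchmark the poster ran: the 64-byte line size, the struct layout, and the names `node_t`, `build_chain` and `chase` are all assumptions, and it touches one `size_t` per line where the post says 4 bytes. The nodes are linked into a single random cycle so that every load depends on the previous one and the hardware cannot overlap them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64  /* assumed cache-line size; adjust for the target CPU */

/* One node per cache line: only the `next` index is ever touched. */
typedef struct {
    size_t next;
    unsigned char pad[LINE_BYTES - sizeof(size_t)];
} node_t;

/* Small linear congruential generator, so results do not depend on rand(). */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Link all n nodes into ONE random cycle (Sattolo's algorithm), so a walk
 * visits every line once before repeating and hops around memory. */
void build_chain(node_t *nodes, size_t n, uint32_t seed)
{
    assert(n >= 2);
    for (size_t i = 0; i < n; i++)
        nodes[i].next = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (lcg_next(&seed) >> 8) % i;  /* j in [0, i-1] */
        size_t tmp = nodes[i].next;
        nodes[i].next = nodes[j].next;
        nodes[j].next = tmp;
    }
}

/* Follow the chain: each load's address comes from the previous load,
 * so the time per step approximates memory latency, not bandwidth. */
size_t chase(const node_t *nodes, size_t start, size_t steps)
{
    size_t i = start;
    while (steps--)
        i = nodes[i].next;
    return i;
}
```

To turn this into a measurement, allocate far more nodes than fit in cache, time a long `chase` call with any clock you trust, and divide by the step count. Using the returned index afterwards keeps the compiler from discarding the loop.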

True backside cache originally meant "no hope of cache coherency", by the way. It's what is meant by that phrase. If you want to pretend that there is no difference between all categories of L2 on-chip cache, then why even define L1? Sometimes they both run at the same clock speed. You need a new term for non-MP L2 cache in your view. The world already created a term.... it's called BACKSIDE CACHE. Deal with it.

That's why you have to have a PIII and not a P4.... its L2 cache is hidden and cannot be retrofitted aftermarket with any technology to "just need a cache coherency protocol between your processors". Your words make me laugh.

At this point I hardly think you are qualified to criticize the post. I think you are an envious Mac-Bigot, but I will address the rest of your pathetic statements as well.

You said "there is no penalty for use on later CPUs." Bwaaa ha haaaaa!!!! You must be crazy! Misaligned integer accesses are very bad on PowerPC. Worse yet... the article is talking about MACINTOSH, not a unix OS on a Mac such as Mac OS X. Google clicks show millions of accesses per day by Macs, and 90% run OS 9 and older "classic Mac". Only 10% run FreeBSD-based OS X. OS X is NOT the MAC, and has nothing to do with the Mac other than that nowadays it runs on some of its models, but so does OS 9. Compilers such as Metrowerks' and Symantec's lacked the ability to use Pack(1) from the beginning of the Mac until just last year. Just because GNU supports crap like that has nothing to do with the Mac programming world. Why you tried to find another mistake and failed is hilarious. Go ahead, misalign a few integers in some benchmark code and watch the speed plummet. IT'S SOMETIMES handled in software as an exception! And when it's in hardware it's still much slower.
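
For anyone who wants to try the experiment, here is a minimal sketch of how a packed layout produces a misaligned integer in the first place. The struct and function names are hypothetical, and `#pragma pack(push, 1)` is assumed to be available (GCC, Clang and MSVC accept it). The sketch makes no timing claim; it only shows where the misalignment comes from and one portable way to read such a field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Natural layout: the compiler pads after `tag` so `value` stays aligned. */
struct natural_rec {
    uint8_t  tag;
    uint32_t value;
};

/* Packed layout: no padding, so `value` lands at offset 1 (misaligned). */
#pragma pack(push, 1)
struct packed_rec {
    uint8_t  tag;
    uint32_t value;
};
#pragma pack(pop)

/* Nonzero when `p` is a multiple of `align`, which must be a power of two. */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Portable load from a possibly misaligned address: memcpy lets the
 * compiler pick whatever instruction sequence is safe on the target. */
uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Timing a loop of `load_u32` calls over aligned addresses against one over addresses offset by a byte is the kind of test the post suggests. How large the gap is depends entirely on the CPU.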

You tell people to buy Alpha and Hammer to get more fast RAM and don't realize that the article explains that solutions like Itanium and XEON are rich man solutions, and that running shrink wrapped software is desirable too. How much shrinkwrapped software runs on Alpha boards?

Why do you even try to comment negatively?

You type "Why wait? Apple isn't the only vendor out there." regarding 40 bit physical addressing. You cannot use a Pentium 4, or an AMD "non-hammer", for 40 bits, so your suggestion is to run an expensive server box that is unsuited for running most software. At least with Apple, one day machines will have >32 bit physical addresses and still run countless software packages. Apple in its version of UNIX is already 64 bit clean for all its IO all the way up through the kernel, as well as prepared on the PCI DMA side and the user task scheduler side.

I can't believe your pathetic post got +3 with not one factual statement whereas the original post you tried to criticize had over 300 factual statements.

Obviously anti-mac bigotry runs strongly on slashdot today. Unless all your "fans" are yourself Jay Carlson.

I think you tried to run RC5 on non-PowerPC systems, got grumpy or envious at the Mac performance, and now think that there is no reason to ever shift bits in a register!?!? Even the 32 bit shift (not 64) on a Pentium 4 is horrible, but I suspect you are a fan of AMD instead... or a disgruntled stockholder.

As it stands you found NOT ONE mistake in the post you are trying to slam. Though you do point out that Alpha based computers might be available and might be affordable that accept more than 4 GB of RAM.... but I doubt they are as affordable as you think they are.

====

I can see the Mac-bigots are swarming today, modding up 10 mistakes by Jay to a +5 post. Ugh.

Re:Bandwidth - Useless Without Latency by 0x0d0a · 2002-11-07 09:04 · Score: 2

Bandwidth is what matters, when in fact it's the latency which kills applications

Surely you're talking about something different from the article poster, who was referring to the causes of an entirely different (and uncommon) metric: "bandwidth latency". ;-)

--
May we never see th

Re:There are too many issues, and it gets too comp by ca1v1n · 2002-11-07 09:53 · Score: 2

In most operating systems with memory protection, ALL the memory is virtual, from the standpoint of a userland program. It's up to the OS, not the hardware, to decide whether that virtual memory address is going to be mapped onto RAM or disk.

--
WARNING: there is a trojan on your

Re:Performance tip for software on modern processo by Anonymous Coward · 2002-11-07 10:26 · Score: 0

In a world full of badly written, poorly documented, and barely legible spaghetti crap code, this is bad advice. Human effort is better directed elsewhere in virtually all cases.

Write your code to be clear. Write your code to work correctly. Let the compiler worry about low-level optimizations; that's what it's for. Much of the execution time (like 80%) of a typical application will be spent in system calls, where tweaks like this will have no effect anyway.

If you do find yourself in a bind for speed, most often the big gains come from better algorithms, not from twisting your structures into bizarre shapes or otherwise trying to force your own "optimizations" down the compiler's throat by outwitting it.

There are very, very few cases where this kind of tweaking is justified, and if you find yourself thinking about it, think again.

Why the compiler CAN help you by gwappo · 2002-11-07 21:46 · Score: 1

There's an opportunity here for compilers to exploit which - afaik - they don't as of yet.

Modern chips have PRELOAD instructions, which allow you to specify an address for which the CPU will attempt to preload the cacheline in advance.

Graphics drivers typically exploit this when accessing vertex buffers and whatnot. The Preload instruction will go out and fetch the data in parallel to executing what it already has in cache.

For a compiler to do this, it'd need to recognize *p++ sequences in for loops and insert an additional PRELOAD instruction after executing the increment.

Do-able, maybe Intel's compiler does it, or maybe VectorC - does anyone know?
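
Done by hand, the transformation described above might look like the sketch below. It assumes GCC or Clang, whose `__builtin_prefetch` hints that an address will be needed soon; the function name and the look-ahead distance of 16 elements are guesses for illustration, not tuned values. GCC also documents a `-fprefetch-loop-arrays` option that asks the compiler to insert such prefetches itself on targets that support them.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* elements to look ahead; an untuned guess */

/* Sum an array while hinting the cache about data needed a few
 * iterations from now, so the fetch overlaps with the additions. */
long sum_with_prefetch(const int *p, size_t n)
{
    long total = 0;
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&p[i + PREFETCH_AHEAD]);
#endif
        total += p[i];
    }
    return total;
}
```

The result is identical with or without the hint; only the timing can change. For a plain sequential walk like this one the hardware prefetcher on many CPUs already does the job, so any gain is more likely on irregular access patterns.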

Re:Who cares? by Anonymous Coward · 2002-11-13 03:21 · Score: 0

Sounds to me like the piece of sh1t programming language is the "major impedence on your productivity." Try using a real language.

Slashdot Mirror

158 comments