Oh, and I should point out that while the printer ports *could* use IRQs, the Linux printer driver at the time operated the ports in polling mode, so thankfully they didn't need to eat up my precious IRQs. I didn't connect either to an IRQ. The fact that the printer ports, when run in interrupt driven mode, used IRQs 5 and 7 is the reason the boards even cared to think about these IRQs.
I once had a 486 Linux box I set up for myself and roommates to use. We had a couple spare serial terminals, so I put two serial cards in the machine so I'd have enough serial ports to cover a mouse, two terminals and a modem. With the serial cards came two printer ports, only one of which I needed. IRQs started to become a problem. Fortunately, one of the I/O cards was flexible enough to allow assigning IRQs 5 and 7 to the COM ports. These IRQs were normally used by the printer ports.
Later, I added an NE2K clone card to the machine along with a sound card. One of the serial terminals was retired and replaced with my roommate's PC connected by Ethernet. The sound card insisted on IRQ 5, and the NE2K card only allowed selecting between IRQ 2 and IRQ 5. IRQ 2 was already taken--it's the "cascade" interrupt that IRQ 8 and up map to--leaving me with only IRQ 5. (I forget what else I had munching IRQs. I believe my Adaptec 2740 ate one of the upper IRQs, which would explain why I couldn't use the cascade interrupt.) What to do, what to do...
I took a jumper wire intended to hook a CD-ROM drive to the sound card (as I didn't have a CD-ROM drive at the time), and used it to jump the IRQ 7 post on one of the I/O boards to the IRQ select line on the NE2K card. Voila! The NE2K was now on IRQ 7.
I ran that way for a good year or so until I could retire the extra serial ports. I didn't retire the machine for another few years.
And I suppose you pronounce gigabyte with a soft "J", and participated in debates as to whether DOS was pronounced "doss" or "dose"?
Nobody, aside from the engineers that have to make it work, cares about the difference between the bit-rate of the channel vs. the symbol rate of the analog encoding. What's important is the bit-rate between endpoints, from one DTE to another DTE. The bit-rate between the DTE (computer) and the DCE (modem) is equal to the symbol rate for RS-232 and related protocols, and therefore designations such as "2400 baud," "9600 baud," and so forth are correct when referring to what the DTE transmits and receives. The fact that the signal gets transformed to and from another encoding which decouples symbol rate from bit-rate is abstracted from the user.
Granted, with more modern modems, the negotiated bit-rate of a connection may be different than the bit-rate between DTE and DCE. Even so, the effective DTE to DTE bit-rate can still be referred to as baud, since baud = bps over RS-232.
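To make the baud vs. bps distinction concrete, here is a minimal sketch in C. The helper name `bits_per_second` is made up for illustration. The only facts it leans on are that RS-232 signals one bit per symbol, and that a V.22bis modem signals at 600 baud with 4 bits per symbol on the phone line.

```c
#include <assert.h>

/* Bits per second carried by a link that signals `baud` symbols per
 * second with `bits_per_symbol` bits encoded in each symbol.
 * (Hypothetical helper, for illustration only.) */
long bits_per_second(long baud, int bits_per_symbol)
{
    assert(baud > 0 && bits_per_symbol > 0);
    return baud * bits_per_symbol;
}
```

With one bit per symbol, as on the RS-232 side, `bits_per_second(9600, 1)` is 9600, so baud and bps coincide. On the phone-line side of a V.22bis link, `bits_per_second(600, 4)` is 2400: a 2400 bps connection that is only 600 baud on the wire.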
--Joe
Not infectable or not affected by infection? (on "Creating Prion-Free Cows", Score: 2, Interesting)
Ok, so BSE damages prions which leads to all the characteristics of the disease. No prions, no disease. But does that necessarily mean no infection?
BSE can be passed to humans. Is it possible that these genetically modified cows are just modern day Typhoid Marys?
On further reflection, I wonder if it's more a marketing move than anything else. The website's already popular, so explicitly aligning yourself to the website theoretically gives you an automatic fan base. If, instead, you just went out there using the name but distanced yourself from the famous website and growing list of books, you might alienate the one group of people already known to appreciate the material.
I don't know a lot about the movie, but the Wikipedia page indicates it's based specifically on the website. That said, it seems natural to obtain licensing.
I don't see it as being all that different than Weird Al asking for explicit permission to do song parodies, even though copyright law doesn't specifically require it.
I'd say it's meritocracy at work. Some time ago there were a handful of "Darwin Awards" websites. Wendy's seems to have provided a consistent level of quality and relatively regular updates. The others apparently did not hold up as well. (I remember finding the others as only slightly more believable than The Weekly World News, and about as reliable.) Furthermore, Wendy was able to make successful books out of her collected stories, which probably helped. That's the very essence of de facto, really. Wendy's site filled the role better than the others, and so became the de facto source.
It's the translucent elements that seem to do it. Those drag Firefox to its knees, and according to 'top,' all the MIPS are being spent in the X server, not the browser. They need to figure out a faster way to do things that doesn't soak the X server like that.
I agree with you that out-of-order execution only gets you so far. In fact, the CPU whose architecture I work most closely with is not only in-order, it's statically scheduled and has an exposed pipeline. We've never had the dynamic scheduling hardware. :-) To truly overcome memory system bottlenecks, it will take pushing a good part of the problem back up towards the program itself. That is, you won't be able to build a large enough hardware window to hide all of the latencies and costs associated with the memory system. The program code itself will need to be structured appropriately, and that will be the responsibility of the compiler and the programmer.
There's still a place for hardware features to serve part of this need. Itanium is an in-order, statically scheduled machine. (It doesn't have an exposed pipeline though.) It provides explicitly speculated loads (with "check loads" to catch the data) to allow programs to issue memory reads as early as they can, to try to hide the L1 and part of the L2 latency. Because they're speculative, they can be moved ahead of conditional branches that might prevent their execution. For example, suppose you had:
    if (!p) return;
    x = p->foo;
    y = p->bar;
    z = p->baz;
    /* etc. etc. */
All of the reads for p->whatever could be moved before the return statement by the compiler.
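As a rough model of what that hoisting looks like, here is a sketch in plain C. Everything in it is a stand-in: `spec_int`, `spec_load` and the two `sum_*` functions are hypothetical names, "bad address" just means NULL, and the early return yields -1 so the result can be checked. On the real machine the speculative load and the later check are instructions, not function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node { int foo, bar, baz; };

/* Hypothetical model of a register written by a speculative load:
 * the value plus a "deferred fault" flag instead of a trap. */
struct spec_int { int value; bool deferred_fault; };

/* Stand-in for a speculative load. It never traps: a bad base pointer
 * (modelled here as NULL) only poisons the result. */
struct spec_int spec_load(const struct node *p, size_t offset)
{
    struct spec_int r = { 0, p == NULL };
    if (!r.deferred_fault)
        r.value = *(const int *)((const char *)p + offset);
    return r;
}

/* Original order: the loads sit behind the null check. */
int sum_checked_first(const struct node *p)
{
    if (!p) return -1;
    return p->foo + p->bar + p->baz;
}

/* Order after hoisting: loads are issued first, the branch comes later,
 * and the values are consumed only on the path where they are valid. */
int sum_loads_hoisted(const struct node *p)
{
    struct spec_int x = spec_load(p, offsetof(struct node, foo));
    struct spec_int y = spec_load(p, offsetof(struct node, bar));
    struct spec_int z = spec_load(p, offsetof(struct node, baz));

    if (!p) return -1;   /* the original early return */

    /* Stand-in for the check: real code would branch to recovery. */
    assert(!x.deferred_fault && !y.deferred_fault && !z.deferred_fault);
    return x.value + y.value + z.value;
}
```

Both functions return the same result for every input. The only difference is that the hoisted version starts its memory reads before it knows whether `p` is valid, which is what buys back some of the load latency.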
The programmer bears some responsibility also. In embedded contexts, this responsibility is made explicit through the DMA controller. By providing on-chip RAM and a DMA controller, the chip effectively begs the programmer to manage on-chip<->off-chip data movement explicitly.
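A common shape for that explicit data movement is double buffering: process one on-chip buffer while the next chunk is being fetched. The sketch below only shows the structure. `dma_fetch`, `CHUNK` and `sum_double_buffered` are hypothetical names, and `memcpy` stands in for a DMA controller that would really run in the background and need a wait-for-completion step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 64   /* hypothetical on-chip buffer size, in elements */

/* Stand-in for starting a DMA transfer. memcpy completes immediately,
 * so this sketch has no separate "wait for DMA" step. */
static void dma_fetch(int *on_chip, const int *off_chip, size_t n)
{
    memcpy(on_chip, off_chip, n * sizeof *on_chip);
}

/* Sum a large off-chip array using two on-chip buffers. While one
 * buffer is processed, the next chunk is (notionally) in flight. */
long sum_double_buffered(const int *off_chip, size_t total)
{
    int buf[2][CHUNK];
    long sum = 0;
    size_t done = 0;
    int cur = 0;

    assert(off_chip != NULL);
    size_t n = total < CHUNK ? total : CHUNK;
    dma_fetch(buf[cur], off_chip, n);          /* prime the first buffer */

    while (n > 0) {
        size_t next_off = done + n;
        size_t next_n = total - next_off;
        if (next_n > CHUNK)
            next_n = CHUNK;

        if (next_n > 0)                        /* kick off the next chunk */
            dma_fetch(buf[cur ^ 1], off_chip + next_off, next_n);

        for (size_t i = 0; i < n; i++)         /* compute on current chunk */
            sum += buf[cur][i];

        done = next_off;
        n = next_n;
        cur ^= 1;                              /* swap buffers */
    }
    return sum;
}
```

With a real controller, the fetch into `buf[cur ^ 1]` overlaps with the summing loop, so memory latency is hidden behind compute as long as each chunk takes longer to process than to transfer.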
For certain workloads, I think you're right that hardware thread contexts are another interesting way to make use of a central compute resource. With simultaneous multithreading (SMT) and N threads, the apparent depth of the pipeline to a given thread is 1/Nth the actual depth. This allows code to run fairly efficiently, even if it's a pointer-chasing, conditional-branching nightmare. The flip side is that such an architecture puts a much greater pressure on the L1 cache, since now each thread's working set must coexist with the other N-1 threads that are coexecuting. This L1 pressure is probably the reason Pentium 4's Hyper-Threading (a similar, but not identical concept) didn't provide much of a performance boost, and even slowed some things down.
So far, the direction of the future seems to be multi-core. Multiple CPUs, multiple L1s, and high bandwidth links between them. This forces the application to structure itself into parallel threads, exposing parallelism at a task level. Again, we see the notion of the hardware pushing back on the software, saying "You gotta change if you wanna keep going faster."
Ok, cool. I know IBM eschewed Altivec at first as opposed to embracing it. I know they embraced it on the PPC 970s that Apple embraced as the G5, but I couldn't remember if it ended up on the Cell processors. Thanks.
Ah, ok. Yes, the PPE is bounded by traditional L1/L2 latency indeed. One thing I'm not aware of: Does the PPE implement Altivec's streaming prefetch? I'm a memory system and CPU architect, so while I study the state of the art, that doesn't mean I can instantly rattle off individual instantiations' actual feature set.
Thank you for engaging me in a thoughtful discussion. I hope I wasn't condescending. I try to provide context behind my arguments, and often that context is intended as much for the larger audience as it is for the person I'm responding to. In this case, I also mistook your comments as applying to the SPEs, not the PPE.
In-order vs. out-of-order is mostly a red herring for the workloads that the Cell SPE runs. Indeed, memory system latency is mostly a red herring as well. I just made a pair of long winded posts here that try to cover these topics in greater depth. Granted, both you and faragon have a better grasp on the issues than most. I tend to try to explain things in the big picture, so it's accessible to a wider audience that might be curious but not (yet) informed.
Thank you for stimulating the discussion. As a CPU and memory system architect, I enjoy the opportunity to explain what we're up against, even if Slashdot isn't the greatest forum for doing so, and the explanations are sometimes unsatisfyingly short relative to the topic matter.
I'm missing a sentence I needed to fully make my point, without being apparently contradictory. I said: Indeed, memory interfaces have grown from 8 bits and 16 bits to now 128 bit and 256 bit. (A dual Opteron system with RAM populated on each memory port has a 256-bit wide memory interface, effectively.) Add after that: However, to keep pace with the phenomenal growth in CPU performance we've seen, they'd easily need to be 10x that width, depending on how you measure things.
This issue deserves greater exploration. (Warning: Long winded ramblings below, intended to give background to a wider audience. I'm a CPU architect by trade, and would like to educate while keeping the discussion accessible.) It gets tricky to measure available memory interface bandwidth, because caches distort the bandwidth requirements on the memory interface, and latency throttles the rate at which an unmodified program can make memory system requests.
Consider a hypothetical system at the turn of the 80s, running at about 0.3 MIPS (1MHz 6502), with a memory interface capable of 8Mbit/s. (1MHz x 8-bit bus.) This is a memory bandwidth to compute ratio of 27:1. And those are 8-bit MIPS. The 32-bit MIPS are probably 1/3rd to 1/4th that or worse. The compute engine is the bottleneck, and all requests complete with essentially no latency. CPU asserts the address, and the RAM asserts the data on the next cycle.
Now consider a hypothetical top of the line CPU of today, with 128-bit vector instructions and multiple integer units. If you could keep all the units fed, depending on the CPU, you can issue 8 to 16 32-bit operations (not instructions, mind you) per cycle. Assume the fastest case. At 3GHz, that amounts to 48,000 32-bit MIPS. Meanwhile, suppose the memory interface on that same CPU has grown to 128-bit x 1GHz. That's a total bandwidth of 128Gbit/s.
Compute performance has grown by a factor of 480,000 or more on 32-bit code. (Well, less on code that only needed 8 bits, but you could always throw in floating point for the ultimate coup de grace on the part of modern hardware.) Meanwhile, memory system bandwidth has grown by a factor of 16,000. The ratio of difference in this hypothetical situation is 30:1. Granted, I picked nice round numbers and assumed perfect workloads. The reality may be closer to 10:1 or less once you take the effect of latency on request rate into account. This assumes you interpret the loss in compute performance as a reduction in demand on the memory system, not a loss in available memory system bandwidth. If you consider it a loss in memory system bandwidth, it makes the ratio look worse. (See how hard it is to talk about this?)
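For anyone who wants to check the arithmetic, here it is spelled out in C. The figures are the round numbers from the hypothetical above, not measurements. The struct and function names are invented for this sketch, and the 32-bit rate of the 6502 is taken as 1/3 of its 0.3 8-bit MIPS.

```c
#include <assert.h>

struct machine {
    double mips32;      /* millions of 32-bit operations per second */
    double mem_mbit_s;  /* memory interface bandwidth in Mbit/s */
};

/* 1 MHz 6502: 0.3 8-bit MIPS, assumed 1/3 of that for 32-bit work;
 * memory interface is 1 MHz x 8 bits. */
struct machine machine_1980(void)
{
    struct machine m = { 0.3 / 3.0, 1.0 * 8.0 };
    return m;
}

/* Hypothetical modern CPU: 3 GHz x 16 operations per cycle;
 * memory interface is 1 GHz x 128 bits. */
struct machine machine_today(void)
{
    struct machine m = { 3000.0 * 16.0, 1000.0 * 128.0 };
    return m;
}

/* Growth factor going from `before` to `after`. */
double growth(double before, double after)
{
    assert(before > 0.0);
    return after / before;
}
```

Compute grows from 0.1 to 48,000 MIPS (about 480,000x), bandwidth grows from 8 to 128,000 Mbit/s (16,000x), and dividing one growth factor by the other gives the 30:1 gap.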
Caches skew this tremendously. On that good ol' 6502, the memory system could service program fetches, data fetches, and still have half its bandwidth left over. Steve Wozniak used that to great effect on the Apple ][, using even cycles for the CPU and odd cycles for display refresh. Modern CPUs cache everything they can, and do so aggressively. Program fetches are serviced almost entirely by cache, keeping the memory system from seeing the vast majority of program fetches. Thus, the effect of program footprint on memory bandwidth has been very sub-linear with respect to compute rate. The data side of the equation is quite a different story.
Random, scattered scalar accesses get amplified by most caches. Caches tend to operate in terms of cache lines, so a random scalar read or write gets amplified into a full cache line transaction. Wider interfaces tend to hide this effect, especially if the width of the cache line matches or is a small multiple of the memory system interface's width. More typical program sequences have strong temporal and spatial locality, meaning that caches service the accesses directly, filtering them out of the requests going to the external memory interface. This too reduces the impact on the external memory bandwidth requirement.
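A back-of-the-envelope model of that amplification, as a sketch: the function names are made up, the scattered case assumes the worst (no two accesses share a line and every one misses), and the sequential case assumes the walk starts on a line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved over the external interface when `accesses` scalar reads
 * each miss in a cache with `line_bytes`-byte lines and no two reads
 * land in the same line (worst case). */
size_t bytes_moved_scattered(size_t accesses, size_t line_bytes)
{
    assert(line_bytes > 0);
    return accesses * line_bytes;
}

/* Same number of scalar reads, but walking memory sequentially from a
 * line-aligned address: each line is fetched exactly once. */
size_t bytes_moved_sequential(size_t accesses, size_t scalar_bytes,
                              size_t line_bytes)
{
    assert(line_bytes > 0);
    size_t useful = accesses * scalar_bytes;
    size_t lines = (useful + line_bytes - 1) / line_bytes;  /* round up */
    return lines * line_bytes;
}
```

For 1000 reads of 4-byte values with 64-byte lines, the scattered case moves 64,000 bytes to deliver 4,000 useful bytes (16x amplification), while the sequential case moves 4,032 bytes.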
I think you're confusing bottlenecks, but it's easy to do. At the very least, I may not have been entirely clear about which bottleneck I consider most important.
The SPEs have flat memory and software managed paging to help hide the latency of starting a new task on an SPE. A separate DMA controller brings code to the SPE's local memory, ideally well ahead of when it is needed. I think you're confusing the SPE's prefetch instruction with a traditional cache prefetch. The SPE uses a single high speed memory port to fetch instructions and data, and I'm pretty sure each can only access its local memory store. The SPE's fetch pipeline can hold 2.5 "fetch packets" of instructions, each packet containing 32 instructions. That prefetch amounts to 80 instructions, or 40 to 80 cycles of execution capability. (The SPE vector architecture can issue 1 or 2 vector instructions per cycle, and that's it.) Also, IIRC, branches can re-hit in this buffer, allowing tight loops to execute entirely from the prefetch buffer structure. This is entirely reasonable.
Yes, the ratio of compute power to memory bandwidth has increased enormously, but in the meantime, the amount of work the CPU does on each byte of memory has also increased noticeably. Furthermore, most interesting workloads have either good locality, or good access predictability. If that weren't true, then we wouldn't see noticeable gains on many workloads as CPUs got faster. Instead, we'd build ever wider memory interfaces to try to keep up. Indeed, memory interfaces have grown from 8 bits and 16 bits to now 128 bit and 256 bit. (A dual Opteron system with RAM populated on each memory port has a 256-bit wide memory interface, effectively.)
For graphics workloads, the access pattern ranges from moderately to very highly predictable. Hence the prevalence of specialized DMA engines and/or data prefetch instructions in many programmable graphics engines, including the Cell Broadband Engine. The PowerPC Altivec instruction set defines a set of streaming prefetch instructions for the same purpose. So, both PS3 and Xbox360 have well defined, well understood and effective ways to hide memory latency and to make the most of the bandwidth they have.
The RAM bottleneck I was referring to does not concern bandwidth or latency (though both are certainly an issue). It has more to do with working set. As scenes get more complex, it takes larger numbers of textures, vertices and everything else. (I hesitate to say "triangles," because they're not the only primitive you might deign to render.) Keeping all that render state in addition to world state and program code now becomes the challenge. Now the PS3 has a leg up here: The Xbox360 may not have a hard-drive, whereas the PS3 always has at least some HD. Paging textures and world data from optical media is tremendously painful. At least the PS3 can use its HD to page some of its state. Sure, hard drives are much slower than main memory, but optical media is much, much, MUCH slower than that. Think 10s of milliseconds vs. 100s to 1000s of milliseconds, depending on how much seeking you end up doing.
The more you can keep in RAM, the richer the world you can build, and the less you need to hit the spinning media. That's the bottleneck I was referring to.
I blame the new folding comments here on /. :-) I didn't see the grandparent to my post, so I thought it was someone being snarky about the PS3. Mea culpa. Thanks for pointing that out.
That said, I still think my points stand on their own in the larger conversation.
Actually, your comment does raise a valid point, and deserves a more thoughtful response.
A console definitely demands a different approach to player interaction than a desktop does, for a variety of reasons. On the desktop you always have a keyboard and always have a display capable of at least moderately high res (800x600 minimum, usually more these days), whereas on the console you pretty much never have a keyboard and are still stuck with a large number of NTSC displays. How you interact with the game at the UI level will be quite different. Even how you draw everything could be rather different once you get above the core world logic.
I'm much less convinced that the underlying physics and modeling will be that much different between the two. How you present the world to the user (the graphics rendering) and how you interact with the user (controllers, UI elements) will certainly need to be different, but how the world operates should be pretty much the same. Otherwise I can see it being very hard to tune the interactions in the virtual world to make a fun, playable game.
Physics engines are complex beasts, and there are companies that specialize in implementing just that aspect of a game engine. I can definitely see there being tuned implementations of the same physics engine for different platforms, so that the physics engine runs optimally on all of its specified targets. What I have a hard time imagining is that one target will increase the number of variables and parameters it tracks over another, or will run higher-precision calculations on one platform vs. another. As game companies move to commercial physics engines, I see them instead treating it as a black box, with known inputs giving known outputs. The remaining MIPS will then be spent on game logic, which I also see as being relatively constant across platforms, and game interface, which I see as requiring extensive adaptation for each platform. Optimizing the physics engine on a given platform just frees up MIPS for the other pieces, and the piece with the most flexibility holds the rendering and UI.
Well, it seems like it'd be more a bottleneck for the PS3 and Xbox360 than for a lot of machines. I look at the CPU-speed/GPU-speed to RAM ratio on most desktops, and 512MB is just enough for the GPU, with another 1GB to 2GB sitting out there for the CPU. When compared to 3 x 3.2GHz PPC (Xbox360) or 3.2GHz PPC + 8 SPEs (PS3's Cell Broadband Engine), even a current AMD 4x4 system (4 Athlon 64s) or a Core 2 Duo system has a run for its money in processing performance. So the ratio of compute to memory is quite a bit off compared to desktop boxes. Granted, the PS3 and Xbox360 don't have all the other miscellany running in the background that a desktop has, but is it really that big of a difference?
Granted, consoles have traditionally gotten by with much less RAM than their desktop counterparts. This was especially true in the cartridge days, where the entire game image lived in ROM, but it seems like it should be less so in the era of optical-media based devices.
About the only way I can see using up all those MIPS is to enable advanced physics and simulation in the game, and enable extra rendering passes to spiffy-up the images. Now that we have a larger deployment of HD-capable displays, spending the MIPS on rendering I guess makes sense. But where are you going to put all the additional textures and data required if you don't have enough RAM? You certainly aren't going to aggressively page it from optical media.
Unless a game specifically targets a console and doesn't bother targeting a desktop in tandem, I can't see the developer getting too excited about developing advanced engines that soak the console CPUs with physics/simulation and coding a cut-down version that keeps up on the desktop. That'd make the game behave noticeably differently on the two platforms. So, we're left with graphics enhancements which only change the quality of the visible output of the game, not the gameplay itself. So, until the desktop platforms get into the same raw-compute territory as the consoles, it's very easy to imagine many of those console MIPS will be left on the table or just spent on polishing the graphics output.
Now to those of you who say "It isn't pushed to its limits unless you're always using 100% of the CPU." Pshaw. I would say a system is pushed to its limits when no one thing is the sole bottleneck all the time, the overall playability of the game doesn't suffer for it, and increasing the depth of any given element would cause the game to lag or misbehave in such a way that playability or enjoyability does suffer. The notion that you have to use every byte of RAM, fill every sector of the disc and use every issue slot on every cycle of the CPU to say you're at 100% is a silly one. It might've made sense when games were measured in kilobytes, RAM was measured in bytes and CPU was measured in kHz or MHz, but not in the modern era.
Firesign Theatre predicted this literally decades ago with their introduction of "You TV: TV for You the Viewer."
And what do we have now? YouTube. Now where's that Howl of the Wolf movie with Porgie Tirebiter starring High School Madness?
Navel gazers all the way, I tell you.
--Joe
(Now how did Commie Martyr make off w/ Morse Science High, anyway?)
You could do something more like differential GPS if you were to deploy city wide. Nothing says you have to use the satellites if your operating area is fairly local (as it would be for a black cab). More important than anything else is the mapping software, not the position determination.
*shrug*
Wendy, I love the site.
My only complaint? That red arrow animation. I think I'll put an ad-blocker rule in just for it. :-)
Have you considered putting up an RSS feed?
--Joe
Urgl... the <ECODE> tag is annoying. My example should read:
Carry on.
--Joe
Indeed. :-)
Ah, ok. Yes, the PPE is bounded by traditional L1/L2 latency indeed. One thing I'm not aware of: Does the PPE implement Altivec's streaming prefetch? I'm a memory system and CPU architect, so while I study the state of the art, that doesn't mean I can instantly rattle off individual instantiations' actual feature set.
Thank you for engaging me in a thoughtful discussion. I hope I wasn't condescending. I try to provide context behind my arguments, and often that context is intended as much for the larger audience as it is for the person I'm responding to. In this case, I also mistook your comments as applying to the SPEs, not the PPE.
--JoeIn-order vs. out-of-order is mostly a red herring for the workloads that the Cell SPE runs. Indeed, memory system latency is mostly a red herring as well. I just made a pair of long winded posts here that try to cover these topics in greater depth. Granted, both you and faragon have a better grasp on the issues than most. I tend to try to explain things in the big picture, so it's accessible to a wider audience that might be curious but not (yet) informed.
Thank you for stimulating the discussion. As a CPU and memory system architect, I enjoy the opportunity to explain what we're up against, even if Slashdot isn't the greatest forum for doing so, and the explanations are sometimes unsatisfyingly short relative to the topic matter.
--JoeI'm missing a sentence I needed to fully make my point, without being apparently contradictory. I said: Indeed, memory interfaces have grown from 8 bits and 16 bits to now 128 bit and 256 bit. (A dual Opteron system with RAM populated on each memory port has a 256-bit wide memory interface, effectively.) Add after that: However, to keep pace with the phenomenal growth in CPU performance we've seen, they'd easily need to be 10x that width, depending on how you measure things.
This issue deserves greater exploration. (Warning: Long winded ramblings below, intended to give background to a wider audience. I'm a CPU architect by trade, and would like to educate while keeping the discussion accessible.) It gets tricky to measure available memory interface bandwidth, because caches distort the bandwidth requirements on the memory interface, and latency throttles the rate at which an unmodified program can make memory system requests.
Consider a hypothetical system at the turn of the 80s, running at about 0.3 MIPS (1MHz 6502), with a memory interface capable of 8Mbit/s. (1MHz x 8-bit bus.) This is a memory bandwidth to compute ratio of 27:1. And those are 8-bit MIPS. The 32-bit MIPS are probably 1/3rd to 1/4th that or worse. The compute engine is the bottleneck, and all requests complete with essentially no latency. CPU asserts the address, and the RAM asserts the data on the next cycle.
Now consider a hypothetical top of the line CPU of today, with 128-bit vector instructions and multiple integer units. If you could keep all the units fed, depending on the CPU, you can issue 8 to 16 32-bit operations (not instructions, mind you) per cycle. Assume the fastest case. At 3GHz, that amounts to 48,000 32-bit MIPS. Meanwhile, suppose the memory interface on that same CPU has grown to 128-bit x 1GHz. That's a total bandwidth of 128Gbit/s.
Compute performance has grown by a factor of 480,000 or more on 32-bit code. (Well, less on code that only needed 8 bits, but you could always throw in floating point for the ultimate coup de grace on the part of modern hardware.) Meanwhile, memory system bandwidth has grown by a factor of 16,000. The ratio of difference in this hypothetical situation is 30:1. Granted, I picked nice round numbers and assumed perfect workloads. The reality may be closer to 10:1 or less if you don't take the effect of latency on request rate into account. This assumes you interpret the loss in compute performance as a reduction in demand on the memory system, not a loss in available memory system bandwidth. If you consider it a loss in memory system bandwidth, it makes the ratio look worse. (See how hard it is to talk about this?)
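For anyone who wants to check the arithmetic, here is a rough sketch of the numbers above. The variable names are illustrative, and it assumes the optimistic end of each range: 32-bit throughput on the 6502 at 1/3 of its 8-bit rate, and 16 operations per cycle on the modern CPU.

```python
# Hypothetical circa-1980 system: 1 MHz 6502, 8-bit bus.
old_ips_8bit = 300_000                  # about 0.3 MIPS
old_ips_32bit = old_ips_8bit // 3       # assumed: 32-bit work at 1/3 the 8-bit rate
old_bw_bits = 1_000_000 * 8             # 1 MHz x 8 bits = 8 Mbit/s

# Hypothetical modern system: 3 GHz, 16 32-bit ops per cycle, 128-bit x 1 GHz memory.
new_ops_32bit = 16 * 3_000_000_000      # 48,000 million ops/s
new_bw_bits = 128 * 1_000_000_000       # 128 Gbit/s

old_bits_per_instruction = old_bw_bits / old_ips_8bit   # roughly 27 bits per instruction
compute_growth = new_ops_32bit / old_ips_32bit          # 480,000x
bandwidth_growth = new_bw_bits / old_bw_bits            # 16,000x
gap = compute_growth / bandwidth_growth                 # 30x

print(f"old ratio    : {old_bits_per_instruction:.0f}:1")
print(f"compute grew : {compute_growth:,.0f}x")
print(f"bandwidth    : {bandwidth_growth:,.0f}x")
print(f"gap          : {gap:.0f}:1")
```

Picking 1/4 instead of 1/3 for the 32-bit penalty, or 8 operations per cycle instead of 16, moves the gap around, which is part of why these comparisons are slippery.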
Caches skew this tremendously. On that good ol' 6502, the memory system could service program fetches, data fetches, and still have half its bandwidth left over. Steve Wozniak used that to great effect on the Apple ][, using even cycles for the CPU and odd cycles for display refresh. Modern CPUs cache everything they can, and do so aggressively. Program fetches are serviced almost entirely by cache, so the memory system never sees the vast majority of them. Thus, the effect of program footprint on memory bandwidth has been very sub-linear with respect to compute rate. The data side of the equation is quite a different story.
Random, scattered scalar accesses get amplified by most caches. Caches tend to operate in terms of cache lines, so a random scalar read or write gets amplified into a full cache line transaction. Wider interfaces tend to hide this effect, especially if the width of the cache line matches or is a small multiple of the memory system interface's width. More typical program sequences have strong temporal and spatial locality, meaning that caches service the accesses directly, filtering them out of the requests going to the external memory interface. This too reduces the impact on the external memory bandwidth requirement.
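A toy model makes the amplification concrete. This is only a sketch under loud assumptions: 64-byte cache lines, 4-byte aligned scalar reads, and an idealized cache that never evicts anything. None of those figures come from a particular CPU.

```python
def memory_traffic(addresses, line_bytes=64):
    """Bytes fetched from external memory by an idealized, never-evicting cache.

    Each distinct cache line touched costs one full line transfer.
    """
    lines_touched = {addr // line_bytes for addr in addresses}
    return len(lines_touched) * line_bytes

SCALAR_BYTES = 4
N = 1024

# Good spatial locality: consecutive 4-byte reads share cache lines.
sequential = [i * SCALAR_BYTES for i in range(N)]
# Scattered: one 4-byte read every 4 KB, so every read lands on a new line.
scattered = [i * 4096 for i in range(N)]

useful_bytes = N * SCALAR_BYTES
seq_amplification = memory_traffic(sequential) / useful_bytes
scat_amplification = memory_traffic(scattered) / useful_bytes

print(f"sequential: {seq_amplification:.1f}x traffic per useful byte")
print(f"scattered : {scat_amplification:.1f}x traffic per useful byte")
```

With these assumed sizes the scattered pattern moves 16 bytes over the memory interface for every byte the program actually wanted, while the sequential pattern moves exactly what it uses.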
But what about latency? That's wher
I think you're confusing bottlenecks, but it's easy to do. At the very least, I may not have been entirely clear about which bottleneck I consider most important.
First, a short primer on how the SPE works. You can find a more in-depth explanation in Al Eichenberger's paper on IBM's site.
The SPEs have flat memory and software managed paging to help hide the latency of starting a new task on an SPE. A separate DMA controller brings code to the SPE's local memory, ideally well ahead of when it is needed. I think you're confusing the SPE's prefetch instruction with a traditional cache prefetch. The SPE uses a single high speed memory port to fetch instructions and data, and I'm pretty sure each can only access its local memory store. The SPE's fetch pipeline can hold 2.5 "fetch packets" of instructions, each packet containing 32 instructions. That prefetch amounts to 80 instructions, or 40 to 80 cycles of execution capability. (The SPE vector architecture can issue 1 or 2 vector instructions per cycle, and that's it.) Also, IIRC, branches can re-hit in this buffer, allowing tight loops to execute entirely from the prefetch buffer structure. This is entirely reasonable.
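The buffer arithmetic above, as a quick sketch. The constants are the figures quoted in this post (2.5 fetch packets of 32 instructions, 1 or 2 instructions issued per cycle); the names are illustrative, and the figures themselves are from memory rather than checked against the SPE documentation.

```python
FETCH_PACKETS = 2.5             # capacity of the fetch pipeline, in packets
INSTRUCTIONS_PER_PACKET = 32    # instructions per fetch packet

def cycles_of_cover(issue_rate):
    """Cycles of execution the buffered instructions can sustain at a given issue rate."""
    buffered = FETCH_PACKETS * INSTRUCTIONS_PER_PACKET
    return buffered / issue_rate

buffered_instructions = FETCH_PACKETS * INSTRUCTIONS_PER_PACKET
print(f"buffered instructions: {buffered_instructions:.0f}")
print(f"dual issue  : {cycles_of_cover(2):.0f} cycles")
print(f"single issue: {cycles_of_cover(1):.0f} cycles")
```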
Yes, the ratio of compute power to memory bandwidth has increased enormously, but in the meantime, the amount of work the CPU does on each byte of memory has also increased noticeably. Furthermore, most interesting workloads have either good locality, or good access predictability. If that weren't true, then we wouldn't see noticeable gains on many workloads as CPUs got faster. Instead, we'd build ever wider memory interfaces to try to keep up. Indeed, memory interfaces have grown from 8 bits and 16 bits to now 128 bit and 256 bit. (A dual Opteron system with RAM populated on each memory port has a 256-bit wide memory interface, effectively.)
For graphics workloads, the access pattern ranges from moderately to very highly predictable. Hence the prevalence of specialized DMA engines and/or data prefetch instructions in many programmable graphics engines, including the Cell Broadband Engine. The PowerPC Altivec instruction set defines a set of streaming prefetch instructions for the same purpose. So, both PS3 and Xbox360 have well defined, well understood and effective ways to hide memory latency and to make the most of the bandwidth they have.
The RAM bottleneck I was referring to does not concern bandwidth or latency (though both are certainly an issue). It has more to do with working set. As scenes get more complex, it takes larger numbers of textures, vertices and everything else. (I hesitate to say "triangles," because they're not the only primitive you might deign to render.) Keeping all that render state in addition to world state and program code now becomes the challenge. Now the PS3 has a leg up here: The Xbox360 may not have a hard-drive, whereas the PS3 always has at least some HD. Paging textures and world data from optical media is tremendously painful. At least the PS3 can use its HD to page some of its state. Sure, hard drives are much slower than main memory, but optical media is much, much, MUCH slower than that. Think 10s of milliseconds vs. 100s to 1000s of milliseconds, depending on how much seeking you end up doing.
The more you can keep in RAM, the richer the world you can build, and the less you need to hit the spinning media. That's the bottleneck I was referring to.
--Joe
I mentioned as much in my post. 150MB extra background image is a far cry from 1536MB extra RAM though.
I blame the new folding comments here on /. :-) I didn't see the grandparent to my post, so I thought it was someone being snarky about the PS3. Mea culpa. Thanks for pointing that out.
That said, I still think my points stand on their own in the larger conversation.
--Joe
Actually, your comment does raise a valid point, and deserves a more thoughtful response.
A console definitely demands a different approach to player interaction than a desktop does, for a variety of reasons. On the desktop you always have a keyboard and always have a display capable of at least moderately high res (800x600 minimum, usually more these days), whereas on the console you pretty much never have a keyboard and are still stuck with a large number of NTSC displays. How you interact with the game at the UI level will be quite different. Even how you draw everything could be rather different once you get above the core world logic.
I'm much less convinced that the underlying physics and modeling will be that much different between the two. How you present the world to the user (the graphics rendering) and how you interact with the user (controllers, UI elements) will certainly need to be different, but how the world operates should be pretty much the same. Otherwise I can see it being very hard to tune the interactions in the virtual world to make a fun, playable game.
Physics engines are complex beasts, and there are companies that specialize in implementing just that aspect of a game engine. I can definitely see there being tuned implementations of the same physics engine for different platforms, so that the physics engine runs optimally on all of its specified targets. What I have a hard time imagining is that one target will increase the number of variables and parameters it tracks over another, or will run higher-precision calculations on one platform vs. another. As game companies move to commercial physics engines, I see them instead treating it as a black box, with known inputs giving known outputs. The remaining MIPS will then be spent on game logic, which I also see as being relatively constant across platforms, and game interface, which I see as requiring extensive adaptation for each platform. Optimizing the physics engine on a given platform just frees up MIPS for the other pieces, and the piece with the most flexibility holds the rendering and UI.
--Joe
I can't see developers being excited about it though.
Well, it seems like it'd be more a bottleneck for the PS3 and Xbox360 than for a lot of machines. I look at the CPU-speed/GPU-speed to RAM ratio on most desktops, and 512MB is just enough for the GPU, with another 1GB to 2GB sitting out there for the CPU. When compared to 3 x 3.2GHz PPC (Xbox360) or 3.2GHz PPC + 8 SPEs (PS3's Cell Broadband Engine), even a current AMD 4x4 system (4 Athlon 64s) or a Core 2 Duo system has a run for its money in processing performance. So the ratio of compute to memory is quite a bit off compared to desktop boxes. Granted, the PS3 and Xbox360 don't have all the other miscellany running in the background that a desktop has, but is it really that big of a difference?
Granted, consoles have traditionally gotten by with much less RAM than their desktop counterparts. This was especially true in the cartridge days, where the entire game image lived in ROM, but it seems like it should be less so in the era of optical-media based devices.
About the only way I can see using up all those MIPS is to enable advanced physics and simulation in the game, and enable extra rendering passes to spiffy-up the images. Now that we have a larger deployment of HD-capable displays, spending the MIPS on rendering I guess makes sense. But where are you going to put all the additional textures and data required if you don't have enough RAM? You certainly aren't going to aggressively page it from optical media.
Unless a game specifically targets a console and doesn't bother targeting a desktop in tandem, I can't see the developer getting too excited about developing advanced engines that soak the console CPUs with physics/simulation and coding a cut-down version that keeps up on the desktop. That'd make the game behave noticeably differently on the two platforms. So, we're left with graphics enhancements which only change the quality of the visible output of the game, not the gameplay itself. So, until the desktop platforms get into the same raw-compute territory as the consoles, it's very easy to imagine many of those console MIPS will be left on the table or just spent on polishing the graphics output.
Now to those of you who say "It isn't pushed to its limits unless you're always using 100% of the CPU." Pshaw. I would say a system is pushed to its limits when no one thing is the sole bottleneck all the time, the overall playability of the game doesn't suffer for it, and increasing the depth of any given element would cause the game to lag or misbehave in such a way that playability or enjoyability does suffer. The notion that you have to use every byte of RAM, fill every sector of the disc and use every issue slot on every cycle of the CPU to say you're at 100% is a silly one. It might've made sense when games were measured in kilobytes, RAM was measured in bytes and CPU was measured in kHz or MHz, but not in the modern era.
--Joe
This isn't an Atari 2600. Rendering and refresh are a bit more decoupled than that.
That said, I think memory will be the bottleneck before CPU and GPU.
--Joe
Firesign Theatre predicted this literally decades ago with their introduction of "You TV: TV for You the Viewer." And what do we have now? YouTube. Now where's that Howl of the Wolf movie with Porgie Tirebiter starring High School Madness?
Navel gazers all the way, I tell you.
--Joe
(Now how did Commie Martyr make off w/ Morse Science High, anyway?)
You could do something more like differential GPS if you were to deploy city wide. Nothing says you have to use the satellites if your operating area is fairly local (as it would be for a black cab). More important than anything else is the mapping software, not the position determination.
--Joe