PS3 Cell Processor 'Broken'?
D-Fly writes "Charlie Demerjian at the Inquirer got a look at some insider specs on the PS3, and says, Sony screwed up big time with the Cell processor; the memory read speed on the current Devkits is something like 3 orders of magnitude slower than the write speed; and is unlikely to improve much before the ship date. The slide from Sony pictured in the article is priceless: 'Local Memory Read Speed ~16Mbps, No this isn't a Typo.' Demerjian says when the PS3 comes out a full year after the XBox360, it's still going to be inferior: 'Someone screwed up so badly it looks like it will relegate the console to second place behind the 360.'" This is the Inquirer, so take with a grain of salt. Just the same, doesn't sound too good for Sony or IBM.
Microprocessor Online has an interesting analysis. Pay attention to page 8, where the PS2 "Emotion Engine" processor is compared to the PS3 Cell processor. This is an analyst report for the microprocessor industry.
If you really want to dig into the details of the Cell processor, check out Sony's resources. You have to agree to a bunch of things to get to the pdfs but there's a lot of information in them. Another place you can find information is IBM's resource site which contains a lot of stuff including the programming handbook.
My work here is dung.
Pay attention. The article says that SONY is telling the developers to avoid using local memory at all - that means it won't be fixed in the retail version.
On the cell processor, local memory is similar to a cache, but is not "transparently" managed by the CPU. Rather, the software must explicitly say what it wants to have in the local memory.
I was just about to post the exact same thing. It amazes me that:
1) The poster had no clue
2) Zonk (and for that matter, the whole /. editing kiddie troupe) seems to have no clue
3) This mistake happens _constantly_ on /., and it's constantly pointed out and constantly ignored
4) Anyone with even a basic understanding of computers wouldn't make this mistake
Just more proof that "IT" != computer science
Does anyone ever bother reading the *IBM* documents for this? Never mind what Sony have managed to do to the cell processor, if you turn to the IBM CBEA developers handbook (page 75), you will see:
"Load and store operations (LS), 6 Clock cycles Latency". And that's the time it takes for the instruction to complete, not to be issued to memory.
(3.2GHz / 6 cycles) * 16 bytes != 16MB/s
Personally, I'm gonna bet on IBM being right, seeing how they're the ones who made the bloody thing. I don't trust the Inquirer anyway, but if those figures are true, the most likely answer is inefficiencies in their benchmarking programs (such as instruction starvation, a nasty side effect of using SPUs).
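For anyone who wants the arithmetic spelled out, here is a minimal Python sketch. It assumes the most pessimistic case, where each 16-byte load waits the full 6 cycles for the previous one with no pipelining at all; the constant and function names are mine, not IBM's.

```python
# Back-of-the-envelope check of the figure above.
# Assumption: one 16-byte (quadword) local store load completes every
# 6 cycles with no overlap, which is a deliberately pessimistic model.
CLOCK_HZ = 3.2e9        # 3.2 GHz, as quoted above
LATENCY_CYCLES = 6      # load/store latency from the handbook quote
BYTES_PER_LOAD = 16     # one quadword

def unpipelined_bandwidth(clock_hz, latency_cycles, bytes_per_load):
    """Bytes per second if every load waits for the previous one."""
    return clock_hz / latency_cycles * bytes_per_load

bw = unpipelined_bandwidth(CLOCK_HZ, LATENCY_CYCLES, BYTES_PER_LOAD)
print(f"{bw / 1e9:.2f} GB/s")                         # about 8.53 GB/s
print(f"{bw / 16e6:.0f}x the 16 MB/s on the slide")   # about 533x
```

Even under that worst-case assumption the local store comes out hundreds of times faster than the slide's number, which is why the slide can't be about SPU local store.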
The Homer.
"Powerful, like a gorilla, yet soft and yielding like a Nerf ball." It was called "The Homer," I believe.
No, Sony are telling developers not to read from "local memory" using the Cell. This is not the same thing at all.
There is nothing to fix. This is by design.
The "Local Memory" is the RSX graphics memory. The Cell has no need to read from this.
The Inquirer article is rubbish and that slide is taken out of context. It seems to imply that the Cell can only read "Cell local memory" (whatever that is) at 16MB/s.
Memory transfer bandwidth between each SPU and its SPU Local Memory is something more like 25GB/s (gigabyte per second); sustained actual bandwidth between all SPUs is greater than 100GB/s; peak theoretical is greater than 200GB/s (assuming all 8 SPUs present for simplicity).
If you had access to the full version of the presentation (part of the full Sony PS3 SDK and technotes), you'd realise that that slide is part of a presentation about the RSX (the PS3's GPU). As such, when it refers to "Local Memory", it means RSX's Local Memory (eg graphics memory, video memory, VRAM or whatever you call it in fanboy/ps3/360-is-teh-suck websites). To be understood outside that context, the columns would be better labelled "Main System Memory" and "GPU Local Memory".
The Inquirer article seems to suggest that this figure of 16MB/s (megabyte per second, by the way, what the fuck is it with journalists swapping bits for bytes? why don't they get their shift/capslock keys fixed?) is some kind of show stopper. No it isn't. It simply means that the Cell processor has 16MB/s bandwidth when reading directly from memory-mapped GPU address space. So what? Unless you're planning on calling memcpy() or some shit to bring your data back then it doesn't really matter.
On RSX-initiated transfers you have 20GB/s bandwidth to do the same transfer (from RSX local to main system memory). Cell read bandwidth of GPU memory might as well have 0MB/s (ie no connection at all) and it wouldn't matter a bit.
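To put the two paths side by side, here is a rough Python sketch using the 16MB/s and 20GB/s figures above. The frame size (1280x720 at 4 bytes per pixel) is purely an assumed example, and I'm using decimal units throughout.

```python
# Time to pull one frame back from GPU memory over each path.
# Assumptions: a 1280x720 frame at 4 bytes per pixel (hypothetical example
# size), and decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
FRAME_BYTES = 1280 * 720 * 4          # 3,686,400 bytes

CELL_READ_BPS = 16e6                  # Cell reading RSX local memory directly
RSX_WRITE_BPS = 20e9                  # RSX-initiated transfer to main memory

def transfer_seconds(num_bytes, bytes_per_second):
    """Idealised transfer time, ignoring setup cost and contention."""
    return num_bytes / bytes_per_second

slow = transfer_seconds(FRAME_BYTES, CELL_READ_BPS)
fast = transfer_seconds(FRAME_BYTES, RSX_WRITE_BPS)
print(f"Cell read:     {slow * 1000:.1f} ms")   # 230.4 ms
print(f"RSX-initiated: {fast * 1000:.3f} ms")   # 0.184 ms
```

Same data, same destination, roughly a 1250x difference depending on who initiates the copy. That's the whole point of the "don't do this" slide.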
Sixty comments, and not one of them is actually setting this whole thing right.
Yes, the local memory can be understood as some kind of cache. It's local to the SPUs. Every SPU uses its own local memory, and can meddle around in it as it likes. The local memory is cache for the SPU, not for the CPU.
There is no reason for the main processor to ever read from an SPU's memory. If you just want to send it more data, use a DMA. If you want to review the results of a computation, have the SPU DMA them to main memory. The speed of memory accesses from CPU to local memory is irrelevant, because it never happens.
Informative, yes (and with a good link), but not relevant to this discussion.
In the slide attached to the article, the "Local Memory" is the memory local to the RSX graphics system, NOT the Cell local memory.
"The "Local Memory" is the RSX graphics memory. The Cell shouldn't need to read this. The PS3 would still work even if the Cell couldn't read this memory at all. This memory is where you store textures and other graphics data."
IMO it's reasonable to have asynchronous communication with the graphics subsystem. The only stupid thing going on is calling graphics cards memory "Local Memory". It suggests that the X-Box got it right by having one big chunk of memory that is read by both the CPU and GPU even if most developers will make the same basic split anyway.
Ah, I was trying to come up with some way that the picture could make any sense.
The RSX can read the Cell's RAM at ridiculous speeds which is all that matters. The RSX can render out of main memory, so you shouldn't ever be using the Cell to read from the RSX's RAM at all. The Cell will probably be manipulating vector data for the RSX, but 256MB for all executable code and vector data is still more than enough. The 256MB attached to the RSX would have been used primarily for textures even if the Cell could read from it at reasonable speeds.
Er, yes it is. The slide says 16MB/s, not 16Mbps, i.e. megabytes, not megabits... 16Mbps would be pretty slow!
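In case the factor of 8 isn't obvious, a trivial Python sketch of the conversion (the helper names are just mine):

```python
BITS_PER_BYTE = 8

def megabits_to_megabytes(megabits_per_second):
    """Mbps -> MB/s."""
    return megabits_per_second / BITS_PER_BYTE

def megabytes_to_megabits(megabytes_per_second):
    """MB/s -> Mbps."""
    return megabytes_per_second * BITS_PER_BYTE

print(megabits_to_megabytes(16))   # 2.0: 16 Mbps would be only 2 MB/s
print(megabytes_to_megabits(16))   # 128: 16 MB/s is 128 Mbps
```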
Correct, this should have pretty much zero performance impact. This is how PC graphics cards work (except of course for some of the "integrated graphics" solutions).
The Inquirer article assumes that this makes "Local Memory" useless. This isn't the case at all, as you use it to store the graphics data that the Cell doesn't need to read.
Well, it's 16 MEGABYTES per second-- which is still ridiculous but not as ridiculous. No offense to you-- it's yet another obvious typo in the article summary (using a small "b" instead of a large "B").
Gamingmuseum.com: Give your 3D accelerator a rest.
"The "Local Memory" is the RSX graphics memory. The Cell shouldn't need to read this. The PS3 would still work even if the Cell couldn't read this memory at all. This memory is where you store textures and other graphics data."
Presumably in the (unlikely?) event you did need the output from the RSX graphics chip for manipulation by the Cell processor gubbins, you could get it to render to main memory, let the processor do the appropriate data-diddling, then have the RSX read it back again?
The 'local memory' is presumably the RSX's private play area, and thus the RSX gets maximum-stupendous-speed priority, and the Cell gets occasional access at weekends. Which is a bonus, and not even necessary...
Tedious Bloggy Stuff - hooray?
Sadly, you are also wrong.
In the slide, the "Local Memory" refers to the RSX local memory, not the SPU local memory. The article says that the next slide is Sony telling devs to use the RSX to do the transfer instead, which only makes sense if it is talking about the RSX memory.
Your conclusion is right though, as this also is memory that the Cell doesn't need to read from.
Demerjian says when the PS3 comes out a full year after the XBox360, it's still going to be inferior
Actually, it doesn't show that at all. What the slide does show, is that the PS3 has nearly twice the graphics memory bandwidth compared to the Xbox360. It is a significant advantage for the PS3.
That is because you have never done any work in 3D graphics. It isn't at all unusual for the video memory to have incredible write speeds and painfully slow read speeds (back to the CPU, that is). The reason is that in 3D graphics the video card does the actual rendering. Therefore you simply tell it "I want a blue triangle at the coordinates X,Y,Z (x3) with T texture applied". The card renders it, applies the texture from texture memory, and then displays it on the screen. You never need to read the (texture) memory, because the data contained in it is throwaway (why would you need to read back the texture that you sent to the card?).
So it is perfectly normal for texture memory to be nearly write-only. As long as writing to it is extremely fast (which it is in this case according to the PP slide), that isn't a problem.
apt-get install redhat please god - Me (take it easy, I love Debian)
There is indeed. This painfully slow memory operation is only for reading what is designed as 'local memory' for the graphics processors.
There are a few tricks you can use with reading from there, but most of the time you don't really need the main CPU to snoop in there.
"I Know You Are But What Am I?"
Yes, in fact the article quotes the next slide as saying "Don't read from local memory, but write to main memory with RSX(tm) and read it from there instead" which is exactly what you suggest.
And they were right. PS2 had the muddiest textures among the three consoles.
--They say only a fool looks at the finger pointing to the sky...
Take a closer look at the linked image. The two top columns are CELL. Not RSX, CELL.
And the theoretical bandwidth numbers listed for CELL to main memory are those of the direct XDR interface. You'll note that the RSX has much lower numbers because it accesses main memory through a bridge bus (much like a graphics card on PCIe).
On the Cell, there is only one thing local memory can mean, and that is the local memory of each SPE.
NOTE: this can be a serious issue, because each SPE MUST read instructions and write results to the local memory. It is up to the main processor to load instructions into this memory from main memory, and to copy results from this local memory to main.
Man is the animal that laughs.
And occasionally whores for Karma.
Either that, or a broken benchmark. Each Cell processor (Synergistic Processing Element -- SPE) shares its instruction fetch port with its data memory port. The SPE can buffer up 80 instructions at a time (2.5 fetch words), plus an additional 32 from a branch target. Fetch will stall if the memory system gets saturated with loads and stores. Properly written memory-intensive code includes explicit fetches to keep these buffers full. Incorrectly written code will cause problems. Still, that doesn't explain a 3 orders of magnitude drop.
If you look at the slides on the page I linked to above, you'll see the SPEs are not connected into the global address space. They connect to a private single ported memory, and to each other through two unidirectional rings. (The ring structure is not apparent from that diagram, but trust me, it's there.) These rings then connect to a DMA engine.
If you wade through this paper, you'll see that the Cell compiler implements a software cache. (The same paper also explains the instruction fetch mechanism mentioned above, BTW.) That is, it emulates a cache in software, using the DMA to actually move memory around. Depending on the nature of the benchmark and how it was written, it could be that the read benchmark spends all its time allocating stuff into this cache and waiting for it to arrive. Writes would be faster because the cache can "write behind" without having to wait for the allocation to happen, if the compiler is smart enough to know that the previous data will be entirely overwritten. So, if the benchmark goofed, then the results are meaningless.
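To make the asymmetry concrete, here is a toy Python model of a software-managed cache. Everything in it (line size, cycle costs, class and method names) is a made-up assumption for illustration, not anything from the Cell SDK or the paper; it only shows why a naive streaming read benchmark could look far worse than the equivalent write benchmark.

```python
# Toy model of a software-managed cache in front of a slow DMA engine.
# All numbers and names are illustrative assumptions, not Cell SDK values.
LINE_BYTES = 128      # hypothetical cache line size
DMA_STALL = 500       # hypothetical cycles spent waiting on one DMA fetch
HIT_COST = 6          # hypothetical cycles for an access that hits

class SoftwareCache:
    def __init__(self):
        self.resident = set()   # line numbers currently held locally
        self.cycles = 0         # total cycles charged so far

    def read(self, addr):
        line = addr // LINE_BYTES
        if line not in self.resident:
            # Read miss: the code must wait for the DMA to deliver the line.
            self.cycles += DMA_STALL
            self.resident.add(line)
        self.cycles += HIT_COST

    def write_full_line(self, addr):
        line = addr // LINE_BYTES
        # The whole line is overwritten, so nothing has to be fetched first;
        # the model assumes the write-back happens in the background.
        self.resident.add(line)
        self.cycles += HIT_COST

def stream(op_name, n_lines):
    """Touch n_lines distinct lines once each and return cycles spent."""
    cache = SoftwareCache()
    for i in range(n_lines):
        getattr(cache, op_name)(i * LINE_BYTES)
    return cache.cycles

read_cycles = stream("read", 1000)
write_cycles = stream("write_full_line", 1000)
print(read_cycles, write_cycles, read_cycles / write_cycles)
```

With these invented costs the read stream comes out about 84x slower than the write stream. Pick a bigger stall and the gap grows accordingly, so a badly designed benchmark could report a gap that says nothing about the hardware.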
Fact of the matter is that the SPEs are capable of reading 128 bits a cycle each (128 bytes / cycle across the 8 SPEs). Other benchmarks, such as the article recently posted to Slashdot about using Cell for scientific computation confirm that this thing hauls--and these are bandwidth-intensive tasks. The quoted paper did run some numbers on real silicon and showed numbers similar to their simulation results.
With all this in mind, I find it hard to believe that Cell is broken.
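The 128 bits per cycle figure translates into bandwidth like this. It's a minimal Python sketch that assumes the 3.2GHz clock quoted elsewhere in this discussion and all 8 SPEs present; these are peak local store numbers, not sustained ones.

```python
CLOCK_HZ = 3.2e9          # assumed clock, quoted elsewhere in this thread
BITS_PER_CYCLE = 128      # per SPE, as stated above
NUM_SPES = 8              # assuming all 8 are present

bytes_per_cycle = BITS_PER_CYCLE // 8             # 16 bytes
per_spe = bytes_per_cycle * CLOCK_HZ              # bytes/s for one SPE
total = per_spe * NUM_SPES                        # bytes/s across all SPEs

print(f"per SPE: {per_spe / 1e9:.1f} GB/s")       # 51.2 GB/s
print(f"all 8:   {total / 1e9:.1f} GB/s")         # 409.6 GB/s
```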
--JoeProgram Intellivision!
(Note that at the time of this article, the 7800GTX retailed for $599 and the price of the PS3 was unknown. The comment about $599 refers to the 7800GTX.)
In my opinion, the quote clearly states that the RSX is more powerful than the 7800. Even if you view it as ambiguous, the Inquirer still chose to run a story based on a misinterpretation of an unconfirmed quote which was posted on a message board by a user with no credentials. The original article is still uncorrected.
It's there for a reason.
My flame:
"I'm sure you'll get a lot of these messages, but hell, you deserve it.
The slow read speed you noted in the slide is for Cell reading from the RSX's local memory. Such accesses are expected to be very slow. If you look at this USENIX article from one of the Linux DRI folks, you can see this quite easily:
DRI article
He shows how painfully slow it is to read from AGP or framebuffer memory (14 and 5 MB/sec, respectively), on a Rage 128 graphics card. For the CPU to framebuffer read, which is the equivalent to what we're talking about here, the read speed is 1/40th the write speed. At 16MB/sec read and 4GB/sec write, the PS3 is actually right in line with what can be expected of modern GPU architectures.
Reading from the framebuffer is just slow unless you have a unified memory architecture. The CPU and the GPU aren't cache-coherent, which means every access to framebuffer memory (or even AGP memory, which is actually a chunk of system memory allocated to the GPU) must be an uncached access. Uncached accesses are just plain slow, on any architecture.
The way your article is written, it makes it seem like Cell reads its local storage at 16 MB/sec. That is, of course, bollocks, since IBM has shown benchmarks of the Cell local storage achieving 98% efficiency. If you had any journalistic integrity at all, you'd post a retraction on your site, and a clarification of the technical issues involved."
A deep unwavering belief is a sure sign you're missing something...
No offense, but the 3DO was not a "miserable failure"; in fact it sold very well and had some great titles. The 3DO was also not "open". It was a _franchise_: manufacturers could use the design specs and pay a royalty for each system sold, and there were no game licensing restrictions beyond a royalty of $3 per game.
Street Fighter 2, NFS, Road Rash, Dragon's Lair, EA Boxing, Gex, and more.
It was expensive, but offered a high quality arcade-like experience. The lack of licensing also led to a very large library which was good, but a lot of the games were crap which was bad. It was in stark contrast to Nintendo at the time, and a good idea and a good way to get lots of games out there quickly for their system.
http://teasphere.wordpress.com - A little spot of tea
The entertaining thing is that this particular problem is something quite natural to PCs. Framebuffer accesses on PCs have been slow ever since graphics cards started sporting dedicated coprocessors. Take a brand spanking new PC and start downloading stuff from the framebuffer, and your transfer rate will be abysmal.
The problem here isn't just a lack of embedded hardware knowledge. It's just a lack of knowledge.
A deep unwavering belief is a sure sign you're missing something...