The Art of PS3 Programming
The Guardian Gamesblog has a longish piece talking with Volatile Games, developers of the title Possession for the PS3, about what it's like to make a game for Sony's next-gen console. From the article: "At the end of the day it's just a multi-processor architecture. If you can get something running on eight threads of a PC CPU, you can get it running on eight processors on a PS3 - it's not massively different. There is a small 'gotcha' in there though. The main processor can access all the machine's video memory, but each of the seven SPE chips has access only to its own 256k of onboard memory - so if you have, say, a big mesh to process, it'll be necessary to stream it through a small amount of memory - you'd have to DMA it up to your cell chip and then process a little chunk, then DMA the next chunk, so you won't be able to jump around the memory as easily, which I guess you will be able to do on the Xbox 360."
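For anyone wondering what "process a little chunk, then DMA the next chunk" looks like in practice, here is a rough sketch in Python. It is only an illustration of the access pattern, not PS3 code: `stream_process`, `CHUNK_BYTES`, and the slice copy standing in for a DMA transfer are all made up for the example, and the 64 KB chunk size is an arbitrary choice that fits inside the 256k the quote mentions.

```python
LOCAL_STORE_BYTES = 256 * 1024   # per-SPE memory size, as quoted in the article
CHUNK_BYTES = 64 * 1024          # hypothetical chunk size, chosen to fit in local memory


def stream_process(data, chunk_bytes, kernel):
    """Run `kernel` over `data` one chunk at a time.

    Only one chunk is ever held in the "local" buffer, so the kernel
    cannot jump around the whole data set the way it could in main memory.
    """
    assert chunk_bytes <= LOCAL_STORE_BYTES
    results = []
    for offset in range(0, len(data), chunk_bytes):
        # The slice copy stands in for a DMA transfer into local memory.
        local = bytearray(data[offset:offset + chunk_bytes])
        results.append(kernel(local))
    return results


# A fake 256 KB "mesh", reduced chunk by chunk.
mesh = bytes(range(256)) * 1024
partial_sums = stream_process(mesh, CHUNK_BYTES, sum)
```

The catch the developer describes shows up directly: the kernel only ever sees `local`, so any algorithm that needs random access across the whole mesh has to be restructured first.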
Apparently, the machine's use of OpenGL as its graphics API means that anyone who's ever written games for the PC will be intimately familiar with the set-up.
As a programmer, I can attest to OpenGL being a godsend. Not only are programmers intimately familiar with the technology, but it was designed from the beginning with portability in mind. Direct3D, OTOH, tends to follow Microsoft's practice of hiding what's really going on behind the scenes. It's been a little while since I've bothered with Direct3D, but one of Microsoft's biggest features used to be their own SceneGraph known as "Retained Mode". For some reason, Microsoft believed that everyone would want to use only their SceneGraph, and damn technological progress. Most programmers who were in the know immediately bypassed this ridiculousness and went straight for the "Immediate Mode" APIs, which weren't as well documented. (Thanks, Microsoft.)
Wikipedia has a comparison of Direct3D vs. OpenGL here: http://en.wikipedia.org/wiki/Direct3D_vs._OpenGL
Other than that, a computer is a computer, and game programming has always required a strong knowledge of how computers operate. So it's not too surprising that it would be "just like any other programming +/- a few gotchas".
Javascript + Nintendo DSi = DSiCade
Keep in mind that all the "extra" cores are special-purpose cores that can only execute code specifically written for them. They are not general-purpose cores that would let you run 16 applications simultaneously. Also consider that the CPUs for the new consoles are targeted at consoles and not at multitasking operating systems with lots of context switching. There's also the roadmap issue. Sure, this one processor will be available, but what about speed bumps and future generations?
I'm still baffled as to how you can efficiently break up a game into 8 threads.
.... whoops, problem... need critical sections for this to operate with the graphics thread.
.... whoops, problem... need critical sections for this to operate with the physics thread..
ok controller input on one..
graphics on another..
physics on a third
networking on a fourth
sound... ok no problems here, that's 5.
See, even dividing it up into 5 threads causes problems: you need to make sure that you are allowed to do something on one processor, and if not, you must wait for another processor to finish. Critical sections can ultimately cause your code to run slower than if it were not multithreaded in the first place.
More info on critical sections, and other issues involved with programming multithreaded apps can be found here
Kent Simon Multitheft Auto
There are other ways to divvy up work.
If your intention is to put independent tasks out to different processors, you will run into huge issues like the ones you describe.
Instead, consider the beginning of each logical step in the game loop as a "constriction/delegation" point: You constrict, meaning that only one thread is running right now. Then, say, it's time for particles. You now wake up your eight particle worker threads and divvy up the gargantuan 2000-emitter particle loop into slices of 250 emitters each. You then instruct each particle thread to work through its 250 emitters and wait for them all to finish.
Naturally your real performance won't be as if you only had to process 250 emitters, but even if you lose 50% to internal synchronization, you've still processed all your particles in 25% of the time.
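To make the constrict/delegate idea concrete, here is a minimal sketch in Python. The names (`update_emitter`, `update_slice`, `update_all_emitters`) are invented for the example and the per-emitter work is a dummy. Python threads won't actually give you a CPU speedup because of the interpreter lock, so treat this purely as the shape of the code, not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8
NUM_EMITTERS = 2000


def update_emitter(emitter_id):
    # Hypothetical stand-in for the real per-emitter particle work.
    return emitter_id * 2


def update_slice(start, stop):
    # Each worker owns a contiguous slice, so no locking is needed inside it.
    return [update_emitter(i) for i in range(start, stop)]


def update_all_emitters(pool):
    per_worker = NUM_EMITTERS // NUM_WORKERS  # 250 emitters each
    # Delegation point: hand one slice to each worker.
    futures = [
        pool.submit(update_slice, w * per_worker, (w + 1) * per_worker)
        for w in range(NUM_WORKERS)
    ]
    # Constriction point: block until every worker has finished.
    results = []
    for future in futures:
        results.extend(future.result())
    return results


with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    updated = update_all_emitters(pool)

# The arithmetic from the post: 8 workers would ideally take 1/8 of the time;
# losing half of that to synchronization still leaves 1/4.
ideal_fraction = 1 / NUM_WORKERS
actual_fraction = ideal_fraction / 0.5
```

The only synchronization is the join at the end of the step, which is why this avoids the critical-section mess you get from running unrelated subsystems side by side.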
Another way is to pipeline the tasks: You know that all your game gizmos have to first do this, then that and then the other. You create three task threads, one that does "this", one that does "that" and one that does "the other". You feed the first gizmo to the "this" thread. When it is done, it will feed the gizmo on towards the "that"-thread. When the "that"-thread is done, it will lastly feed the gizmo on to the "the other"-thread.
But once the first thread (the "this" task) is done, it can accept a new gizmo while the "that"-thread munches on the first.
The advantage of this scheme is better memory locality (which seems to be more important on PS3 than on, say, PC) than the divide'n'conquer approach described first. Of course, individual game gizmos may have dependencies between them, so you need a proper dependency graph to feed gizmos through in the right order.
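The pipelined version can be sketched the same way, again in Python and again with made-up names: `this`, `that`, and `the_other` are placeholder stage functions, and `DONE` is a sentinel I've added to shut the stages down. It ignores the dependency-graph problem entirely and assumes gizmos are independent.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the gizmo stream


def stage(work, inbox, outbox):
    """Pull gizmos from inbox, apply `work`, and pass them downstream."""
    while True:
        gizmo = inbox.get()
        if gizmo is DONE:
            outbox.put(DONE)  # tell the next stage to stop as well
            return
        outbox.put(work(gizmo))


def run_pipeline(gizmos, stage_functions):
    # One queue in front of each stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(stage_functions) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_functions)
    ]
    for thread in threads:
        thread.start()
    for gizmo in gizmos:
        queues[0].put(gizmo)
    queues[0].put(DONE)

    finished = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        finished.append(item)
    for thread in threads:
        thread.join()
    return finished


# Hypothetical stage work: "this", then "that", then "the other".
def this(gizmo):
    return gizmo + 1


def that(gizmo):
    return gizmo * 2


def the_other(gizmo):
    return gizmo - 3


processed = run_pipeline(range(5), [this, that, the_other])
```

Because each stage is a single thread reading from a FIFO queue, gizmos come out in the order they went in, and the "this" stage can start on gizmo 2 while "that" is still chewing on gizmo 1.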
It's doable, as long as you don't think 8 threads have to independently work on completely different tasks at the same time.
(If the OS analogy is flawed, sorry).
A lot of people seem to be approaching the concept of the Cell processor improperly. The chip itself is not designed for the "Design a game in 8 threads" approach people seem to be thinking of. It's designed around a foreman/worker metaphor. The main chip handles the work of figuring out what comes next; the SPEs do the heavy lifting.
...
Don't think
Processor 1 = AI
Processor 2 = Physics
Processor 3 =
etc.
Instead picture the main CPU going through a normal game loop (simplified here)
Step 1: Update positions
Step 2: Check for collisions
Step 3: Perform motion calculations
Step 4: AI
At the beginning of each step the main CPU farms out the work to the SPEs. So, you have a burst of activity in the SPEs for each step, then a lull as the main core figures out what to do next.
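As a rough illustration of that foreman/worker loop (Python standing in for real Cell code, with a thread pool playing the part of the SPEs), something like the following captures the shape. The entity layout, `WALL_X`, and the four step functions are all invented for the example; only the step order comes from the list above.

```python
from concurrent.futures import ThreadPoolExecutor

WALL_X = 10  # hypothetical world boundary used by the collision step


def update_position(entity):
    return {**entity, "x": entity["x"] + entity["vx"]}


def check_collision(entity):
    return {**entity, "hit": entity["x"] >= WALL_X}


def motion_calculation(entity):
    # Bounce off the wall if the previous step flagged a collision.
    if entity["hit"]:
        return {**entity, "vx": -entity["vx"]}
    return entity


def ai(entity):
    return entity  # placeholder: no decisions made in this sketch


STEPS = [update_position, check_collision, motion_calculation, ai]


def run_frame(pool, entities):
    for step in STEPS:
        # Burst: the foreman farms this step out to all the workers.
        # Lull: list() blocks until every worker is done before the next step.
        entities = list(pool.map(step, entities))
    return entities


with ThreadPoolExecutor(max_workers=7) as pool:
    frame = run_frame(pool, [{"x": 0, "vx": 1}, {"x": 9, "vx": 2}])
```

Note that no step ever runs concurrently with a different step, so the workers never fight over the same data; they only ever split up one kind of work at a time.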
At the end of the day, people who say "at the end of the day" just REALLY need to stop saying "at the end of the day".
The limiting factor on computing speed in the last several years has not been processor design or clock speed, but memory speed. Normal architectures feature two levels of fast SRAM to insulate the processor from the latencies inherent with accessing DRAM over a shared bus. That doesn't get rid of multi-cycle delays, it just tries to reduce their likelihood. Data cache misses are expensive, but instruction cache misses are even more expensive -- all the pipelining that modern processors use to handle large workloads efficiently will break down every time the processor stalls loading instructions from main memory.
The PS3's Cell processor offers a different solution to the problem -- sub-processors with fast local memory, and an explicitly programmed way to copy memory areas between processors (the "DMA" that the article mentions). The SPEs allow significant chunks of the batch-processing-style parts of a game to run on a processor that has no memory latencies, for data or instructions. Since memory-stall delays can run into the double digits, you can expect the performance increase from fast memory to be in the double digit range too. I've seen a public demo of some medical-imaging software that ran ~50x faster when rewritten for Cell. (The private demos I've seen are similarly impressive, but I can't describe those in detail. :-)
A traditional multi-processing architecture, like the three-core, two-threads-per-core CPU in the X360, has no such escape from the memory latencies. All coordination of memory state between processors (i.e. through the level 2 cache) is done on demand, when a processor suddenly finds it has a need for it. Prefetching is of course possible, but the minor efficiency gains to be made from prefetching (when they can be found at all) are vastly outweighed by the inherent efficiency of explicitly-programmed DMA transfers. Multi-buffering the DMA transfers allows the SPEs to run uninterrupted, without having to wait for the next batch of data to arrive -- something that isn't really possible with a traditional level-2-cache in a traditional multiprocessing system.
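For those who haven't seen multi-buffering before, here is a minimal double-buffering sketch in Python. It is an analogy only: `dma_fetch` is a made-up stand-in for an asynchronous DMA transfer, and a one-thread executor plays the role of the DMA engine. The point is the ordering: the next transfer is kicked off before the current chunk is processed, so the two overlap.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_fetch(source, offset, size):
    """Hypothetical stand-in for an asynchronous DMA into local memory."""
    return source[offset:offset + size]


def process_double_buffered(source, chunk_size, kernel):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Prime the first buffer before entering the loop.
        pending = dma.submit(dma_fetch, source, 0, chunk_size)
        offset = 0
        while offset < len(source):
            current = pending.result()  # wait only if the transfer is still in flight
            offset += chunk_size
            if offset < len(source):
                # Start filling the other buffer before touching the current one.
                pending = dma.submit(dma_fetch, source, offset, chunk_size)
            results.append(kernel(current))  # compute while the next chunk arrives
    return results


chunk_sums = process_double_buffered(list(range(10)), 4, sum)
```

If the kernel takes at least as long as a transfer, the worker never sits idle waiting for data, which is the "run uninterrupted" property described above.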
In short, the very nontraditional setup of the PS3's Cell chip is capable of vastly outpowering the traditional multiprocessor setup of the X360, mostly due to successfully eliminating memory latency.
Yes, writing code that can run like this is a major freaking pain in the ass. But so what? The biggest reason most code is hard to run on such an architecture is that the code was poorly thought out, poorly designed, and not documented. Any decently-written application can be re-factored to run like this. Besides, this is the future: Cell really does seem to solve the memory latency problem that's crippling traditional computing architectures, and the performance difference is astounding. If you can't rise to the level of code written for such a complex architecture, then your job is in danger of getting outsourced to Third World nations for $5 an hour...as it should be. So quit your whining.
(First post in ten months. Feels good!)
"Once we've identified and embraced our sickness, we'll have strength...and that's when we get dangerous." - John Waters
The Cell has been available for programming for a while now. I think reference platforms (i.e. other than PS3 prototypes) might even be available. Cell is being used for far more than the PS3. Also, sure the PS3 might run faster than 3.2 GHz, but you make that sound like a bad thing!
Between them, they have 2 MB of high-speed memory, which (as you say) is becoming fairly common for L2 cache sizes, plus it's got a traditional L2 cache. So I'm not sure what you mean by "crippled". There are plenty of computing problems (including video game development) that can fit into this sort of sub-processor/DMA-communication model. Anyone that's programmed a PS2 knows this (and you sound like a video game programmer). The Cell just pushes it further.
There are plenty of tasks that can be run independently with double-buffered batches of data, and not just scientific computing, but the sorts of tasks that are bound to be prevalent in next-generation video games. Physics simulation, whether for gameplay or weather/cloth/fur/etc. effects, can be made parallel & batchable after broadphase collision. Graphics transformation can be, as it is on the PS2.
"Complicated logic" can communicate between processors using ring buffers and short DMA messages. But that's only if the logic is truly complicated...this doesn't apply if the code is complicated because it's the usual not-designed, poorly-thought-out, uncommented, global/singleton-happy, spaghetti code, which is the real problem most of the time. The only thing that's going to hold up the software industry taking advantage of the Cell processor's capabilities is our own collective lameness.
"Once we've identified and embraced our sickness, we'll have strength...and that's when we get dangerous." - John Waters