Aspyr On Porting Games to the Mac
jvm writes "This in-depth interview with Aspyr's Glenda Adams over at Curmudgeon Gamer discusses in detail the issues of porting games to the Mac. Starting with Civilization on the Mac LC up through today's Tony Hawk Pro Skater 4, Glenda takes on PC vs. Mac system requirements, how games are selected for porting, patching Mac games, and some thoughts on the future." A notable quote from the interview: "The PC often lets you [code/architect] things in a sloppy manner with little penalty, but then when it gets on the Mac it drags the game down."
Blitz Basic
Converting a number between floating point and integer on Macs is actually quite expensive compared to x86 systems. Converting between double and float has a severe penalty, too.
// An untested example: // force out of register. // store. // read.
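To make the conversion point concrete, here is a minimal sketch (not from the original post) of keeping a loop in one numeric type and converting once at the end. The function names and the averaging task are made up for illustration, and how much this helps depends on the compiler and CPU.

```c
#include <stddef.h>

/* Converts on every pass: each int sample becomes a float inside the loop. */
float average_converting(const int *samples, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++) {
        sum += (float)samples[i];   /* int -> float conversion per iteration */
    }
    return count ? sum / (float)count : 0.0f;
}

/* Stays in integer math for the whole loop and converts once at the end. */
float average_single_convert(const int *samples, size_t count)
{
    long long sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += samples[i];          /* no conversion here */
    }
    return count ? (float)sum / (float)count : 0.0f;
}
```

Both functions return the same answer for small inputs; the second just does one conversion of the total instead of one per sample.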
Also, Macs tend to suffer a worse penalty for CPU cache misses.
Then again, there's a handful of general purpose registers on an x86, and 64 of them on the PPC (96 if you count the Altivec registers), so it's assumed that x86 systems will optimize for these cases more than the PPC, whereas the PPC assumes you'll load what you need into registers, work on it, and go back to RAM only when absolutely forced to. This means you should define a bunch of local variables, and load as much as possible into them at the start of a function. On the PPC, this can be several orders of magnitude faster, whereas on the x86, it might be a little slower (since those locals won't translate into registers, you end up shuffling between the heap and stack and working out of RAM anyhow).
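A rough sketch of the "load it into locals up front" advice, assuming a hypothetical Transform struct and apply_transform() function that are not from any real codebase. Whether the compiler would have hoisted these loads on its own depends on what it can prove about aliasing.

```c
#include <stddef.h>

/* Hypothetical struct, for illustration only. */
typedef struct {
    float scale;
    float bias;
} Transform;

void apply_transform(const Transform *t, float *values, size_t count)
{
    /* Load the fields once into locals so they can live in registers,
       instead of being re-read through the pointer on every iteration. */
    const float scale = t->scale;
    const float bias  = t->bias;

    for (size_t i = 0; i < count; i++) {
        values[i] = values[i] * scale + bias;
    }
}
```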
Pre-G5, there's no square root instruction on the CPU. You can fake it with the frsqrte opcode (which all Macs have, though I believe it's optional in the PPC spec). It gives you extremely low precision (five bits...you can actually GUESS with more precision!), but it's fast, and frequently "good enough" after two rounds of Newton-Raphson...we used this (with Newton-Raphson) in UT2003/UT2004 without any noticeable rendering artifacts. The G5 has a real, full-precision square root instruction, which spanks the cheap reciprocal method to boot, but will crash your program (SIGILL) on a G4 or lower. If you just call the system C library's sqrt(), it'll do the right thing based on the running CPU, but to give you real precision, it won't use the reciprocal opcode...so on a G5, sqrt() gives good performance, while on a G4, it'll eat up tons of CPU. The flyby intro on the original Unreal Tournament was spending about 17-20% of its CPU time in sqrt() until I swapped in the reciprocal version.
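The estimate-then-refine idea can be sketched in portable C. This is not the UT code: since plain C can't issue the PPC reciprocal square root estimate opcode, the sketch swaps in the well-known integer bit trick as its rough first guess (an assumption made purely for illustration), then applies two Newton-Raphson rounds. The names rough_rsqrt() and approx_sqrt() are made up.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a hardware estimate: a crude 1/sqrt(x) guess built from the
   float's bit pattern. On a PPC this step would be the estimate opcode. */
static float rough_rsqrt(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);        /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);      /* rough guess, a few percent off */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* sqrt(x) computed as x * (1/sqrt(x)), with two Newton-Raphson refinements. */
float approx_sqrt(float x)
{
    if (x <= 0.0f) {
        return 0.0f;                       /* sketch: no NaN/negative handling */
    }
    const float half_x = 0.5f * x;
    float y = rough_rsqrt(x);
    y = y * (1.5f - half_x * y * y);       /* first Newton-Raphson round */
    y = y * (1.5f - half_x * y * y);       /* second Newton-Raphson round */
    return x * y;
}
```

Each round roughly doubles the number of good bits, which is why a very coarse starting guess can still end up "good enough" for rendering work.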
While I'm talking about sqrt(), a lot of other things developers take for granted on their x86 floating point unit, like sin(), aren't implemented in hardware on the PPC.
Division on the PPC causes a complete pipeline stall (use multiplication where possible). GCC doesn't appear to optimize this case behind the scenes at this point.
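As a small illustration of trading division for multiplication, the sketch below divides once outside the loop and multiplies inside it. The function names are hypothetical, and multiplying by a reciprocal can differ from true division in the last bit, which is presumably why a compiler won't do it for you without being told that's acceptable.

```c
#include <stddef.h>

/* One divide per element. */
void scale_by_division(float *values, size_t count, float divisor)
{
    for (size_t i = 0; i < count; i++) {
        values[i] /= divisor;
    }
}

/* One divide total, then a multiply per element. */
void scale_by_reciprocal(float *values, size_t count, float divisor)
{
    const float inv = 1.0f / divisor;   /* the only division */
    for (size_t i = 0; i < count; i++) {
        values[i] *= inv;
    }
}
```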
On the G5, instructions are broken up into "dispatch groups"...usually five instructions, I think, but it varies due to a few factors. If you write to a memory address and read it back in the same dispatch group, it causes a pipeline stall. This is called an "LSU reject". Developer documentation says that in these cases you should either move the store and load to separate dispatch groups, or at least pad out the dispatch group with no-ops so the load will be in a different group. GCC doesn't necessarily optimize this for you at this point, but I'm not sure where Apple's GCC branch lies in relation to the mainline version (which can now handle this).
LSU rejects are, however, a somewhat common optimization gotcha:
static int myvar1;
myvar1 = somefunc();
int myvar2 = myvar1 + 10;
The solution is to move the addition down a few lines of code so that the compiler doesn't put it in the same dispatch group...padding out with no-ops isn't really practical, and it's something the compiler should be doing anyhow. More to the point, your x86 developer isn't going to think something as harmless as having those two lines next to each other would be an optimization issue. Why should he?
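A different source-level way to sidestep the read-back, sketched here as an alternative to moving the addition (it is not what the post describes): keep the result in a local and do the arithmetic on the local, so nothing has to be loaded back from the static right after it was stored. The somefunc() body and the wrapper function names are placeholders, and an optimizing compiler may already generate this on its own.

```c
static int myvar1;

/* Placeholder for the somefunc() in the snippet above. */
static int somefunc(void)
{
    return 32;
}

int update_and_offset(void)
{
    const int result = somefunc();  /* value can stay in a register */
    myvar1 = result;                /* store to the static */
    return result + 10;             /* uses the local, no read of myvar1 */
}

/* Accessor so callers can observe the stored value. */
int current_myvar1(void)
{
    return myvar1;
}
```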
Optimization "truths" of the x86 aren't necessarily true on the PowerPC, either:
1) Loop unrolling is generally believed to be a "good optimization" on the x86, but it thrashes your instruction cache on the PPC. Actually, this is probably true on modern x86 chips nowadays, too, but on the PPC, Cache Is King.
2) Lookup tables are generally believe
Don't say, "don't quote me," because if no one quotes you, you probably haven't said a thing worth saying.
I agree that the Mac version of Neverwinter Nights came out late, the expansion packs aren't officially for the Mac yet, and I'm sorry that you're having trouble getting a copy locally.
In the US, the Mac version is $45 vs. $30 for the PC version. Copies are fairly easy to get from most outlets that carry Mac stuff (CompUSA, Apple Stores, online at MacMall, MacZone, Amazon, etc).
The OpenKnights project has an auto-updater for the Mac version which also will auto-magically install the PC versions of the expansion packs for you.