Mac OS X Built For CISC, Not RISC
WCityMike writes "One of the programmers at Unsanity, maker of haxies, recently posted a rather shocking relevation on the company's weblog. He says that Mac OS X's Mach-O runtime ABI (Application Binary Interface) comes from a NeXTStep design for 68K processorts, and is not designed for the PowerPC architechture. Had they used the latter, things would have been approximately 10-12 percent faster. And supposedly, they can't fix it now without breaking all existing applications." The developer mentions there are workarounds in the newest GCC, but only for newly compiled programs.
Here is the text:
Unsanity.org
October 21, 2002
Mach-O ABI
Here I go again, ranting about Mac OS X bowels. This time I want to talk about particular implementation details of Mach-O runtime ABI (Application Binary Interface). Before you get too confused, there are two different things under the 'Mach-O' name:
* Mach-O ABI, which defines how every application in the system executes and calls functions (stack conventions, register usage, and more);
* Mach-O File Format, which defines a way for compiled executables to store different parts of them in the same file (compiled code, data, strings, etc).
The latter is not what I want to talk about today; the first is what puzzles me most. I admit I am just a "small programmer" with no relationship with the powers-that-be at Apple at all (this means, no insider contacts who can explain the reasoning behind the particular important design decisions to me), so my impressions, judgements and guesses expressed in this article may be slightly or totally off the mark. I, however, as many other developers who have dug deep into the implementation of such things, can see obvious drawbacks and oddities about Mach-O ABI, and this is what I am going to talk about.
Mach-O originates from NeXTstep, an operating system created at NeXT for its NeXTstation machines, and later expanded to x86 hardware. NeXTstations were originally based on Motorola 68k CPUs, just like old Macintoshes. Mac OS (classic), on the other hand, used an ABI for PowerPC which followed the ABI principles defined in a document by IBM/Moto for PPC processors. So as you all may already know, m68k and x86 are CISC architectures; PowerPC used in all new Macs is RISC.
To make a long story short, Mac OS X uses an ABI designed for CISC processors, mostly ignoring RISC design principles.
What do I mean by that? Mach-O ABI we see now used in Mac OS X is more or less a direct port of NeXT's Mach-O designed for m68k - it relies on PC (program counter) register to perform various manipulations with data (for the geeks: PC-relative addressing). There's nothing wrong with that, as its an effective and common practice, except for one little thing: there is no PC register in RISC processors (programmatically accessible). That is not a show-stopper though - Mach-O for PowerPC just takes one of 32 general purpose registers and turns it into a program counter-style register, to base all offset calculations off it. That works well, as you can see, as all of Mac OS X applications (except for the ones compiled with Carbon/CFM) use the Mach-O ABI.
That approach works well, except for one small thing: global/static data access adds about 7 cycle overhead per function, and about triple of that for cross-context calls (that is for the G4 class processor) compared to the old, Mac OS Classic ABI (excuse me for the geek talk). Mac OS Classic CFM ABI, in comparison, needed almost 0 cycles for static data access and about 5 for cross-context calls. To rephrase - applications in Mac OS X could be faster, if the Mach-O ABI followed the principles set for the PowerPC chip, and not the ones created over a decade ago for CISC ones.
This brings us the question, "how much faster would the applications be if the ABI was done right?". The answer is, according to some tests done by my friends at #macdev IRC channel, the speed gain would be 10-30%, depending on each particular application (how often does it calls functions). Realistically, the speed gain would be around 10 to 12 per cent (how do I get these numbers, below).
So why did Apple used an outdated ABI for a modern operating system? Frankly, I don't know the reason. About the best one I have heard - it saved Apple a few months in the Mac OS X development time so they didn't have to do massive updates to its NeXT-derived tool chain.
There are signs of change though -- the recent update to GCC, the compiler shipped with OSX, allows it to perform so-called -mdynamic-no-pic optimization, which hard-codes the data addresses in the code, so the result is roughly equivalent to the CFM ABI used in Mac OS Classic -- so the GCC itself, compiled with that optimization, is 10% faster. Applications, to take advantage of that, need to be recompiled, so it doesn't affect 80% of the titles already shipped for Mac OS X. Then again, the optimization above only works for executables and not shared libraries.
Either way, there is no way to change the ABI now, as it would break all of the existing applications - which is obviously not what Apple (or us) would want.
And after all, who cares about a 10% speed loss? You can always get a faster Mac, right?
Further reading:
Mach-O Runtime Architecture [developer.apple.com]
CFM-Based Runtime Architecture [developer.apple.com]
Thanks to #macdev regulars and an anonymous Apple engineer for helping me with this article.
Update 10/21: fixed a few phrases in the text to make it more clear; I've also been told OPENstep runs on RISC processors (non-PowerPC) - however, I have not investigated how the Mach-O ABI works there - quite possible it obeys the PowerPC guidelines, although I am pretty sure it does the same as on PowerPC.
Posted by slava at October 21, 2002 03:40 AM
Keep in mind that gcc3 optimizations mentioned would only affect applications, *not* the dynamic shared libraries, which are a significant part of the OS (think /System/Library/Frameworks)
> I wonder if Apple tried to keep this secret...
No. The ABI is documented with the developer tools - without it you'd have to try and reverse engineer it from existing binaries to write assembly code.
It just so happens that I friend of mine has a copy of "PowerPC Mircoprocessor Family: Programming Environments for 32-bit Microprocessors" sitting on his desk, which I grabbed. Here is how PowerPC processors branch (from sectino 4.2.4.1 of said dead-tree document):
1. Branch relative addressing mode - the immediate displacement operand is sign exteneded and added to the current instruction address to produce the branch target address. So, PC relative addressing. There is no need for a programmatically accessible program counter because this is all done by the branch execution unit. Single 32-bit instruction.
2. Branch conditional to relative addressing mode - same as branch relative addressing, except that the branch is only executed if the proper condition codes are set. Single 32-bit instruction.
3. Branch to absolute addressing - the operand address is sign extended and used as the branch target. As the name implies, this is absolute addressing. Only problem is, the operand address is only 23 bits wide in a 32-bit implementation, and with the zero pad, it gives only 25 bits of absolute address (word alignment required). So, if you absolute address anything, you can only absolute address 25 bits worth of the address space.
4. Branch conditional to absolute - same as regular absolute addressing, except that you have to encode condition codes, so the operand address is nowo only 13 bits if I read the diagrams correctly, meaning that you can only absolutely address 15 bits of address space with the zero pad.
5. Branch conditional to link register - if you clobber the link register, you can branch to a 32-bit address. Of course, you have to clobble the link register, so I would think this would be most helpful in returning from a function call, not going to it, since the link register holds the return address. And if you use it forward instead of returning, you have to load the link register.
6. Branch conditional to count register - same as link register branching as above.
All of that said, the reason that the Mac OS ABI uses PC relative addressing is because the only way to fully address a 32-bit address space is to do PC relative addressing. According to this book, there is no two instruction width branch, eg a branch instruction which encodes an entire 32-bit absolute address in two 32-bit words (one word for branch encoding and condition codes, one word for the whole 32-bit address).
This leads me to believe that there is no way to do all absolute addressing on PowerPC unless you implement new instructions (which will take more time to get to the processor, and to decode) or limit yourself to 15 or 25 bits of the address space.
So, the short version is that that there is no way for the Mac OS ABI to do absolute addressing.
Outside of a dog, a book is a man's best friend. Inside a dog, its too dark to read.
1. Don't use externs or static variables.
2. If you are going to use an extern variable in a tight loop, don't use a local variable and assign it after the loop.
3. Pass the option -mdyanmic-no-pic to gcc if the source is in the final program because it does not work in a boundle or a dynamic library (or framework).
The AIX ABI/PEF ABI uses a register called the TOC for PIC code but it is stored with the function reference so you lose one register if the Darwin ABI goes over to the PEF ABI. You get one more register to play around with if you do not use extern or static variables.
A 64-bit PowerPC chip doesn't need to "emulate" anything to run 32-bit code, unlike the Itanium, which uses completely different instructions. There should be no speed hit: the only real difference is that the CPU can perform calculations using all 64 bits. This also won't remove the speed hit caused by the ABI.
> So, the point is that in every case, some form of relative addressing is used. In order to make relocatable code, ie code that can be linked happily with other binary objects, you have to have some sort of reference address, and PC-relative addressing is the only way to do this.
.bar:
This is wrong. The PowerPC ABI, as defined by IBM, uses r2 as a TOC (Table of Contents) pointer. The PC is never needed or used as all data space references are made relative to the TOC, not the PC. Apart from being faster, this has several other advantages, not the least of which is that one copy of code can have multiple data contexts without involving VM.
int foo;
int bar(void) { return foo; }
with macho:
_bar:
mflr r0,lr
bl *+4
mflr r2
mtlr r0
addis r3,r2,ha16(foo)
lwz r3,lo16(foo)(r3)
blr
with IBM conventions:
lwz r3,foo(rTOC)
blr
I remember those days, A4 and A5 worlds, one was used for globals in applications while the other was used for globals in resource code (i.e. extensions, control panels, and modules for programs). But Apple never used the PC as the basis for the position independence on the 68K or the PPC until Darwin (Mac OS X). The problem for using the PC is that you have to do a branch and store the result only if you need to externs and such. The problem for using a register that stores the value all the time is that you are wasting a register (68K case: A4 and A5; PPC case RTOC or r2).
Apple supposedly saved some time using Mach-O because they wouldn't have to rewrite their linker to use CFM/PEF. If we're going to hack Mach-O so it uses the standard PowerPC ABI why not just support PEF/CFM?!
There are many reasons for this. Not only are there tools already available to handle PEF/CFM, you wouldn't need the whole PEF interpreter w/shim libraries to run PEF Carbon apps!
>80 column hard wrapped e-mail is not a sign of intelligent
>life
So what ABI will the libraries use?
Supporting an extra ABI isn't difficult at the execution level provided you have a separate set of libraries or shims like they do for PEF binaries now. That's also how you can run 32 and 64bit apps on the same system.
But what now? Another set of shims?
The whole situation is ridiculous. They should use PEF/CFM natively.
>80 column hard wrapped e-mail is not a sign of intelligent
>life
Lars T.
To the guy who modded me down from perfect to terrible Karma - Apple haters still suck