Linux x32 ABI Not Catching Wind
jones_supa writes "The x32 ABI for Linux allows the OS to take full advantage of an x86-64 CPU while using 32-bit pointers and thus avoiding the overhead of 64-bit pointers. Though the x32 ABI limits the program to a virtual address space of 4GB, it also decreases the memory footprint of the program and in some cases can allow it to run faster. The ABI has been talked about since 2011 and there's been mainline support since 2012. x32 support within other programs has also trickled in. Despite this, there still seems to be no widespread interest. x32 support landed in Ubuntu 13.04, but no software packages were released. In 2012 we also saw some x32 support out of Gentoo and some Debian x32 packages. Besides the kernel support, we also saw last year the support for the x32 Linux ABI land in Glibc 2.16 and GDB 7.5. The only Linux x32 ABI news Phoronix had to report on in 2013 was of Google wanting mainline LLVM x32 support and other LLVM project x32 patches. The GCC 4.8.0 release this year also improved the situation for x32. Some people don't see the ABI as being worthwhile when it still requires 64-bit processors and the performance benefits aren't very convincing for all workloads to make maintaining an extra ABI worthwhile. Would you find the x32 ABI useful?"
no
If I wanted to divide my nice big memory space into 32-bit address spaces, I'd dig my totally bitchin' PAE-enabled Pentium Pro rig out of the basement, assuming the rats haven't eaten it...
I do not see many cases where this would be useful. If we have a 64-bit processor and a 64-bit operating system, then it seems the only benefit to running a 32-bit binary is that it uses slightly less memory. Chances are the difference in memory used is very small. Maybe the program loads a little faster, but is it a measurable, consistent amount? For most practical use cases it does not look like this technology would be useful enough to justify compiling a new package. Now, if the process worked with 64-bit binaries and could automatically (and safely) decrease their pointer size, then it might be worthwhile. But I'm not going to rebuild an application just for smaller pointers.
The maintainer(s) find it interesting, and they're developing it on their own dime... so I don't get the hate in some of these first few posts. No one's forcing you to use it, or even to think about it when you're coding something else.
If it's useful to someone, that's all that matters.
#DeleteChrome
x32 at least has some merit, unlike your grasp of the history of computing. (Just not very much and probably not worth the trouble; you can probably relate.)
The company I work for compiles almost all programs as 32-bit on x86-64 CPUs. It's not only cheap RAM usage, it's also expensive cache that is wasted by 64-bit pointers and 64-bit ints. Since 3 GB is much more than our programs use, x86-64 would be foolish. I'm eagerly waiting for an x32 SuSE version.
I would not go that far, since I'm sure a special case may exist, but that's exactly what it would be for. Hence 'no massive wide-scale adoption' and few 'applications written for this' are what should be an obvious outcome.
If I'm custom Joe and see a workload that benefits from 32 vs. 64bit OS constraints I load a 32bit OS. The reason we went to larger memory however means those special cases are extremely rare today. They happen more because "we can't get new hardware" than by choice.
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
The idea makes sense in theory. Build binaries that are going to be smaller (32-bit binaries have smaller pointers compared with 64-bit) and faster (because the code is smaller, in theory cache should be used more efficiently and accesses to external memory should be reduced).
But I suspect the problem is that the benefits simply don't outweigh the inconvenience of having to run with an entirely separate ABI. I doubt the average significant C program spends a lot of time doing direct addressing, and as such I suspect the size benefits of using 32-bit pointers are overstated.
Memory? What about cache? Is cache dirt cheap?
For some workloads, it's ~40% faster than amd64, and for some, even more than that compared to i386. In the typical case, though, you see a ~7% speed and ~35% memory improvement over amd64.
As for memory being cheap, this might not matter on your home box where you use 2GB of 16GB you have installed, but vserver hosting tends to be memory-bound. And using bad old i386 means a severe speed loss due to ancient instructions and register shortage.
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
It's not just about "having enough RAM". While that certainly is a factor, it's not the only one. As you suggest, pretty much everyone has enough RAM to run just about any normal application with 64-bit pointers.
But if you want speed, you also have to pay attention to things like cache lines. 64-bit pointers often mean that larger instructions have to be encoded to do the same work, and larger instructions mean more cache misses. This can make a large difference in performance.
debootstrap --arch=x32 unstable /path/to/chroot http://ftp.debian-ports.org/debian/
Requires an amd64 kernel compiled with CONFIG_X86_X32=y (every x32-capable kernel can also run 64 bit stuff).
Who would want this, some niche embedded guys?
Not many NEGs are using 64 bit processors, and this ABI offers too little advantage to bother with. Most embedded systems run a single primary process. If that process fits in a 4GB address space (as is required to use this ABI), then the system would just use a native 32 bit ABI on a 32 bit CPU, not this 32 bit ABI on a more expensive 64 bit CPU.
My dad drives a Ford and your dad drives a Chevy. Your dad sucks.
Didn't we do this already? Like when we were twelve years old.
Wouldn't this require all common shared libraries (glib, mpi, etc.) to be recompiled for both x86-64 and x32? What am I missing here?
we went 64 bit for a reason.
We went to x86-64 for three reasons: 64-bit integer registers, more integer registers, and 64-bit pointers. Some applications need only the first two of these three, which is why x32 is supposed to exist.
I could get into specifics but I shan't, because what you're blathering about has zero relevance for x32. It's not a replacement-to-be for the usual amd64 ABI, nobody is going to break amd64 to make x32 run. It's mostly a specialist tool for specific workloads (aside from being a hacker's playground, as are many things). Whether thinking it's useful as such is misguided or not, you're more so.
In answer to my question, no, it is not dirt cheap. For any size cache you will get fewer cache misses if your data structures are smaller than if they are larger. Until the cache is so big that everything fits in it, you always win if you can double what you can cram into it.
You've not understood this correctly. x32 is an enhancement and optimization for executable files that do not require gigabytes of RAM, primarily regarding performance. It has nothing to do with the availability or lack of RAM in the system, or how much RAM costs to buy in the computer store.
Signature intentionally left blank.
You've just misunderstood it. It is in essence a performance enhancement, and you would benefit from it simply by selecting the x32 target (instead of x86-64) when compiling.
Of course its a tradeoff, because the new RAM will have less of its spare ECC bits used up.
ECC memory is artificially expensive. Were ECC standard as it ought to be, it would only cost about 12.5% more. (1 bit for every byte) That is a pittance when considering the cost of the machine and the value of one's data and time. It is disgusting that Intel uses this basic reliability feature to segment their products.
Which is all nice and good except this implies your data structure was mostly pointers to begin with
And that's exactly the case of scripting languages, where every structure (say, a Python object) is a collection of pointers to methods and data.
`echo $[0x853204FA81]|tr 0-9 ionbsdeaml`@gmail.com
This sure feels a lot like a throwback to the old 16-bit DOS days, where you had small/medium/large memory models depending on the size of your code and data address spaces. We've already got 32-bit mode for supporting pure 32-bit apps and 64-bit mode for pure 64-bit; supporting yet a third ABI is just going to result in more bloat as all the runtime libraries need to be duplicated for yet another combination of code/data pointer size.
I hate to say this since I'm sure a lot of smart people put significant effort into this, but it seems like a solution in search of a problem. RAM is cheap, and the performance advantage of using 32-bit pointers is typically small.
ABI = Application Binary Interface. Defines the pointer sizes and conventions for passing function arguments at the object code level (among other things). The ABI determines how the compiler generates object code for function call/entry/exit, and the width of pointer types.
API defines the interfaces seen by the programmer.
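To make the pointer-width part of that concrete, here is a minimal sketch that reports which ABI a C file was built for. It assumes GCC or Clang, which predefine `__x86_64__` for both amd64 and x32 and add `__ILP32__` when pointers are 32 bits wide; the function names `abi_name` and `pointer_bits` are made up for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#if defined(__ILP32__)
/* Sanity check: the ILP32 macro should agree with the real pointer size. */
static_assert(sizeof(void *) == 4, "ILP32 implies 4-byte pointers");
#endif

/* Name of the ABI this translation unit was compiled for.
 * Assumes the GCC/Clang predefined macros; other compilers may differ. */
const char *abi_name(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
    return "x32";    /* 64-bit instruction set, 32-bit pointers */
#elif defined(__x86_64__)
    return "amd64";  /* 64-bit instruction set, 64-bit pointers */
#elif defined(__i386__)
    return "i386";   /* legacy 32-bit instruction set */
#else
    return "other";
#endif
}

/* Pointer width in bits: the part of the ABI that x32 changes. */
size_t pointer_bits(void)
{
    return sizeof(void *) * CHAR_BIT;
}
```

Built with `gcc -mx32` (which needs the x32 libraries installed), `abi_name()` should return "x32" and `pointer_bits()` 32; a default build on a 64-bit distro should give "amd64" and 64.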
That's right. Unfortunately it's called the market. The boneheads who say x32 isn't worth it are the same boneheads who have no idea how important ECC is, or how hard it is to write code properly while worrying about cache hits. Probably people who never wrote a single line of C or assembly code.
But the Intel way of making the same physical hardware cost 50% more (with a simple on/off switch) will continue until ARM Cortex starts giving Intel some real competition (at least competing with the latest-gen Core i3).
In the ARM world, you can still get a 10-year-old CPU design (and for a pittance) because there's no forced obsolescence like Intel's.
Anxiously waiting for quad core Cortex A57 chromebooks in 2014 with 4GB of RAM. And a raspberry pi (or similar) with Cortex A53.
It's possible to have a system with 16GB that uses only x32 (the kernel is still x86_64 under x32, so the kernel can see all 16GB), for instance running thousands of tasks using up to 4GB each just fine. Plus, the page cache is a kernel thing, so the I/O cache can always use all memory.
On the other hand, there are workloads that run on a 4GB system but need x86_64 (mmapping huge files, for instance). There are also boneheaded tasks that reserve tons of never-used RAM: one could actually use 1GB of RAM but reserve 8GB. The real fix there should be putting the coder in jail, but I digress.
But the vast majority of Linux workloads today that use even an 8GB system would run just fine under x32. Like 95-98%.
And nobody is even suggesting a mainstream Linux distro without an x86_64 userland. I'm suggesting that all standard tools use x32, while keeping the x86_64 shared libraries and compilers, so that if you need to you can run some apps with full 64-bit capability. Just use x32 by default.
Plus it's a good way to remind lazy developers that no matter how cheap RAM is, you should be serious about being efficient (especially the KDE developers)!
KDE functionality is great, but they really have no clue about efficiency (RAM and CPU).
Colonel Panic to be precise. He reports directly to General P. Fault.
No, it's not the same.
The idea is that you use the 32-bit pointer model, with 32-bit indirect instructions, but you're doing it all using the x86-64 instruction set. I.e., the task is in 64-bit mode. The main thing 64-bit mode brings is more registers, so you can write or compile tighter code.
The stuff you described is for running 32-bit binaries that use the i386/i486/i586 instruction set, complete with its limited set of temporary registers. x86-64 has many more registers to use.
It's not just about cache lines. :)
With x32 you get:
- You get 16 registers instead of 8. This allows much more efficient code to be generated: with less register pressure, you don't have to dump automatic variables to the stack and reload them.
- You also get a crossover from the 64-bit ABI where the first 6 arguments are passed in registers instead of push/pop on the stack.
- If you need a 64-bit arithmetic op (e.g. long long), the compiler will generate a single 64-bit instruction (vs. using multiple 32-bit ops).
- You also get the RIP-relative addressing mode, which works great when a lot of dynamic relocation of the program occurs (e.g. .so files).
You get all these things [and more] if you port your program to 64 bit. But, porting to 64 bit requires that you go through the entire code base and find all the places where you said: ...
int x = ptr1 - ptr2;
instead of:
long x = ptr1 - ptr2;
Or, you put a long into a struct that gets sent across a socket. You'd need to convert those to ints.
Etc
Granted, these should be cleaned up with abstract typedefs, but porting a large legacy 32-bit codebase to 64-bit may not be worth it [at least in the short term]. A port to x32 is pretty much just a recompile. You get [most of] the performance improvement for little hassle.
It also solves the 2038 problem because time_t is now defined to be 64 bits, even in 32-bit mode. Likewise, in struct timeval, the tv_sec field is 64 bits.
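For the struct-over-a-socket case, one way to sidestep the long problem entirely is to use fixed-width types and let the compiler check the layout. This is only a sketch: `struct wire_msg` and its fields are hypothetical, and byte order is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout, for illustration only. Fixed-width types
 * keep every field the same size on i386, x32 and amd64, where "long"
 * would be 4, 4 and 8 bytes respectively. Fields are ordered so that no
 * padding is needed. */
struct wire_msg {
    int64_t  timestamp;  /* seconds; 64 bits whatever time_t happens to be */
    int32_t  sequence;   /* a "long" in the imagined legacy code */
    uint32_t length;     /* payload length in bytes */
};

/* Fail the build, rather than corrupt the protocol, if the layout drifts. */
static_assert(sizeof(struct wire_msg) == 16, "wire_msg must be 16 bytes");
static_assert(offsetof(struct wire_msg, sequence) == 8, "unexpected padding");
static_assert(offsetof(struct wire_msg, length) == 12, "unexpected padding");
```

With a layout like this, the same source should produce the same bytes on the wire whichever of the three ABIs it is compiled for, so the x32 recompile and a later amd64 port would not disagree with each other.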
Like a good neighbor, fsck is there
wrong architecture.
Cost sensitive embedded systems use ARM based microprocessors to which this is not applicable.
I can recompile and run 20 year old SunOS apps no problem with OpenSolaris. Try that with Linux?
Depends on what it's looking for, but in theory it should work. 20 years? CLI or GUI based? It probably wants Tcl/Tk and/or Motif if it's GUI, so make sure they're installed. I'm willing to try, if you have source code that old...
Hairyfeet mentioned he tried linux and people kept calling back angry that their printer stopped working after an Ubuntu update.
I did not even know it existed. I will keep Linux on a VM I suppose, but only CentOS, as Red Hat likes to keep ABIs that do not break after each freaking update!
If you need stability then you should go with a stable OS. Fedora, OpenSuSE, and Ubuntu change too fast for enterprise use - which is what makes RHEL great.
With that said, I don't seem to have issues running some older software I have lying around for Linux. Oracle Database 8 installed on RHEL 5 when I last tried it, and an old version of the Code Forge IDE, designed for Red Hat Linux 5.x/6.x (old Red Hat, not RHEL), ran on a recent Fedora (I think I last installed it on FC 16). Similar results with Matlab. The software isn't broken by kernel changes; it's the required libraries that change. Static versus dynamic linking makes a big difference in how long your software lasts, and from what I've seen, programs looking for a particular glib or libc are the biggest offenders on Linux. Windows has seen some issues with that over the years too (dropping DOS libraries, dropping Win16 from 64-bit Windows, etc).
Most UNIX operating systems do seem to maintain greater compatibility in userland, but I've had issues on IRIX with stuff built for 5.x not working on my systems (Octanes running 6.5.x). It's the same deal, though: dynamically linked programs not being able to find their libraries.
I'm starting to think GNU is the problem with "GNU/Linux" these days.
He's right. If you mix x32 and amd64 binaries on the same system, then you need two copies of every shared library that they use to be mapped at the same time. And this means that every context switch between them is going to be pulling things into the i-cache that would already be present (assuming a physically-mapped cache, which is a pretty safe assumption these days) because the other process is using them.
This is why x32 doesn't make sense on a consumer platform like Ubuntu unless the entire system is compiled to use it, making the entire article a 'well, duh'. The real advantage of x32 is on custom deployments and embedded systems where you can build everything in x32 mode.
Oh, and on the subject of caches, x86 chips typically have 64 byte cache lines. If you make pointers 4 bytes instead of 8, then you can fit twice as many in a cache line, which is usually nice. It can be a problem for multithreaded applications though, because you may now end up with more contention in the cache coherency protocol.
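A rough sketch of that arithmetic, assuming the 64-byte line mentioned above. The two node structs are hypothetical, and integer fields stand in for the pointers so that both layouts can be compared in a single build; the 24-byte figure for the wide node assumes an amd64 or x32 compiler, where uint64_t is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64  /* typical x86 line size, per the comment above */

/* A doubly linked list node as laid out with 8-byte pointers.
 * On amd64 this is 24 bytes: 20 of data plus 4 of tail padding. */
struct node_ptr64 {
    uint64_t next;
    uint64_t prev;
    int32_t  value;
};

/* The same node with 4-byte pointers, as under x32 or i386: 12 bytes. */
struct node_ptr32 {
    uint32_t next;
    uint32_t prev;
    int32_t  value;
};

static_assert(sizeof(struct node_ptr32) == 12, "narrow node should be 12 bytes");

/* How many whole objects of a given size fit in one cache line. */
size_t per_cache_line(size_t object_bytes)
{
    assert(object_bytes > 0);
    return CACHE_LINE_BYTES / object_bytes;
}
```

Bare pointers go from 8 to 16 per line, and this particular node goes from 2 to 5 per line, so pointer-heavy structures can gain more than the raw 2x once padding is taken into account.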
I am TheRaven on Soylent News
The C standard does not guarantee that sizeof(long) is as big as sizeof(void*). The type that you want is intptr_t (or ptrdiff_t for differences between pointers). If you've gone through replacing everything with long, then good luck getting your code to run on win64 (where long is 4 bytes).
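A minimal sketch of what that looks like in practice; the two helper names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Holds on every mainstream ABI, including x32 and win64. */
static_assert(sizeof(intptr_t) >= sizeof(void *), "intptr_t narrower than a pointer");

/* Number of elements between two pointers into the same array.
 * ptrdiff_t is the type the language itself gives to pointer subtraction,
 * so nothing is truncated on i386, x32, amd64 or win64. */
ptrdiff_t element_distance(const int *first, const int *last)
{
    return last - first;
}

/* Round-trip a pointer through an integer. Where intptr_t exists, it is
 * specified to survive this conversion; long carries no such promise. */
void *through_integer(void *p)
{
    intptr_t as_int = (intptr_t)p;
    return (void *)as_int;
}
```

Code written this way should not care whether long is 4 or 8 bytes, which is the whole point.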
Won't this require a 2nd copy of the shared libraries in memory, which will negate the benefit of a slightly smaller binary?
You're running a Python script and you care about L1/L2 cache efficiency??
Your system is probably context switching between hundreds of MB.
Amongst.
Your system is context-switching amongst hundreds of MB.
Frohe Weihnachten from das Grammer-SS!
Yes and no. The larger your cache, the higher its latency. Can't get around this. L1 caches tend to be small to keep the execution units fed with typically 1 or 2 cycle latencies. L2 caches tend to be about 16x larger, but have about 10x the latency.
L2 cache may have high latency, but it still has decent bandwidth. To help hide the latency, modern CPUs have automatic pre-fetching and also async low-priority pre-fetching instructions that allow the programmer to tell the CPU to attempt to load data from memory into L1 prior to needing it, and only if the CPU finds an open slot for memory access.
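As a sketch of the software side of this: GCC and Clang expose those instructions through `__builtin_prefetch`, which is only a hint and never changes the result. The function name and the prefetch distance below are assumptions picked for illustration, not tuned values.

```c
#include <assert.h>
#include <stddef.h>

/* How far ahead to prefetch, in elements. A guess for illustration; the
 * right distance depends on the CPU and on the work done per element. */
#define PREFETCH_AHEAD 16

/* Sum values[index[i]] for i in [0, n). The indirection makes the access
 * pattern hard for the automatic prefetcher to predict, which is the case
 * where an explicit hint has a chance of helping. */
long sum_indirect(const int *values, const size_t *index, size_t n)
{
    long total = 0;
    assert(values != NULL && index != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = read access, 1 = low temporal
             * locality. The CPU is free to ignore the request. */
            __builtin_prefetch(&values[index[i + PREFETCH_AHEAD]], 0, 1);
        }
        total += values[index[i]];
    }
    return total;
}
```

Whether this is faster than the plain loop has to be measured on the target machine; on small arrays that already sit in L1 it is more likely to be a wash or a slight loss.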
After a certain size, "normal" cache is slower than main memory. That's why we're starting to see integrated eDRAM, which is mostly just system memory built into the CPU or package. The other issue you need to be careful about is that each layer of cache adds cumulative fixed latency.
The easiest way to hide high latency is to have lots of concurrent work going on. Hyper-threading banks a lot on this. When one virtual core is stalling on memory access, the other core can step in and make use of any free execution units on a per-cycle basis. Because there are lots of units, there is usually an idle one somewhere that can be used. Ironically, having two virtual cores sharing the same resources means resources are split, primarily the L1 cache. While hyper-threading helps hide latency caused by memory access, it also increases the chance of an address getting evicted from L1.
To help with this, Intel increased the size of their L1 cache on the more recent CPUs, but this also increased the latency from 1 cycle to 2 cycles. To help compensate, they increased the bandwidth of the L1 and allowed larger loads. Twice the bandwidth, twice the size, but twice the latency. Single-threaded code takes a minor hit, but concurrent work stands to gain a decent amount.
Increasing cache sizes is not as simple as it seems.