Porting Linux Software to the IA64 Platform
axehind writes "In this Byte.com article, Dr Moshe Bar explains some of the differences between IA32 and IA64. He also explains some things to watch out for when porting applications to the IA64 architecture."
Now I, and the other two IA64 users, will have some programs to run on our Linux-64 boxes!
Can someone please port nethack for us?
- A.P.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
Well obviously what we'll see next is a kernel extension that dynamically 'ports' all your applications to IA-64 and transparently migrates them to IA-64 machines elsewhere in the cluster. When Intel's next Great Leap Forward is released, you'll be able to transparently migrate to that as well. In fact it will be so transparent, you won't notice any difference and you can continue working at your 80286-based machine without any interruption.
-- Ed Avis ed@membled.com
Oh please.
return (char *) ((((long) cp) + 15) & ~15);
is not portable.
return (char *) ((((size_t) cp) + 15) & ~15);
is much better.
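A third option, where C99's <stdint.h> is available, is uintptr_t, an unsigned integer type meant for round-tripping object pointers. A minimal sketch; the function name align16 is made up for illustration:

```c
#include <stdint.h>

/* Round a pointer up to the next 16-byte boundary.
 * uintptr_t is an optional C99 type; this sketch assumes the
 * platform provides it. */
static char *align16(char *cp)
{
    uintptr_t addr = (uintptr_t)cp;

    /* Build the mask in the pointer-sized type so that the upper
     * bits are kept on 64-bit targets. */
    addr = (addr + 15u) & ~(uintptr_t)15;
    return (char *)addr;
}
```

Called as align16(buf + 1), it returns the first address at or after buf + 1 whose low four bits are zero.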
Ah, porting to a homogeneous ISA but with a bigger word size. Funny how it's the same old issues over and over again. Structs change in size, bad assumptions get made about the size of things such as size_t, sizeof(void *) != sizeof(int) (though sizeof(void *) == sizeof(long) seems to be pretty good at holding true here), etc. Of course now there are concerns about misaligned memory accesses, which on IA32 were just a performance hit. Most IA32 types are not used to being forced to be concerned about this (of course many *NIX/RISC types are very used to this).
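A small sketch of two of those issues. The struct and the helper name load_u32 are hypothetical; the sizes in the comments assume the usual ILP32 and LP64 Linux conventions:

```c
#include <stdint.h>
#include <string.h>

/* This struct grows when pointers grow: typically 8 bytes on ILP32
 * and 16 bytes on LP64 (pointer width plus alignment padding). */
struct record {
    int   id;
    void *next;
};

/* Read a 32-bit value from an address that may not be 4-byte aligned.
 * Copying through memcpy avoids dereferencing a misaligned pointer,
 * which can trap on architectures that enforce alignment. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);
    return v;
}
```

Code that writes sizeof(struct record) bytes to disk on one platform and reads them back on the other is the classic way to get bitten by the first issue.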
When things were shifting from 16 to 32 bit (seems like just yesterday, oh wait, for M$ it was just yesterday), we had pretty much the same issues. Never had to do any 8 -> 16bit ports (since pretty much everything was either in BASIC, where it didn't matter, or assembler, which you couldn't "port" anyway).
Speaking of assembler, I guess the days of hand crafting code in assembler are really going to take a hit if IA64 ever takes off. The assembler code would be so tied to a specific rev of EPIC that it would be hard to justify the future expense of doing so. It would be interesting to see what type of tools are available for the assembler developer. Does the chip provide any enhanced debugging capabilities (keeping writes straight at a particular point in execution, can you see speculative writes too?). It'd be cool if the assembler IDE could automagically group parallelizable (is that a word?) instructions together as you are coding.
-- Ed Avis ed@membled.com
Debian is already ported to the IA64 -- not sure about the number of packages ported yet, but I know they intend to release the new 3.0 (woody) with an IA64 port.
See here for more details
zadok.org.uk
In the article he mentions that itanic can execute IA32 code _and_ PA-RISC code natively, as well as its own, but these features will be taken away sometime in the future.
Does anyone remember the leaked benchmarks that showed the itanic executing IA32 code at roughly 10% of the speed of an equivalently-clocked PIII?
I wonder how it shapes up on PA-RISC performance?
It has to offer some sort of advantage over existing chips, or no one will buy it.
On the other hand, maybe its tremendous heat dissipation will reduce drastically when they remove all that circuitry for running IA32 and PA-RISC code.
Which leads me to think, why didn't they invest the time and money in software technology like dynamic recompilation, which Apple did very successfully when they made the transition from 68k to PPC?
I'm out of my tree just now but please feel free to leave a banana.
I don't see what is so obvious - isn't one of the selling points of Itanium its backward i386 compatibility? Even if running the 64-bit version of Linux it should still be possible to switch the processor into i386-compatible mode to execute some 386 opcodes and then back again. After all, the claim is that old Linux/i386 binaries will continue to work. Or is there some factor that means the choice of 32 bit vs 64 bit code must be made process-by-process?
Interesting question: which would run faster, hand-optimized i386 code running under emulation on an Itanium, or native IA-64 code produced by gcc? They say that writing a decent IA-64 compiler is difficult, and I'm sure Intel has put a lot of work into making the backwards compatibility perform at a reasonable speed (if not quite as fast as a P4 at the same clock).
-- Ed Avis ed@membled.com
"Intel can't stick with IA64 now that AMD is rolling out their 64bit chips. They'd just fall too far behind the curve."
Yeah, I mean it's not like Intel knows how to develop chips or stay in business or anything.
"Derp de derp."
The examples he gives for usage of null pointers are both wrong. When a null pointer (whether written as 0 or NULL) is passed to a varargs function, it should be cast to a pointer of the appropriate type. See the comp.lang.c faq for details. The relevant questions are 5.4 and 5.6. But feel free to read them all!
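A minimal sketch of the point about varargs. The function count_args is hypothetical; it walks its string arguments until it finds a null-pointer sentinel:

```c
#include <stdarg.h>
#include <stddef.h>

/* Count string arguments up to (not including) a null-pointer sentinel.
 * The callee reads each variadic argument as a char *, so the caller
 * must really pass a char *: a bare 0 is an int, and a bare NULL may be
 * one too, which can be narrower than a pointer on a 64-bit target. */
static int count_args(const char *first, ...)
{
    va_list ap;
    int n = 0;
    const char *s = first;

    va_start(ap, first);
    while (s != NULL) {
        n++;
        s = va_arg(ap, char *);
    }
    va_end(ap);
    return n;
}
```

A correct call looks like count_args("a", "b", (char *)NULL). Dropping the cast may appear to work on IA32 and then fail on a platform where int and pointers differ in size.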
A while ago, I tried compiling and running my program (http://freespeech.sourceforge.net/overflow.html) on a Linux PPC machine and (to my surprise) everything went fine. Does that mean that it should work on ia64 too since (AFAIK) both are big-endian 64-bit architectures?
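Rather than assuming a byte order, a program can check it at run time. A minimal sketch; the function name is made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Return 1 if this process stores the least significant byte of a
 * multi-byte integer first (little-endian), 0 otherwise. */
static int is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;

    /* Look at the byte stored at the lowest address of probe. */
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

Passing on one big-endian machine says little about word size, though: pointer and long width have to be checked separately.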
Opus: the Swiss army knife of audio codec
Look for Sun and/or IBM to be selling 8-way Hammer machines by this time next year, according to my Spirit Guides.
"Only in their dreams can men truly be free 'twas always thus, and always thus will be."
--Tom Schulman
There's really not that much demand for any assembly in the industry at large. Even microcode is being done in high-level languages these days. I would wager that most of the people doing assembly coding now are in highly specialized fields, especially embedded programming. So, there isn't necessarily any more demand for x86 assembly programmers than for any other (possibly non-standard) architecture. In my opinion (and this is only opinion), while you should learn an assembly language in school to understand the basic building blocks, the choice of architecture isn't crucial. However, since it's not crucial to learn one or the other, I think they should stick with a simple one. x86 is kind of a mess; MIPS was easy to learn. As far as access to the hardware goes, there are simulators for most processors, which is sufficient for education.
-J
When I started messing about with computers, 8 bit chips were standard on the desktop and 4 bit in the embedded sphere.
Within four years 16 bit was the emerging standard for the desktop, and four years after that 32 bit was emerging.
In the 12 years since then, well...
32 bit rules in both the desktop world and in the embedded world. Can someone tell me why we aren't on 128 bit chips or more by now? Why do 64 bit chips not make it - is this a problem of the physics of mobos or what?
nvidia already has drivers out for Linux/IA64 with some of their higher end cards (quadro line).
Check out ioquake3.org for a great, free, First-Person Shooter engine!
There are two reasons:
1/ The massive amount of FP state in IA-64 (128 FP registers). So the Linux kernel is compiled in such a way that only some FP registers can be used by the compiler. This means that on kernel entry and exit, only those FP registers need to be saved/restored. Also, by software convention, these FP registers are "scratch" (modified by a call), so the kernel need not save/restore them on a system call (which is seen as a call by the user code).
2/ The "software assist" for some FP operations. For instance, the FP divide and square root are not completely implemented in hardware (it's actually dependent on the particular IA-64 implementation, so future chips may implement it). For corner cases such as overflow, underflow, infinities, etc., the processor traps ("floating-point software assist" or FPSWA trap). The IA-64 Linux kernel designers decided to not support FPSWA from the kernel itself, which means that you can't do a FP divide in the kernel. I suspect this is what is more problematic for the application in question (load balancer doing FP computations, probably has some divides in there...)
XL: Programming in the large
-- Did you try Tao3D? http://tao3d.sourceforge.net
And forget about the problem!
Mats
It's not specific to IA64 or Linux-- PPC and IA32 also work this way, and Windows does the same thing. You can get around it, possibly, by inlining some assembly which saves and restores the FP registers before and after you use them. You need to be careful that the kernel won't switch out of context or go back to userland while you're using FP registers--preemptive kernels make this much harder.
However, there really aren't many reasons why you would want to use FP in the kernel in the first place. Real-time data acquisition and signal processing is the only example that comes to mind, but you'd be better off using something like RTLinux in that case.
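One common way to avoid FP altogether in code like a load balancer is scaled integer (fixed-point) arithmetic. A minimal user-space sketch; the names and the 16-bit fraction are assumptions for illustration, not anything taken from the kernel or the application in question:

```c
#include <stdint.h>

#define FIX_SHIFT 16  /* assumed: 16 fractional bits */

/* Return num / den as a fixed-point value scaled by 2^FIX_SHIFT,
 * using only integer operations. den must be nonzero, and the true
 * ratio must be below 2^16 for the result to fit in 32 bits. */
static uint32_t fix_ratio(uint32_t num, uint32_t den)
{
    return (uint32_t)(((uint64_t)num << FIX_SHIFT) / den);
}

/* Convert a fixed-point ratio to a whole percentage (truncating). */
static uint32_t fix_to_percent(uint32_t ratio)
{
    return (uint32_t)(((uint64_t)ratio * 100u) >> FIX_SHIFT);
}
```

For example, fix_to_percent(fix_ratio(3, 4)) yields 75 without touching a floating-point register.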
I have a positive modifier on Troll. When I mod someone Troll their karma should go UP!
universe, would be:
printf( "%{pid}\n", pid );
printf( "%{uid_t}\n", getuid() );
etc.
The way it *does* work in the little universe where I am the king is:
This way is arguably better, because it's type safe, and easier on the users. Of course, since it's not Compatible With C, it will never be used by anybody.
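In plain C, where printf has no conversion for pid_t or uid_t, the usual workaround is to cast to the widest integer type and use the C99 %jd / %ju conversions. A minimal sketch; the helper names are made up, and it assumes a POSIX system where pid_t is signed and uid_t is treated as unsigned:

```c
#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>

/* Format a pid_t without guessing whether it is int or long. */
static int format_pid(char *buf, size_t len, pid_t pid)
{
    return snprintf(buf, len, "%jd", (intmax_t)pid);
}

/* Same idea for uid_t, treated here as an unsigned quantity. */
static int format_uid(char *buf, size_t len, uid_t uid)
{
    return snprintf(buf, len, "%ju", (uintmax_t)uid);
}
```

The cast keeps the format string and the argument in agreement regardless of how wide the typedef happens to be on a given port.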
I would wager that most of the people doing assembly coding now are in highly specialized fields, especially embedded programming.
As an embedded systems designer I can tell you that even here in the embedded world, x86 assembly is nowhere to be found, except maybe in the low-level init. Even there, though, it's used to get the environment ready for C and calls a C function to start all the real work, very much in the same manner as the Linux kernel source shows.
Assembly programming is everywhere in the embedded world, just not x86 or anything powerful enough to be able to use a C compiler. I routinely do large Microchip PIC systems entirely in assembler, but that's only because of one of two reasons: they're not suited for C (the 18Cxxx is a different story now), or I need every last word of program and data space.
First of all, IA-64 is now called IPF (Itanium Processor Family), although I've heard rumors that this is changing again, to a third name.
Although the initial acceptance of Itanium-based servers and workstations has been slow, there is little doubt that it will eventually succeed in becoming the next-generation platform.
Actually, as /. readers know, there have been some doubts. Itanium is 5 years late. Right now Itanium ranks lowest in SPEC numbers, and Itanium 2 (McKinley), while it addresses some of the problems, can't expect to compete with Hammer or Yamhill when it comes to integer code.
For tight floating-point loops, Itanium 2 is great -- 4 FP loads + 2 FMAs per clock. But on integer code with lots of unpredictable branches, the entire IPF architecture leaves a lot to be desired. Speculation and predication were supposed to address that, but it is very hard for compilers to exploit speculation, and predication does not address issues such as the limitations of static scheduling.
(Also, Itanium 2 removes any benefit that the SIMD instructions had on Itanium, because on Itanium 2, SIMD instructions such as FPMA are split and issued to both FPU ports, negating any performance benefit they had on Itanium. So while Itanium can perform 8 FP ops per clock with FPMA, Itanium 2 can only perform 4 FP ops per clock. This does not look good for the future of IPF implementations. But Itanium 2's bigger memory bandwidth is probably more important than SIMD instructions anyway. Itanium 2 is built more for servers, while Itanium is built more for workstations, which might benefit from SIMD MMU instructions, although the rest of Itanium, and its price/performance, make almost anything else better.)
Superscalar processors with dynamic scheduling are improving much faster than was expected during IPF's design (witness the P4 and AMD chips). So Itanium's static instruction scheduling design may be a liability more than an asset today. It puts considerable burden on the compiler.
The x86 emulation and stacked register windows take up a lot of real estate on the chip, which could be better used for something else.
The IA64 can be thought of as a traditional RISC CPU with an almost unlimited number of registers.
Nonsense!!! No CPU has unlimited registers. When writing code by hand or with a compiler, registers are a limited resource which are used up quickly.
And even though IPF has "stacked" general purpose registers which are windowed in a circular queue with a lazy backing store, these windows are of limited utility in real code. How many times does real code use subroutine calls which can take heavy advantage of register windows, before call branch penalties start to negate any benefit the windowing provides?
It's a great idea in theory, but windowing just adds to the complexity of the implementation, taking up real estate that could be better used elsewhere.
The IA64 has another very important property: It is both PA-RISC 8000 compatible and IA32 compatible. You can thus boot Linux/IA64, HP-UX 11.0, and Windows on an Itanium-powered box.
Absolutely false: PA-RISC emulation was dropped years ago, before the first implementation, although it was originally planned. Also, HP-UX 11.0, which is PA-RISC only, is not supported on IPF. Only HP-UX 11.20 and later are supported. HP-UX 11.22 is the first customer-visible release of HP-UX on IPF.
The endianism (bit ordering) is still "little," just like on the IA32, so you don't have to worry about that at all.
Misleading -- the endianism is still a part of the processor state (i.e. context-dependent). This means it can be both big and little endian, and can switch when an OS switches context. HP-UX, for example, is big-endian on IPF.
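Code that has to survive on both kinds of OS can avoid depending on the processor's byte order by assembling values with shifts. A minimal sketch, with made-up helper names, for a 32-bit value stored most significant byte first:

```c
#include <stdint.h>

/* Decode a 32-bit big-endian value from a byte buffer. The shifts
 * work on values, not memory layout, so the result is the same
 * whichever byte order the host is running in. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

/* Encode a 32-bit value into a byte buffer, most significant byte first. */
static void write_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}
```

Files and network packets written this way read back identically under a big-endian and a little-endian OS.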
The rest of the article had generic ANSI C programming tips which everyone knows already -- nothing specific to IPF.
Umm, we are talking about IA64 here... Have a look at the manual, you might then understand.
What SPEC needs to benchmark is SPECInt-per-$. Considering that commodity Athlons, Pentiums, Celerons and Durons handily beat the extremely expensive Itanic in a straight SPECInt benchmark, what's the advantage of the IA64 performing more efficiently per MHz?
It was very silly of Intel to graft a 386 unit onto the IA64 chip, that's for sure. Fast int ops are important for running databases. They are essential in supporting that 64-bit I/O.
That's been Intel's promise since they announced the chip project many, many, many years ago. They also promised that the chip would be inexpensive. It isn't very fast, and it isn't a good value compared to today's 32-bit commodity CPUs.
From what I've read, the Itanic scales in a way very similar to the Hammer -- 8 CPUs at a time, and if you want more then you have to run a pipe between each group of eight. Hammer claims a HyperTransport link between each set with a one cycle wait state (Intel simply calls theirs a pipe), but really, anything more than 8-way is still going to be the realm of POWER4, UltraSparc, etc. IMO. To tell the truth, the Itanic and the x86-64 will have very similar scalability, but the x86-64 is less than half the die size of the Itanic and better performing. Its NUMA setup gives greater throughput between multiple CPUs in an 8-way or less. It may be ugly on the inside, but both CPUs do about the same thing. And one will be faster and a whole lot cheaper. And don't forget AMD's 4-way chipset. The Taiwanese motherboard makers are going to be moving into that space with this chipset. Commoditization.
Well, just take a 32-bit commodity CPU and kludge it to 64 bits, gain about 25% speedup in doing so and SELL IT FOR AROUND $400 maximum and you will quickly see that the Itanic is sinking! Sure the x86 instruction set is lame, but that's the roll of the dice. If the Motorola 68000 had been chosen by IBM for the PC, we would be singing the same tune. I think the x86 instruction set will be around ad infinitum. Just like the accelerator pedal is on the right side, the clutch is on the left and the brake pedal is in the middle. Totally arbitrary, but it somehow stuck.
The Itanic wasn't a piece of crap 5 years ago, but it is obsolete today. Intel raves about its "266MHz" memory bus and its 66MHz 64-bit PCI support. You can get this in a commodity motherboard and two Athlon CPUs for around $600. You can get the Pentium 4 with a 133MHz x4 quad-pumped memory bus nowadays. The Itanic's parallel execution method is nice, but why did they wait till the CPU was released before they began making compilers that took advantage of this? Completely useless without the right tools (assuming decent tools can be made).
I'll agree that a clever programmer might save a few instructions here or there, but I'll argue that in the age of RISC, and especially EPIC, instruction count does not have any effect on performance.
With a modern machine running at 1 GHz, let's assume its throughput is close to 1 billion (10^9, I think these things are different in England) instructions per second.
So for any normal application (everything except weapon guidance, etc.) that runs for a few minutes, even if you can save 100 million (dynamic) instructions, you are not going to even notice. At that rate, 100 million instructions take about a tenth of a second. And just imagine how hard it is to eliminate 100 M dynamic instructions for a real, non-trivial program.
For IA-64, we expect a lot of performance to come from the fact that it can execute many instructions in parallel (that's what Intel is betting on).
It is much easier for a machine to find ILP, than for a human.
And I'm not claiming that there will not be a few cases where a programmer could write better assembly than the compiler, but that even an expert assembly programmer will get beat 99 out of 100 (at least) for IA-64.
As an aside, reorganizing data structures (usually to take advantage of the memory hierarchy) is a very hot research topic right now. Reorganizing algorithms, i.e. loop tiling, etc., has been studied for about 10 years, and is finally beginning to make its way into commercial compilers.
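For anyone who hasn't seen loop tiling, here is a minimal hand-written sketch of the idea on a matrix transpose. The function name and the tile size of 32 are assumptions for illustration; a good tile size depends on the cache being targeted:

```c
#include <stddef.h>

#define TILE 32  /* assumed tile edge, in elements */

/* Transpose an n x n row-major matrix src into dst, working on
 * TILE x TILE blocks so that the rows of src and the columns of dst
 * touched by one block are more likely to stay in cache together. */
static void transpose_tiled(const double *src, double *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the block at the matrix edge when n is not a
             * multiple of TILE. */
            size_t i_end = ii + TILE < n ? ii + TILE : n;
            size_t j_end = jj + TILE < n ? jj + TILE : n;

            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The result is identical to the plain two-loop transpose; only the order in which memory is visited changes, which is the kind of reorganization a compiler can in principle do automatically.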
It was easier to beat a compiler when it was just doing register allocation for 4 GP registers. Now, as the compilers are getting more and more advanced, it is much much harder to do better than them.