Intel's Itanium Processor Explained
pippa writes: "There's a technical piece [at Sharky Extreme] on Intel's Itanium, which is a new processor family and architecture, designed by Intel and Hewlett Packard, with the future of high-end server and workstation computing in mind. EPIC processors are capable of addressing a 64-bit memory space. In comparison, 32-bit x86 processors access a relatively small 32-bit address space, or up to 4GB of memory."
I've followed the IA-64 for a while. I would have to agree that it is an entirely new beast which will take longer to develop good compilers for. The x86-64 design I have just checked up on.
You have to think about what you want out of a 64 bit architecture. To me, the two things are 64 bit addressing and 64 bit data.
Both architectures are capable of 64 bit addressing as far as I can see (actual implementations will probably be limited, e.g. the initial AMD chip will have 48 bits of virtual address space, I believe). How each handles moving those addresses around will be critical to performance.
The major differences are in the handling of 64 bit data. The AMD chip extends the existing size of the registers to 64 bits and adds another set of 8 registers. The Intel one, on the other hand, supplies *heaps* of registers. While this would put the Intel chip way in front, the downside is loading & storing registers when you change stack frames (calling a function). The rotating register stack helps, but eventually nesting of procedures can result in register spill. In some ways, the IA64 resembles a stack machine, but that's a dirty word these days - perhaps the terminology was avoided for those reasons.
Having written compilers over the years, I can definitely say that the i386 has suffered a severe shortage of registers. It's not a lot better than the pdp-11 (8 registers, 1 of which was the PC). As a result of this shortage, most languages run like a dog on the i386. They were even worse with the x86 because registers were dedicated to specific activities. It's also probably a good reason why cache performance has always been critical to getting i386 working well.
Having looked at both, my money will probably be with the AMD solution as it is an incremental design, not a revolutionary change. This affects the amount of work required to port existing code bases (OS core and compilers) to 64 bits. For example, my own OS could probably be ported to x86-64 much more quickly than to IA64.
Eventual performance differences will probably depend on the languages implemented and the programming styles applied. In my opinion, 16 general purpose registers is probably about as many as a good optimizing compiler would need for typical C functions.
What both chips promise is the 64 bit addressing. This opens up a new realm for OS design because it allows disks and other structures to be mapped directly into the kernel's virtual address space. This is not possible with the current 4G limit because storage devices already surpass it. It is about time that CPU address space exceeded that of storage, as it will allow for more elegant solutions to caching, disk management and swapping.
In the long run though, a new architecture is needed. Computing is likely to change significantly in the next 10 years with the development of AI and better ways of using computer power. Given this, the IA64 might be the one which wins out in the long run because of the totally different view of execution. It does however assume that we finally make a break from the curse of legacy computing.
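To make the mapping idea concrete, here is a minimal user-space sketch, assuming a POSIX system with `mmap`. It maps a small scratch file rather than a disk, and the function name `read_through_mapping` is made up for illustration. The point is only that once the address space is large enough, storage can be read through ordinary pointers instead of explicit read calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: writes `text` to a scratch file, maps the file,
 * and copies the bytes back out through the mapping.
 * Returns 0 on success, -1 on failure. */
int read_through_mapping(const char *text, char *out, size_t out_size)
{
    size_t len = strlen(text);
    assert(len > 0 && len < out_size);

    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    if (fwrite(text, 1, len, f) != len || fflush(f) != 0) {
        fclose(f);
        return -1;
    }

    /* After this call the file contents appear as plain memory at `p`. */
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fileno(f), 0);
    if (p == MAP_FAILED) {
        fclose(f);
        return -1;
    }

    memcpy(out, p, len);   /* an ordinary memory read, no read() call */
    out[len] = '\0';

    munmap(p, len);
    fclose(f);
    return 0;
}
```

With a 32-bit address space the same trick stops working as soon as the file (or disk) is larger than the free virtual address range, which is the limit the paragraph above is complaining about.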
Personally, I am sad to see the stack architectures like the b6000/7000 series from Burroughs (now Unisys) die. They were incredible marvels of computer engineering that were at least a decade ahead of the register architecture machines. I especially liked the concept of tagged data which enabled the software to do rather marvellous things. Just the kind of machine that could run Java quite well. It is rather curious to see the trend from highly CISC machines to progressively more RISC machines, with the burden being placed more heavily on good compiler design. Consistent with this approach, IA64 looks to be a machine that will be tightly bound to specific compiler optimization techniques, although this bothers me a little because very likely those with access to the best compilers will be the ones who get the best performance out of these beasts. Compaq because of the inherited Digital resources would have access to some of the best compiler technology on the planet. It is widely recognized that the original Bliss compiler was state of the art by miles when Digital developed it in the 70's.
MS article on IA-64 Windows.
The Register article on MS dithering on Hammer support.
Whew! I knew I saw all this crap, I just had to remember where!
--
A feeling of having made the same mistake before: Deja Foobar
IIRC this may be the major downfall of the Itanium. The Itanium uses some sort of preprocessor to translate x86 instructions to the EPIC instructions the chip actually uses. It performs some optimizations as it does this to parallelize these instructions as much as possible to increase speed. Still, this means the chip will have the same sorts of problems as the Pentium Pros did: it will run significantly slower on older 32bit software.
IIRC, AMD on the other hand will be bringing out a chip which is essentially 2 32bit Athlon cores stuck together and linked to produce a 64bit processor. It essentially needs no translator and runs 32 bit and 64 bit equally well. This coupled with the fact that Itanium has been going nowhere slow has me looking toward AMD for a good 64bit solution.
So far I've gotten all my Karma from telling people they are wrong... :)
Well, it seems Sharky glossed right over this one. They don't seem to get what rotating registers are for. They just make some vague statement about them working well for streaming things or something. *sigh*
One of the chief techniques that VLIW (and EPIC) processors will use to extract parallelism from looping code is Software Pipelining. This technique extracts parallelism across multiple loop iterations by scheduling them in parallel. The most popular form of software pipelining, Modulo Scheduling, offsets the loop iterations by a fixed interval known as the initiation interval.
The minimum possible initiation interval for a software pipelined loop is limited by two factors: The resource bound for the loop, and the recurrence bound for the loop. The resource bound is determined by counting up all the resources the loop uses and finding the minimum # of cycles (ignoring dependences) that you could pack everything into. The recurrence bound is a little trickier.
The recurrence bound is the bound imposed by loop-carried dependences in the loop. That is -- dependences that feed from one iteration of the loop into future iterations. For instance, in the following loop, there's a dependence from the result written to "z" on one iteration to the calculation of "x" on the next:
    {
        x = z ^ 3;
        y = x + 42;
        z = y * 69;
    }
On an architecture with infinite resources, this loop is still recurrence bound by the path from x to y to z, back to x. So, what does this have to do with rotating registers?
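As a rough sketch of how the two bounds combine, the C below computes a minimum initiation interval as the larger of the resource bound and the recurrence bound. The function names are invented for illustration, the resource bound is simplified to a single class of functional unit, and the latencies in the usage note are assumptions rather than figures for any real chip.

```c
#include <assert.h>

/* Ceiling division for non-negative num and positive den. */
static int ceil_div(int num, int den)
{
    assert(num >= 0 && den > 0);
    return (num + den - 1) / den;
}

/* Resource bound: cycles needed to issue `ops` operations on `units`
 * identical functional units, ignoring all dependences. */
int resource_bound(int ops, int units)
{
    return ceil_div(ops, units);
}

/* Recurrence bound: total latency around a loop-carried dependence cycle,
 * divided by the number of iterations that cycle spans. */
int recurrence_bound(int cycle_latency, int iteration_distance)
{
    return ceil_div(cycle_latency, iteration_distance);
}

/* The initiation interval can be no smaller than either bound. */
int min_initiation_interval(int res_bound, int rec_bound)
{
    return res_bound > rec_bound ? res_bound : rec_bound;
}
```

For the x/y/z loop, suppose the XOR and the add each take 1 cycle and the multiply takes 3 (assumed numbers). The dependence cycle x to y to z and back to x has total latency 5 and spans 1 iteration, so `recurrence_bound(5, 1)` is 5. With three operations and, say, four units, `resource_bound(3, 4)` is 1, and the loop is recurrence bound at an interval of 5 no matter how much hardware you throw at it.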
Well, so far, I've just described flow dependences. If you pick up a copy of Hennessy and Patterson's Computer Architecture: A Quantitative Approach, you'll see that this corresponds to "Read after Write" hazards -- meaning a later instruction reads a result written by an earlier instruction. There are two other sorts of hazards to watch out for: Write-After-Write (two instructions writing to the same place have to write in order), and Write-After-Read (a later instruction might clobber a value read by the current instruction).
Write-After-Read hazards are particularly interesting in the case of software pipelined loops. First, some terminology: a value is live from its earliest definition to its last use. In the example above, x is live from the first statement until the second within the body of the loop. In a given loop, a value may be live for quite a long time. However, the initiation interval for the loop might be quite short. This can lead to problems, such as violated Write-After-Read hazards.
Suppose we have the following code:
    {
        b = a[i];
        c = b + t;
        d = c + u;
        e = d + v;
        g[i] = e + b;
    }
Suppose we can fit all of this into a single cycle loop on our hardware because we can do four ADDs in parallel, plus the load and the store. Notice that the instructions in the middle are just dependent on each other, and on constants that are initialized outside the loop. Notice that the final instruction uses the second-to-last ADD's result as well as the value we loaded initially.
If we try to put this into a single-cycle loop, we'll have a problem, because we'll load multiple values into b before we even get to the calculation which finds g[i]. Oops. This is because the b = a[i] from a future iteration has moved up above an instruction from the current iteration which reads b--that is, we've violated a Write-After-Read hazard. In software-pipelining parlance, this is a "live-too-long" problem. The value of b is live across multiple iterations.
In a device without rotating registers, you solve this problem by manually copying b to temporary registers. In C code, this might look like so:
    {
        b = a[i];
        b1 = b;
        b2 = b1;
        b3 = b2;
        c = b + t;
        d = c + u;
        e = d + v;
        g[i] = e + b3;
    }
Fine, except that can increase code size, and in some cases impact performance. (It is, however, the technique of choice on processors that implement a minimum of hardware, so as to save power and cost.) Rotating registers alleviate this by performing these copies implicitly whenever the loop branch is taken.
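For anyone who wants to poke at this, here is a small C simulation of the idea. It is a sketch under my own assumptions, not how any real compiler or the IA-64 hardware does it: each pass of the `for` loop stands for one cycle, statements are ordered consumer-first so every read sees what was written on the previous cycle, and the rotating file is faked with an array and a moving `base` index. All function names are invented.

```c
#include <assert.h>
#include <stddef.h>

enum { STAGES = 4, ROT = 8 };   /* pipeline depth; size of rotating file */

/* Plain sequential loop: the result both pipelined versions must match. */
void loop_reference(const int *a, int *g, size_t n, int t, int u, int v)
{
    for (size_t i = 0; i < n; i++) {
        int b = a[i];
        int c = b + t;
        int d = c + u;
        int e = d + v;
        g[i] = e + b;
    }
}

/* One-cycle pipelined loop with explicit copies b1..b3. */
void loop_with_copies(const int *a, int *g, size_t n, int t, int u, int v)
{
    int b = 0, b1 = 0, b2 = 0, b3 = 0, c = 0, d = 0, e = 0;
    for (size_t k = 0; k < n + STAGES; k++) {
        if (k >= STAGES) g[k - STAGES] = e + b3;
        e = d + v;  b3 = b2;
        d = c + u;  b2 = b1;
        c = b + t;  b1 = b;
        if (k < n) b = a[k];
    }
}

/* Same schedule, but the load always targets logical register 0 of a
 * rotating file. Moving `base` at the loop branch renames that value to
 * logical 1, 2, 3, 4 on later cycles, so no copy instructions are needed. */
void loop_with_rotation(const int *a, int *g, size_t n, int t, int u, int v)
{
    int rot[ROT] = {0};
    size_t base = 0;
    int c = 0, d = 0, e = 0;
    for (size_t k = 0; k < n + STAGES; k++) {
        if (k >= STAGES) g[k - STAGES] = e + rot[(base + STAGES) % ROT];
        e = d + v;
        d = c + u;
        c = rot[(base + 1) % ROT] + t;
        if (k < n) rot[base] = a[k];
        base = (base + ROT - 1) % ROT;   /* the "rotate" step */
    }
}

/* Checks that both pipelined variants reproduce the sequential result. */
void check_pipelines(void)
{
    const int a[6] = {3, 1, 4, 1, 5, 9};
    int want[6], got_copies[6], got_rot[6];
    loop_reference(a, want, 6, 10, 20, 30);
    loop_with_copies(a, got_copies, 6, 10, 20, 30);
    loop_with_rotation(a, got_rot, 6, 10, 20, 30);
    for (size_t i = 0; i < 6; i++) {
        assert(got_copies[i] == want[i]);
        assert(got_rot[i] == want[i]);
    }
}
```

Note that the copy version spends three extra moves per cycle on b1, b2 and b3, while the rotating version spends none; the only extra work is the index update, which stands in for what the hardware does as a side effect of the loop branch.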
So there you have it. That's the scoop behind rotating register files.
--Joe--
Program Intellivision!
Isn't that even more than a playstation 2?
Tarsnap: Online backups for the truly paranoid
What will dictate the success is whichever is more cost effective (read: Cheap) to consumers and purchasing agents. If AMD is dominating the shelves at Best Buy, Circuit City, et al and Itaniums move like the P4 is, you can kinda see the writing on the wall. This is the brink and AMD and Intel are heading toward it, tune in next year and watch this *EXCITING* HiTech drama play out!
Popcorn mandatory, butter and salt optional.
If you have the option of 32-bit compatibility, it may not be worthwhile to migrate existing code to 64-bit. Converting code to 64-bit makes sense if you plan on using huge files or a huge address space. Converting to 64-bit also makes sense if you can utilize efficient 64-bit integer types or other 64-bit processor features and performance that would be otherwise unavailable. Keep in mind that there are also downsides to 64-bit programs that result from the increased program memory usage because many basic data types expand from 32-bit to 64-bit quantities. Also, you may need to test and support both a 32-bit and 64-bit version of your code when a single 32-bit version would work as well. For most existing X applications, unless porting to 64-bit is required, using 32-bit compatibility is an appropriate option. For libraries, the choice of whether to support 64-bit is based on the needs of the library customers. Since a 64-bit application may require various libraries, providing 64-bit library implementations is generally a good idea even if not currently needed.
The obvious solution to the Quantum irregularity issue would be to add a thermal flux capacitor to the torque inversion matrix. This would require a slightly larger die for the CPU, but should allow for additional thermal stabilization. AMD has been doing this for several years now.
-atrowe: Card-carrying Mensa member. I have no toleranse for stupidity.
Kevin Fox
- Itanium
- Ron
- Anganese
- Latinum
- Opper
- Ickle
- Admium
- Ilver
- Ercury
- Luminum
- Agnesium
- ...and finally... Old
I really like this naming scheme, and I'm looking forward to using these Innovative Processors.
The Itanium will probably sound like another beefed up Intel chip *yawn* without much to set it apart from the crowd. (We already have lots of 64bit chips, right?)
Here are a few interesting tid bits which make the Itanium something different:
- Predication. You read this part, right? This means no more pipeline flushes for missed branch prediction. None. This is a big saver. Although Transmeta's CPUs do this (to a limited extent) with their VLIW and OS, it is still wrong on occasion (i.e., not perfect branch prediction, which Itanium will effectively provide).
- Rotating registers. Why are these great? Usually you only have a few registers with CISC architectures. RISC has quite a bit more, but they are much smaller and you end up using them as much as the less populous CISC registers. Having 256 registers with the ability to cycle them means you will be hitting the L1 cache even less. While the L1 is fast, it is still at least twice as slow as hitting a register directly. This is another big bonus.
- L1, L2, and L3 cache all at CPU clock speed. Most L2/L3 caches are at half speed at best.
The other enhancements, more pipelines, more ALU's, etc, are all nice but nothing ground breaking. Together with the above additions they add up to impressive performance.
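To show what predication buys in the simplest case, here is a hedged C sketch of a branch and its "if-converted" equivalent. This only imitates the idea in portable C with a mask; on IA-64 the predicate would live in a predicate register and guard the instructions directly. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form: the processor has to guess which arm will run. */
int32_t clamp_branchy(int32_t x, int32_t limit)
{
    if (x > limit)
        return limit;
    return x;
}

/* If-converted form: both candidate results exist, and a predicate
 * selects one of them, so there is no branch to mispredict. */
int32_t clamp_predicated(int32_t x, int32_t limit)
{
    int32_t p = (x > limit);      /* predicate: 1 or 0 */
    int32_t mask = -p;            /* all ones when p is set, else zero */
    return (limit & mask) | (x & ~mask);
}

/* Sanity check that the two forms agree over a small range. */
void check_clamp_forms(void)
{
    for (int32_t x = -8; x <= 8; x++)
        assert(clamp_branchy(x, 3) == clamp_predicated(x, 3));
}
```

The trade-off is that the predicated form always does the work of both arms, so it pays off for short, hard-to-predict branches rather than for long ones.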
The only downside with all the features is the compilers. Most of the really cool optimizations will require a compiler smart enough to translate the code effectively to take advantage of them.
It sounds like Intel won't have a top notch compiler for another few years at best, and who knows when the GNU compiler will support even a fraction of the features.
This will be a real downer, as gcc support for Alphas, which have been around for years and years, is still far behind Digital/Compaq's Alpha compiler.
If you don't get it you are not a nerd and should immediately proceed over to CNN where all the other cattle get their news!
For those of us who don't care to read 1200 pages about Itanium and EPIC, Intel sums it up here in a quick Itanium FAQ.
-gerbik
With all the wild speculation going on around here, I thought it might be worth throwing some actual links in here to real information.
I haven't read all of these myself, but I have pored over the details that are most relevant to my work. :-)
Have fun.
--Joe--
Program Intellivision!
In other cases, I'd agree that legacy code performance would be a huge issue for a processor family aimed at the desktop. After all, there are so many thousands of apps that businesses and consumers rely on (some of which were written by companies that have long since died) that we couldn't possibly expect all of them to port to IA-64. Even worse, this might not be a simple recompile -- if you use any assembly or (more likely) if your code isn't 64-bit clean, you need to modify your code pretty carefully to support it.
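A classic example of code that isn't 64-bit clean is stuffing a pointer into a 32-bit integer. The sketch below is illustrative only (function names are invented): the first helper is harmless on a 32-bit build but throws away the upper half of the address once pointers are 64 bits wide, while the second uses `uintptr_t` from `<stdint.h>`, which is defined to be wide enough to round-trip an object pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Not 64-bit clean: squeezes the pointer through 32 bits, so the upper
 * half of the address is lost where pointers are 64 bits wide. */
void *roundtrip_through_u32(void *p)
{
    uint32_t bits = (uint32_t)(uintptr_t)p;
    return (void *)(uintptr_t)bits;
}

/* 64-bit clean: uintptr_t can hold any object pointer on this platform. */
void *roundtrip_through_uintptr(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits;
}

/* Checks the clean version on a local object. */
void check_clean_roundtrip(void)
{
    int probe = 0;
    assert(roundtrip_through_uintptr(&probe) == (void *)&probe);
}
```

Whether `roundtrip_through_u32` happens to work depends on where the object lives in memory, which is exactly why this kind of bug survives a quick recompile and test.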
But, luckily for them, Intel isn't targeting desktops. They're going after the very highest-end markets (especially with the first release) where users either own the code they're using (as with scientific/high performance computing) or where they rely on only one or two enterprise applications (look at the number of high-end boxes out there that basically just run Oracle, and the number of workstations that are used entirely for one CAD program). Intel just has to make sure that these key apps are really, really well-supported on IA-64 and their target customers will be happy. And they're basically paying companies to do this sort of porting (they have a $250 million IA-64 venture fund), so I have a lot of confidence that this'll work out for them.
It's also important to remember that enterprise products have a much longer purchasing cycle than consumer products. For any console system, the availability of games on Day 1 is crucial to the success of the whole system. But any reasonable enterprise can be expected to spend 9-18 months evaluating critical products before doing a serious roll-out, and that gives Intel a crucial buffer period in which to get the remaining ISVs on board.
The much tougher issue for them will be quality of the compilers themselves. The article alludes to the fact that IA-64 puts a LOT of burden on the compiler, but I think it even understates that fact. The standard gcc is woefully inadequate for this architecture, so Linux users have to hope that SGI's version comes through. Realistically, only HP (which has been working in VLIW experiments for years) can be counted on to have a good implementation ready from the launch of the chip.
--JRZ
I suppose any discussion of Intel will require the mention of AMD. While Intel has frequently admitted that this new chip will run non-native (i.e. not explicitly compiled for it) code slower than current chips, AMD claims their 64-bit processor will actually run it faster through a smoother translation layer.
The question is, will developers jump on board and start recompiling? It's not as simple for other OS's as it is for Linux since the code is not available for you to do it personally.
If this chip actually runs code slower, and suffers poor backwards compatibility, what motivation is there for people to port to it? I can see specialized apps, but until Windows 2000 or other popular, but closed source Server operating systems and applications are ported, it's just an academic processor.
I guess we'll have to see if Intel can get the developers excited; but based on my purely anecdotal survey of developers in my group of friends, there isn't a lot of excitement about anything Intel does anymore, especially not this chip.
* mention of Windows 2000 as a server Operating System in no way endorses that as a Good Idea(tm)
----------------- "I have a bone to pick, and a few to break." - Refused -------------------