Intel's Itanium Processor Explained
pippa writes: "There's a technical piece [at Sharky Extreme] on Intel's Itanium, which is a new processor family and architecture, designed by Intel and Hewlett Packard, with the future of high-end server and workstation computing in mind. EPIC processors are capable of addressing a 64-bit memory space. In comparison, 32-bit x86 processors access a relatively small 32-bit address space, or up to 4GB of memory."
Well.. for all you interested in the itanium's there is a not to bad press release at Intel's website here:p 083199.htm
/.org
http://www.intel.com/pressroom/archive/releases/s
And.. Some Architectural Designs / information here:
http://developer.intel.com/design/ia-64/
Slishdot.org..Splashdot.org..sdot.org..Ah damn it..
I've followed the IA-64 for a while. I would have to agree that it is an entirely new beast which will take longer to develop good compilers for. The x86-64 design I have just checked up on.
You have to think about what you want out of a 64 bit architecture. To me they are 64 bit addressing, and 64 bit data.
Both architectures are capable of 64 bit addressing as far as I can see (actual implementations will probably be limited e.g. the initial AMD chip will initially be 48 bits of virtual address space I believe). How each handles the moving those addresses around will be critical to performance.
The major differences are in the handling of 64 bit data. The AMD chip extends the existing size of the registers to 32 bits and adds another set of 8 registers. The Intel one on the other hands supplies *heaps* of registers. While this would put the intel chip way in front, the downside is loading & storing registers when you change stack frames (calling a function). The rotating register stack helps, but eventually nesting of procedures can result in register spill. In some ways, the IA64 resembles a stack machine, but that's a dirty word these days - perhaps the terminology was avoided for those reasons.
Having written compilers over the years, I can definitely say that the i386 has suffered a severe shortage of registers. It's not a lot better than the pdp-11 (8 registers, 1 of which was the PC). As a result of this shortage, most languages run like a dog on the x386. They were even worse with the x86 because registers were dedicated to specific activities. It's also probably a good reason why cache performance has always been critical to getting i386 working well.
Having looked at both, my money will probably be with the AMD solution as it is an incremental design, not a revolutionary change. This affects the amount of work required to port existing code bases (OS core and compilers) to 64 bits. For example, my own OS could probably be ported relatively quickly to x86-64 much more quickly than with IA64.
Eventual performance differences will probably depend on the languages implemented and the programming styles applied. In my opinion, 16 general purpose registers is probably about as many that a good optimizing compiler would need for the typical C functions.
What both chips promise is the 64 bit addressing. This opens up a new realm for OS design because it allows disks and other structures to be mapped directly into the kernel's virtual address space. This is currently not possible with the current 4G limit because already storage devices are surpassing this limit. It is about time that CPU address space exceeded that of storage as it will allows for more elegant solutions to caching, disk management and swapping.
In the long run though, a new architecture is needed. Computing is likely change signiicantly in the next 10 years with the development of AI and better ways of using computer power. Given this, the IA64 might be the one which wins out in the long run because of the totally different view of execution. It does however assume that we finally make a break from the curse of legacy computing.
Personally, I am sad to see the stack architectures like the b6000/7000 series from Burroughs (now Unisys) die. They were incredible marvels of computer engineering that were at least a decade ahead of the register architecture machines. I especially liked the concept of tagged data which enabled the software to do rather marvellous things. Just the kind of machine that could run Java quite well. It is rather curious to see the trend from highly CISC machines to progressively more RISC machines, with the burden being placed more heavily on good compiler design. Consistent with this approach, IA64 looks to be a machine that will be tightly bound to specific compiler optimization techniques, although this bothers me a little because very likely those with access to the best compilers will be the ones who get the best performance out of these beasts. Compaq because of the inherited Digital resources would have access to some of the best compiler technology on the planet. It is widely recognized that the original Bliss compiler was state of the art by miles when Digital developed it in the 70's.
Comment removed based on user account deletion
Real cache memory?
Dolla' dolla' bills, y'all!
Will the real Bruce Perens Please Stand Up
MS article on IA-64 Windows.
The Register article on MS dithering on Hammer support.
Whew! I knew I saw all this crap, I just had to remember where!
--
A feeling of having made the same mistake before: Deja Foobar
This means no more pipeline flushes for missed branch prediction. None.
Ok, perhaps I need to re-read my textbooks, but I seem to have missed the part about branch mispredictions requiring a full pipeline flush. As far as I can tell, all that would actually happen is the speculated instructions being invalidated in-flight, with other instructions proceeding as normal. You still get a delay - it's the equivalent of a stall of as many cycles as it took to figure out which way the branch really went - but certainly not a full flush.
Is there some mechanism that I don't know about at work here, or have Sharky et. al. just turned "stall" into "flush" because of miscommunication?
IIRC correctly this may be the major downfall of the Itanium. The Itanium uses some sort of preprocessor to translate x86 instructions to the EPIC instructions the chip actually uses. It performs some optimizations as it does this to parallelize these instructions as much as possible to increase speed. Still this means the chip will have the same sorts of problems as the Pentium Pros did, they will run significantly slower on older 32bit software.
IIRC, AMD on the other hand will be bringing out a chip which is essentially 2 32bit athlon cores stuck together and linked to produce a 64bit processor. It essentially needs no translator and runs 32 bit and 64 bit equally well. This coupled with the fact that Itanium has been going nowhere slow has me looking toward AMD for a good 64bit solution.
So far I've gotten all my Karma from telling people they are wrong... :)
What we're talking about here, is that Intel is not in the leader role anymore. They are either neck and neck with AMD or slightly behind. This has caused Intel to react, and we've seen them doing a few face-plants by this course: Rollout-recall 1.13GHz P3, roll out P4 with massive heatsink sans multiprocessor capacity with Rambus only memory. The next P4 will be the smaller 0.13 micron process, with multi-processor capacity and probably require much less cooling. In effect, what you're getting at Best Buy right now is their Beta version.
Only two things could explain Intel foisting this dead end initial P4 on the market: (1) PR - recovering the speed crown (doubtful it's just this) (2) Use initial sales to pay off R&D costs.
Interstingly a P4 Xeon will be available by Q2, yet the 0.13 micron process be up until Q3.
Skiing term, when one falls face first. Such activity requires ski buddies to "dust" him or her. Cries of, "Dust Him! Dust Him! Ski over his face!" are optional.
--
A feeling of having made the same mistake before: Deja Foobar
Register windows and rotating registers are two different things entirely. The former is used for context switching between functions. The latter is for hardware-assisted, software-controlled register renaming in software pipelined loops.
Register windows slide up and down, providing a (theoretically infinite) stack in the register file. Each positioning of the window provides a "context", which represents the set of registers provided to a function at the function-call boundary. The chip implements a fixed number of contexts, and if you exceed the sliding window in one direction or the other, you take a fault and the fault handler slides the context for you. Presumably, you stay within the chip's implemented contexts most of the time and avoid faults. Such a technique saves you from having to push/pop as many registers around function calls.
Rotating registers work in a modulo fashion, with N registers (configured by the user on IA-64, as I recall) rotate every time a special branch is taken. (On IA-64, they have a "software pipeline branch" which triggers register rotation.) That's a completely different purpose.
A separate facility that IA-64 provides is a set of rotating predicates that can be used to provide "stage predicates." This gives you a mechanism for generating prologs and epilogs from a software pipeline kernel. This is the "avoiding bloat" bit you referred to. While this is still part of the rotating registers, it deserves special mention because it's a distinct use from the other uses of rotating registers that I've discussed elsewhere on this article.
And as for bitwise rotation, the IA-64 does provide that with the shrp instruction. You just provide the same argument to both halves of the pair.
--Joe--
Program Intellivision!
Program Intellivision!
--
A feeling of having made the same mistake before: Deja Foobar
Unfortunately, it seems that this post triggered an abuse of this system which quickly brought upon the collapse of the galaxy's economic system. Guess I won't be getting one, damned paradoxes.
Well, it seems Sharky glossed right over this one. They don't seem to get what rotating registers are for. They just make some vague statement about them working well for streaming things or something. *sigh*
One of the chief techniques that VLIW (and EPIC) processors will use to extract parallelism from looping code is Software Pipelining. This technique extracts parallelism across multiple loop iterations by scheduling them in parallel. The most popular form of software pipelining, Modulo Scheduling, offsets the loop iterations by a fixed interval known as the initiation interval.
The minimum possible initiation interval for a software pipelined loop is limited by two factors: The resource bound for the loop, and the recurrence bound for the loop. The resource bound is determined by counting up all the resources the loop uses and finding the minimum # of cycles (ignoring dependences) that you could pack everything into. The recurrence bound is a little trickier.
The recurrence bound is the bound imposed by loop-carried dependences in the loop. That is -- dependences that feed from one iteration of the loop into future iterations. For instance, in the following loop, there's a dependence from the result written to "z" on one iteration to the calculation of "x" on the next:
{
-
x = z ^ 3;
}y = x + 42;
z = y * 69;
On an architecture with infinite resources, this loop is still recurrence bound by the path from x to y to z, back to x. So, what does this have to do with rotating registers?
Well, so far, I've just described flow dependences. If you pick up a copy ofHennessy and Patterson's Computer Architecture: A Quantitative Approach , you'll see that this corresponds to "Read after Write" hazards -- meaning a later instruction reads a result written by an earlier instruction. There are two other sorts of hazards to watch out for: Write-After-Write (two instructions writing to the same place have to write in order), and Write-After-Read (a later instruction might clobber a value read by the current instruction).
Write-After-Read hazards are particularly interesting in the case of software pipelined loops. First, some terminology: a value is live from its earliest definition to its last use. In the example above, x is live from the first statement until the second within the body of the loop. In a given loop, a value may be live for quite a long time. However, the initiation interval for the loop might be quite short. This can lead to problems, such as violated Write-After-Read hazards.
Suppose we have the following code:
{
-
b = a[i];
}c = b + t;
d = c + u;
e = d + v;
g[i] = e + b;
Suppose we can fit all of this into a single cycle loop on our hardware because we can do four ADDs in parallel, plus the load and the store. Notice that the instructions in the middle are just dependent on each other, and on constants that are initialized outside the loop. Notice that the final instruction uses the second-to-last ADD's result as well as the value we loaded initially.
If we try to put this into a single-cycle loop, we'll have a problem, because we'll load multiple values into b before we even get to the calculation which finds g[i]. Oops. This is because the b = a[i] from a future iteration has moved up above an instruction from the current iteration which reads b--that is, we've violated a Write-After-Read hazard. In software-pipelining parlance, this is a "live-too-long" problem. The value of b is live across multiple iterations.
In a device without rotating registers, you solve this problem by manually copying b to temporary registers. In C code, this might look like so:
{
-
b = a[i];
}b1 = b;
b2 = b1;
b3 = b2;
c = b + t;
d = c + u;
e = d + v;
g[i] = e + b3;
Fine, except that can increase codesize, and in some cases impact performance. (It is, however, the technique of choice on processors that implement a minimum of hardware, so as to save power and cost.) Rotating registers alieviate this by performing these copies implicitly whenever the loop branch is taken.
So there you have it. That's the scoop behind rotating register files.
--Joe--
Program Intellivision!
Program Intellivision!
Probably a lot sooner if Intel start giving VTune to developers for free. Which makes sense if they want to sell CPUs in competition with AMD.
Umm, exactly what 64-bit AMD chip were you talking about again?
-atrowe: Card-carrying Mensa member. I have no toleranse for stupidity.
Isn't that even more than a playstation 2?
Tarsnap: Online backups for the truly paranoid
Actually, if you read this closely, and have read all the OTHER "stuff" put out by Intel on this processor, you'll recognize this article as being REALLY close to one published about five years ago, BY INTEL !! Plus, it's "technical" if your in marketing, but not really technical at all, if your job is on the tech side of life ! I am disheartened that I can find such tripe on Slashdot.
P.S.: The IA64 is VLIW, don't let the pukes who MARKET the thing FOOL you. EPIC == VLIW.
So, it's a solution again? And what problems come with it, then?
Chips from the PPro onwards have support for addressing up to 64GB of memory, even tho a single process can only address 4GB. This is useful for programs like databases which can run multiple instances to split work up. The extensions are called "PAE" if you want to go browse the source code.
Just like Intel wasn't targetting the home market with the P4, just before we read about Best Buy recalls P4 systems. Ok, so Intel lied and they really are trying to sell the P4, not next year, but immediately upon rollout to Jane and Joe Consumer. Don't believe for a half a second that by the end of next year you won't see Itanium systems on sale at Best Buy. Intel has clearly changed their strategy, illustrated by numerous P4 price cuts prior to Nov. 20 formal announcement. They can just as easily readjust their pricing for Itaniums.
If those lousy imitators over at AMD are selling 64 bit processors on ASUS motherboards with VIA chipsets and DDR SDRAM through Best Buy, you can damn well bet Intel won't sit quietly, particularly after their P3 1.13Ghz fiasco trying not to be shown up by that lousy imitator.
--
A feeling of having made the same mistake before: Deja Foobar
What will dictate the success is whichever is more cost effective (read: Cheap) to consumers and purchasing agents. If AMD is dominating the shelves at Best Buy, Circuit City, et al and Itaniums move like the P4 is, you can kinda see the writing on the wall. This is the brink and AMD and Intel are heading toward it, tune in next year and watch this *EXCITING* HiTech drama play out!
Popcorn mandatory, butter and salt optional.
--
A feeling of having made the same mistake before: Deja Foobar
The Pentium Pro and Pentium II both run 16-bit code "slower" (clock for clock) than their predecessors. All it takes is higher clocks and an end-of-life on the P4 and Itanium will take over handily just like the PPro and PII did.
If you have the option of 32-bit compatibility, it may not be worthwhile to migrate existing code to 64-bit. Converting code to 64-bit makes sense if you plan on using huge files or a huge address space. Converting to 64-bit also makes sense if you can utilize efficient 64-bit integer types or other 64-bit processor features and performance that would be otherwise unavailable. Keep in mind that there are also downsides to 64-bit programs that result from the increased program memory usage because many basic data types expand from 32-bit to 64-bit quantities. Also, you may need to test and support both a 32-bit and 64-bit version of your code when a single 32-bit version would work as well. For most existing X applications, unless porting to 64-bit is required, using 32-bit compatibility is an appropriate option. For libraries, the choice of whether to support 64-bit is based on the needs of the library customers. Since a 64-bit application may require various libraries, providing 64-bit library implementations is generally a good idea even if not currently needed.
A sarcastic term used to designate software and hardware products that have been announced and advertised but are not yet available.
-atrowe: Card-carrying Mensa member. I have no toleranse for stupidity.
Is the L1 cache subset redundant of L2. I assume L2 is subset redundant of the L3. Or is each cache independent? Anyone know. Sounds like this chip is going to starve on memory.
Someone you trust is one of us.
If Linux is the target platform, which chip is better? Compiling linux in 64bit seems straightforward enough. So wouldn't the irridium be the better choice, as its less backwards compatibile? The AMD seems more virsitile, but with linux, you can compile out the 32 bit problem. Both chips seem like they'll starve on too little cache sizes.
Someone you trust is one of us.
...means a limit of 18,446,744,073,709,551,616 bytes of memory (18,446,744 terabytes or 18,446 petabytes). Compare that with the pathetic 4,294,967,296 offered by 32 bit architectures.
If you bought 128MB dimms it'd still take 144,115,188,075 dimms to reach the maximum memory supported by a 64-bit architecture. A dimm is 5.25 inches long (yes, i really just measured one), so if you layed the number of dimms required end to end, you'd have a line of dimms 756,604,737,398 inches long, or 11,941,362 miles. That's 1/9th the distance to the sun.
That is a DAMN lot of ram.
This is somewhat true but not completely. Predication is a form of speculative execution, but is qualitatively different from the speculative execution that most CPUs do when they branch-predict. The problem is that these architectures don't really have a way to execute down both sides of a branch. To equal what predication provides, you'd actually need to be able to fetch down several code paths in parallel and know which instructions to discard. Icky. Predication allows that to happen in a single code path, because you can put both "if (cond)" and "if (!cond)" paths directly in parallel, or even better, "if (cond1)", "if (cond2)" ... "if (condN)".
Predication is very useful for eliminating short branches and flattening small switch-case statements into effectively straight-line code. It's a much, much, much more effective method for speculative execution than trying to fetch and execute down multiple code-paths.
--Joe--
Program Intellivision!
Program Intellivision!
From what I understand, the P4 is currently aimed at workstations and servers and not consumer PC's. Intel won't be advertising the P4 because it isn't targeted for the average consumer.
-atrowe: Card-carrying Mensa member. I have no toleranse for stupidity.
The obvious solution to the Quantum irregularity issue would be to add a thermal flux capacitor to the torque inversion matrix. This would require a slightly larger die for the CPU, but should allow for additional thermal stabilization. AMD has been doing this for several years now.
-atrowe: Card-carrying Mensa member. I have no toleranse for stupidity.
It remains to be seen if IA64 can really have that much physical memory.
While the capability may be there, I don't think it will happen before the chip is completely obsolete.
64bit address space > 16,777,216 Terabytes of memory.
It's a new Instruction Set Architecture. IA-64 has only ever been available on one chip, and that's Merced--I mean Itanium.
Really? I don't see why this is always the case. Can't I design a 32-bit EPIC processor? Or am I missing something here?
You totally got rotating registers wrong. ... They are not good because 'risc registers are small' they are good because you can make your instruction code much smaller, I.e. insted of having code that says R1=R1+R2,R2=R2+R3,R3=R3+R4.... You can say R1=R1+R4,rotate,repeat.
My appologies, I ran two points into one with that statement. The one point being rotating registers, the other being 256 registers, which is quite an increase for CISC/EPIC.
Kevin Fox
Kevin Fox
From what I understand, the P4 is currently aimed at workstations and servers and not consumer PC's. Intel won't be advertising the P4 because it isn't targeted for the average consumer.
I totally disagree. The Pentium IV does not have a SMP capable design, so the server and high-end workstation market is still being fed the P3. Also, clock cycle for clock cycle, it's slower than the P3, so it really isn't aimed at the lower-end workstation market. Right now, all it has going for it is some fancy MHz rating, which appeals to gamers and consumers that want the latest-greatest thing.
Doh!
- Itanium
- Ron
- Anganese
- Latinum
- Opper
- Ickle
- Admium
- Ilver
- Ercury
- Luminum
- Agnesium
- ...and finally... Old
I really like this naming scheme, and I'm looking forward to using these Innvoative Processors.The Itanium will probably sound like another beefed up Intel chip *yawn* without much to set it apart from the crowd. (We already have lots of 64bit chips right?)
Here are a few interesting tid bits which make the Itanium something different:
- Predication. You read this part right? This means no more pipeline flushes for missed branch prediction. None. This is a big saver. Although transmetas CPU's do this (to a limited extent) with their VLIW and OS, it is still wrong on occasion (i.e., not perfect branch prediction, which itanium will effectively provide)
- Rotating registers. Why are these great? Usually you only have a few registers with CISC architectures. RISC has quite a bit more, but they are much smaller and you end up using them as much as the less populous CISC registers. Having 256 registers with the ability to cycle them means you will be hitting the L1 cache even less. While the L1 is fast, it is still at least twice as slow as hitting a register directly. This is another big bonus
- L1, L2, and L3 cache all at CPU clock speed. Most L2/L3 caches are at half speed at best.
The other enhancements, more pipelines, more ALU's, etc, are all nice but nothing ground breaking. Together with the above additions they add up to impressive performance.
The only downside with all the features is the compilers. Most of the really cool optimizations will require a compiler smart enough to translate the code effectively to ake advantage of them.
It sounds like Intel wont have a top notch compiler for another few years at best, and who knows when the GNU compiler will support even a fraction of the features.
This will be a real downer, as gcc support for Alpha's, which have been around for years and years, is still far behind digital/compaq's alpha compiler.
I know people have posted about this earlier, but I thought I would ask this as a question,
Are there really a lot of people out there who need a processer that can deal with more then 4 gigs of RAM?
Is this a Windows NT things? Because even some pretty well used Linux\FreeBSD servers are running quite well on 486s with 32 megs of RAM and other ridiculously lowend hardware like that.
So, would anyone out there who is currently using more then even 256 Megs of RAM tell me?
Hopefully I didn't put any [] around my words.
Seeing as how I said kernel AND drivers, and we're finding more than 64 meg of memory mapping for video drivers today, I think it was a very conservative estimate. Note I'm referring to HW BIOS mappings, not Linux based drivers.
-Michael
-Michael
If you don't get it you are not a nerd and should immediately procede over to CNN where all the other cattle get their news!
--
A feeling of having made the same mistake before: Deja Foobar
A 4000 processor workstation? ;)
- - - - - - - - -
Actually, I just looked it up, and you're wrong. On Page 4-6 of IA-64 Application Developer's Architecture Guide, Rev 1.0, it says specifically, and I quote: (emphasis mine)
So there.
--Joe--
Program Intellivision!
Program Intellivision!
Oh, so now 4GB of RAM is considered "small"? What planet did you come from, mister "caviar-for-breakfast"?
"Ancillary does not mean you get to rule the world." --U.S. Circuit Judge Harry Edwards, speaking to the FCC's lawyer
For those of us who don't care to read 1200 pages about Itanium and EPIC. Intel sums it up here in a quick Itanium FAQ.
-gerbik
Not bitwise rotating a register.
This is already done in the SPARC ISA, which means that your register number is a now an offset in the "window register" (stack) not an absolute offset.
It has already been investigated numerous times, and apparently you don't gain that much for the SPARC implementation.
Now for the IA64, it enables some kind of automatic software pipelining which is really nifty (if a bit dificult to undestand).
The only thing I'm wondering is : at which price (how mant transistors, etc) does it come?
Still avoiding code bloat while having software pipelining is really neat!
With all the wild speculation going on around here, I thought it might be worth throwing some actual links in here to real information.
I haven't read all of these myself, but I have poured over the details that are most relevant to my work. :-)
Have fun.
--Joe--
Program Intellivision!
Program Intellivision!
Consider the x86 to be the old 68K code and you're Motorola. You WANT the old code to die..
In a previous slashdot article, Intel has tried to patent their EPIC code.. What this means is that IF, by some stretch of the imagination they pull off an industry swing to IA-64 processors, then they're all of a sudden the only game in town.
Beyond that, IA-64 is slated (at least initially) to be a no holds barred processor.. The Ferrari of cars, since they spared little or no expense on functional units or cache (at-speed 4Meg cache?? Especially after they recently determined that they couldn't rely on 3'rd party or external modules). They say 1 to 4,000 processors are to be in these new machines.. That doesn't sound like your Mother's Word processor running in there does it?
Once these bad boys hit 1.5Ghz (maybe 2-5 years from now), then the fact that your emulating Duke Nukem 3D is going to be irrelevant, just like the old non-recompiled 68K code is largely antiquated. Or more directly, just like we could care less that the PPro and it's descendants run 16bit code more slowly.
At the moment, if you're going to buy an IA-64, you have a SINGLE app that you're interested in. Namely a web server, a database server, or a CAD program. If you're in a UNIX system (as has been pointed out), then it's a trivial matter of getting the code to work anew (since either the code is freely available, or the vendor that gives you the box owns the original code). If you're in windows, you still only need worry about your AutoCad xxx, etc. So-what that explorer runs slower; 733MHZ isnt' going to let it run _that_ slowly.
The type of people that shell out $20,000 on a machine could care less what the architecture is.. And to some degree, they care little about the compatibility from box to box... They have their software, and as long as it works fastest on a given platform this year, they'll buy it and switch over (since the data will migrate). Think of it like a black box phone.. A chordless phone.. Do we fret over the fact that we've gone from 400MHZ to 900MHZ to 900-Dig to 2.4Gig Spread Spectrum? Each time switching vendors? None are compatible with each other, but they fullfill a single task well. It's a black box that fullfills a business service. Slower generic apps are simply the cost of doing business with it.
Beyond that, mainstream apps like this typically take up the entire computer and desktop; there is no room for other applications.. You would have a seperate machine if you wanted to do general purpose work-station operations. Perhaps the average Matlab user might be more pressed for all-around performance, but they're on a UNIX machine anyway.
And as for a 4,000 machine.... Well, I shouldn't have to mention how specialized a program is going to be for that anyway.
The point is that Intel wanted to compete against SUN and friends, which use a totally different business paradigm which is incompatible with most users (including value-basd work-stations).
The only danger, as I see it, will be the loss of "trickle down hardware". Where the state of the art today becomes the value PC tomorrow. I doubt that Intel will have _any_ incentive for making an IA-64 cost effective (since that would make it harder to justify the 3,000% premium they'll likely charge for the additional 100MHZ). Since we're locked out of the market (for at least 3 years), Intel will have to continue developing the x86 line for many years.
The problem will be that they can't bet the farm on the IA-64; they have to keep the PXXX on top. Yet, they're not going to re-engineer the x86 for 64 bits, since that would undermine IA-64. If all I need is 64bits, and I'm not worried about massive multi-CPU or even changing my compiler tool-set, then why should I choose other than AMD's x86-64 or the equivalent Pentium derivative.. Why pay the premium for IA-64? If AMD successfully converts their entire line to x86-64, and MS comes around and produces a compatible OS, there will be no compelling need for vendors to port to IA-64 (since there will be little compelling need for a user to buy outside of tradition). Yes they'll get performance... But it would be just like switching to SUN, with their proprietary supported hardware and software environment, AND more importantly, their smaller user-base. Should Sybase support yet another architecture for that 1-10% additional market? Oracle most definately, but maybe not a smaller App company.
The point is that IA-64 requires just as massive a change for software developers and users as if they were to switch to Alpha (especially since they run NT and emulate x86 code as well). What do I gain by choosing IA-64 as the platform? Reliability is offset by the newness of the system. Scalability can almost be better handled through clusters (and more cheaply at that). 64bit will become ubiquitous, and ironically, Intel will have the only 32bit processor left in the market (with the possible exception of older Macs).
All I'm saying is that Intel had best have a backup plan. There is one thing in Intels favor, however. Politics.. They still have the clout with the major industries, and they still can coerce MS to shift in their direction. MS has stopped to purposefully breaking compatibility with competition, and moreso, they are excelent and artificially creating demand for higher end system. With win-2k having an enterprise solution, it isn't too big a streatch of the imagination to concieve that there will be an IA-64 only varient that has necessary features that simply aren't offered on other platforms. Who knows, it might even be easier to develop these enerprise solutions on IA-64 than on legacy x86[-64].
Intel will not fail.. But they will not succeed on merit alone. (much like Rambus)
-Michael
-Michael
In other cases, I'd agree that legacy code performance would be a huge issue for a processor family aimed at the desktop. After all, there are so many thousands of apps that businesses and consumers rely on (some of which were written by companies that have long since died) that we couldn't possibly expect all of them to port to IA-64. Even worse, this might not be a simple recompile -- if you use any assembly or (more likely) if your code isn't 64-bit clean, you need to modify your code pretty carefully to support it.
But, luckily for them, Intel isn't targetting desktops. They're going after the very highest-end markets (especially with the first release) where users either own the code they're using (as with scientific/high performance computing) or where they rely on only one or two enterprise applications (look at the number of high-end boxes out there that basically just run Oracle, and the number of workstations that are used entirely for one CAD program). Intel just has to make sure that these key apps are really, really well-supported on IA-64 and their target customers will be happy. And they're basically paying companies to do this sort of porting (they have a $250 million IA-64 venture fund), so I have a lot of confidence that this'll work out for them.
It's also important to remember that enterprise products have a much longer purchasing cycle than consumer products. For any console system, the availability of games on Day 1 is crucial to the success of the whole system. But any reasonable enterprise can be expected to spend 9-18 months evaluating critical products before doing a serious roll-out, and that gives Intel a crucial buffer period in which to get the remaining ISVs on board.
The much tougher issue for them will be quality of the compilers themselves. The article alludes to the fact that IA-64 puts a LOT of burden on the compiler, but I think it even understates that fact. The standard gcc is woefully inadequate for this architecture, so Linux users have to hope that SGI's version comes through. Realistically, only HP (which has been working in VLIW experiments for years) can be counted on to have a good implementation ready from the launch of the chip.
--JRZ
No, because it will still be at lest half as fast as a register. That is exactley what Intel has done with the P4. Have you noticed the size of the L1 cache. It's 8k!!!. The Athlon has 128k! At first I thought Intel was employing a buch of doped up high school jocks, but it turns out that Intel has designed the P4 L1 cache to compensate for the IA-32's horrific lack of registers. But it still ain't gonna be as fast.
Numbers 31:17,18 Now kill all the boys. And kill every woman who has slept with a man,but save for yourselves every virg
Patience, patience! All in good time.
Seriously, I've read about this damn thing for years, and that they actually plan to roll the thing out I find nothing short of anticlamactic. By the way, has anyone seen a P4 commercial, yet? All I see are the Blue Men Group plugging the P3. Will there be a campaign for the Itanium? Big step that it would be, I think they have to.
Ignore that AMD immitator over there with the inexpensive, high performance SledgeHammer, which runs all your existing software! We have some real innovation for you, our track record proves...uh... well, trust us anyway, because Bill is doing Windows for us!
--
A feeling of having made the same mistake before: Deja Foobar
No, the schedulers needs to stay in because it is not possible to do all things in software. Things like register renaming and micro-op scheduling. The instruction set doesn't support it so you do it in hardware.
Actually, register renaming doesn't require out of order execution. All you're doing is renaming the second register in a write-after-write or write-after-read situation to a different internal register name.
You're right about micro-op scheduling, though. I was thinking about RISC processors, which already have more or less atomic instructions.
Part of the reason that schedulers on modern chips are so complex is that good compilers are rare. If the compiler produced optimally-ordered code, you could dispense with out-of-order execution and save a huge amount of silicon and effort. In practice, however, this kind of code is rare, so the scheduler stays in.
Remember the P4 vs. Athlon saga; it turned out that _both_ chips were running far below optimum performance due to sub-optimal compilation. Even without SSE2 enabled, Intel's compiler was able to produce a very large increase. Intel has a history of writing very good compilers, so it's possible that they'll be able to handle optimization for the Itanium consistently, but the vast majority of software developers don't shell out for the Intel compiler. Thus, most Itanium code will be sub-optimal.
This is the big problem with building really wide superscalar processors - it gets exponentially harder to extract parallelizable instructions from the serial program stream. The predication system helps Intel a lot here - by allowing them to pretend that they've predicted branches with certainty, thus optimizing them out and producing longer basic blocks - but it won't be magical. Beyond a certain point, which we're already starting to reach, it just stops being practical to try to issue more instructions in parallel from one instruction stream.
The caveat here is loops that repeat for a large number of iterations, known beforehand, without data dependencies between iterations. You can unroll these into reams of parallelizable instructions, and a large register set makes it much easier to do so. However, this turns out also to reach diminishing returns fairly quickly (play with -funroll-loops and -funroll-all-loops on a few test programs to see what I mean). Your processor bottlenecks on the (large) part of the program that isn't in an easily-scheduled tight loop.
Boy, will this increase context switch overhead. Part of the attraction to register renaming and a smaller visible register set is that you get much of the benefit of a larger register set without the context switching cost. Now, this can be taken too far (c.f. x86), but I suspect that 256 registers will be enough to substantially influence performance if you're doing something that involves switching a lot to perform relatively short tasks (like many kernel service calls, many driver calls, interrupts to transfer blocks of network data, etc.).
In short, I think that this processor tries to be too clever for its own good; my prediction is that it will burn lots of power executing both sides of branches, and run at far below peak issue rate due to poor compilers used by most of industry and the limited ILP that exists in the programs being run.
That having been said, there are a few things about this architecture that I _do_ like. Predication is one; speculating both sides of a branch requires a lot more silicon, but allows certain optimizations that just wouldn't be possible by any other means. The large visible register file is also nice for loop unrolling and software pipelining compiler optimizations, though it does cause overhead on systems with a lot of context switching.
My money's still on SMT processors (symmetrical multithreading; one core and one scheduler executing many instruction streams (threads or processes), which gives you more ILP for free, as well as free interleaving when needed to mask latencies).
So, would anyone out there who is currently using more then even 256 Megs of RAM tell me?
We have a machine with 128GB of RAM here. For many scientific apps you really do need that sort of capacity to deal with the size of the data sets used. If you're working with large databases in business or financial situations, I'd expect that much the same is true - you really want to be able to keep as much of your data in RAM as possible, and you really want to be able to perform complex manipulations.
And no, it's not just an NT thing. One of the more useful features of 2.4 is support for up to 64GB of RAM on IA32 systems. This is something that people want. There's more to life than the desktop, and there's more to servers than just throwing out static web content and processing mail.
I suppose any discussion of Intel will require the mention of AMD. While Intel has frequently admitted that this new chip will run non-native (i.e. not explictily compiled for it) code slower than current chips, AMD claims their 64-bit processor will actually run it faster through a smoother translation layer.
The question is, will developers jump on board and start recompiling? It's not as simple for other OS's as it is for Linux since the code is not available for you to do it personally.
If this chip actually runs code slower, and suffers poor backwards compatibility, what motivation is there for people to port to it? I can see specialized apps, but until Windows 2000 or other popular, but closed source Server operating systems and applications are ported, it's just an academic processor.
I guess we'll have to see if Intel can get the developers excited; but based on my purely anecdotal survey of developers in my group of friends, there isn't a lot of excitement about anything Intel does anymore, especially not this chip.
* mention of Windows 2000 as a server Operating System in no way endorses that as a Good Idea(tm)
----------------- "I have a bone to pick, and a few to break." - Refused -------------------