Intel's Itanium CPUs, Once a Play For 64-bit Servers And Desktops, Are Dead (arstechnica.com)
Reader WheezyJoe writes: Four new 9700-series Itanium CPUs will be the last and final Itaniums Intel will ship. For those who might have forgotten, Itanium and its IA-64 architecture was intended to be Intel's successor to 32-bit i386 architecture back in the early 2000's. Developed in conjunction with HP, IA-64 used a new architecture developed at HP that, while capable as a server platform, was not backward-compatible with i386 and required emulation to run i386-compiled software. With the release of AMD's Opteron in 2003 featuring their alternative, fully backward-compatible X86-64 architecture, interest in Itanium fell, and Intel eventually adopted AMD's technology for its own chips and X86-64 is now dominant today. In spite of this, Itanium continued to be made and sold for the server market, supported in part by an agreement with HP. With that deal expiring this year, these new Itaniums will be Intel's last.
Intel managed to kill of the Alpha and the PA-RISC with the Itanium. MIPS left the high end. SPARC and POWER looked quite endangered at one point.
All this was done while not being actually competitive at any point. You have to admire that, in an Evil Corp sort of way.
Finally! A year of moderation! Ready for 2019?
I think there's a bubble effect happening with a lot of people, they can only a limited viewpoint based on seeing a monoculture for so long. Hearing these sorts of questions is sort of like heaing someone ask why Chevrolet exists when we already have Ford. They've only ever seen and used a PC perhaps and only vaguely know of other types of computers, and certainly have no background in computer architecture.
Myself I have always mourned that Motorola never could increase the frequency of the MC680x0 beyond 66Mhz and keep up with Intel because that architecture was a real beauty to program in assembler.
The features that made the 68k so nice for assembly programming, like addressing multiple unaligned memory locations in a single instruction, are precisely what made it so difficult to speed up.
Imagine that you are a chip designer. You are implementing the silicon for "movl @a1, @a2". So you access the first byte at the address stored in a1, but you get a page fault. You trigger an interrupt, the OS swaps in the page, and then returns from the interrupt. Now, you must restart the instruction and fetch the other 3 bytes, but since the address is not aligned, they can be on a different page, so now you trigger another interrupt. The OS returns, and you can now fetch the 3 bytes. But wait a sec, what about the 1st byte? You can either go back and fetch it again, which might trigger yet another interrupt since that page may no longer be in memory, or you can have some extra "hidden" registers to hold that intermediate value (which can be one, two, or three bytes). So you deal with all that. FINALLY you have all 4 bytes. Whew. Now you need to do the SAME thing for the second operand, but that is even more complicated, because if the address straddles a page boundary you may end up with corrupted memory if one byte of the target is modified but the other bytes are not.
A single instruction can generate up to 4 pages faults (not including double faults). Now how do you deal with cache coherency on a multi-core system when a single operand can be partially read from or written to memory? Can you imagine all the silicon required to handle that?
Eliminating complex instructions like this is precisely what makes RISC fast. A RISC instruction can load from memory, or store in memory, but never both, and unaligned addresses are usually a fatal error. There is never any dangling state to deal with.
Instruction sets should be designed for compilers, not humans.
Is it just because they have the best fabs or designers or the most experience?
This is a big part of it. Intel has a huge market and can spread high NRE over millions of units. So they can devote a lot of engineering resources to each design iteration, and their fabs are fabulous (sorry).
- x86 has a pretty expressive instruction set which improves code density enough that more code can fit in cache than RISC-y architectures
This is only partly true. Original x86 code was dense, but extensions were tacked on that add prefixes to allow extended precision and extra registers. These made the code less dense.
Code has much better locality of reference than data, since a lot of time is spent in tight loops. So it actually isn't super important to have a really big code cache. The cache can be Harvard Architecture, so code and data are cached separately, and the code cache is read-only. Current x86 CPUs use separate "Harvard" cache for L1 (the one closest to the metal) and use a unified cache for L2 and L3.
Intel couldn't figure out how to make a compiler good enough
One of the early RISC principles espoused by Dave Patterson was that the chip and the compiler should be developed simultaneously. If the compiler guys need a key instruction, then the silicon guys add it to the design. If an instruction is rarely used by the compiler, then it should be eliminated and replaced in software with a slower (but infrequent) sequence.
A big mistake with Itanium, was that this didn't happen. The hardware was done first, and then the design was thrown over the wall to the compiler team. Instead there should have only been ONE team with compiler and hardware guys working side-by-side.
Another big mistake was that Intel tried to monopolize the compiler market. Only their own rather expensive compiler produced acceptable code. But at the time it was released, the server market was rapidly moving to Linux/FreeBSD and gcc was the FREE pre-installed default compiler. Not many people wanted to pay extra for a compiler. Intel would have been much better off if they teamed up with the FSF and worked hand-in-hand to have a gcc port working on day one, and using feedback from the gcc designers to influence the hardware design as well.
AMD did exactly this with their x86-64 design. They hired gcc developers and they worked closely with the hardware designers. When the chip was released, gcc was ready, and both Linux and FreeBSD were able to boot up and run on some of the very first systems.