Linus Has Harsh Words For Itanium
Anonymous Coward writes "As a follow up to the earlier story "Intel: No Rush to 64-bit Desktop"... In words that Intel are likely to be far from happy with, the Finnish luminary has stuck the boot into Itanium. His responses to some questions on processor architecture are sure to be music to AMD's ears. Linus, in an Inquirer interview concludes: "Code size matters. Price matters. Real world matters. And ia-64... falls flat on its face on ALL of these."" Of course, Linus works for a chip maker ;)
This is from the Linux-Kernel mailing list, not an Inquirer interview. Here is the post.
Seeing as Intel was (still is?) a major backer of Red Hat, I'd imagine that Red Hat's kernel hackers have already ported it and will see to it that support for Itanium makes it into the release kernel.
The fact that Linus works for a chip maker doesn't really matter, because he doesn't develop the chips. He gets paid there to develop the Linux kernel.
"It's worth noting that Torvalds' employer, Transmeta, has licensed x86-64 so he is likely to have access to Hammer hardware." This sounds really interesting. Any ideas what it means?
Well, you kind of forgot how MS server 2003 doesn't have support for x86-64, but has support for Itanium II. Can't leave MS out of the fray ;)
But you can always download Debian for the ia-64 architecture for free...
If I recall correctly, the Crusoe processor is 128-bit. It is simply executing 32-bit code through "code morphing".
Mickey-mouse == poor quality, inconsistent
Outfit == organization, company.
Mandrake also has an ia64 port of their distribution. I just installed it a few weeks ago (version 8.1, I believe).
RedHat does as well, but their installer would lock up at the end of the install every time, with no errors in the install log. I installed Mandrake after I could not get RedHat to install.
This was on a first generation (lion) itanium.
If you had nuts on your chin, would they be chin nuts?
What the hell are you smoking? I want some.
Every RISC architecture with the exception of the sparc3 performs better. Especially IBM's power4 and the upcoming power5.
Also, there is more than speed when comparing architectures. Itanium is a terrible platform to write compilers for. A lot of optimizations which are traditionally done in the chip at runtime must instead be set by compiler options. Not all of it can be done efficiently like this.
Speedwise, Alpha is getting old now but is still the fastest chip around until the power5 comes out this fall. For coding and optimization, MIPS is the best CPU around.
Well, the only reason why the other registers aren't GP on x86 is that there are instructions that use them implicitly. If you don't care about these instructions you can use them as regular registers.
As an example the EDI register is used by the SCAS* instructions as a pointer to memory. If you don't care about the instructions that use this register like that you're free to do regular operations on the EDI register, it has no limitations on what you can do with it.
You're right to say that there are few registers though. Before I learned x86 I learned MIPS and there you got all the glory of 28+ GP registers. In the simple examples we did I never needed to push and pop from the stack.
Pedro Côrte-Real.
Code size matters because *cache* isn't cheap. Worse, you can't make L1 cache arbitrarily fast without slowing down your chip big time.
Number 2 (make cache bigger) is easier said than done, and works against number 1 (cost).
"It's overkill, of course. But you can never have too much overkill." - Anonymous Slashdot Coward
Check the latest SPEC CPU benchmarks. The Itanium2 has the fastest floating-point score and is no slouch in the integer tests either. It will improve. Linus will eat his words in a few years.
There are 32-bit x86 processors that can handle more than 32-bit memory addressing. The Intel Xeon processors come to mind (which can address up to 36 bits). The only catch is that the application and OS need to support windowing or PAE (Physical Address Extension) to allow use of more than 4GB of memory.
The only problem with windowing and use of PAE is that there is a long delay (from the processor's point of view) to shift the window compared to accessing something within that window. On the bright side, the delay is nowhere near as bad as having to go to virtual memory and paging files.
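To put rough numbers on the 32-bit versus 36-bit addressing mentioned above, here is a minimal Python sketch. The function name is made up for illustration, and the only assumption is that an n-bit address can reach 2^n bytes.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_gib(address_bits):
    """Return how many GiB an address of the given width can reach."""
    return 2 ** address_bits // GIB


flat_32 = addressable_gib(32)  # plain 32-bit addressing
pae_36 = addressable_gib(36)   # 36-bit physical addressing, as with PAE
print(flat_32, pae_36)
```

The four extra address bits multiply the reachable physical memory by 16, but any single window into it is still limited by the 32-bit virtual address.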
[AMD] Was recently considering leaving the CPU business altogether
Uh.. what? AMD can't leave the CPU business. That would leave them with.. Flash memory. We all know how much revenue that brings in for them.
You have any links to support this claim?
Dacels Jewelers can't be trusted.
AMD Was recently considering leaving the CPU business altogether.
Um when was that? The only thing I recall was a Slashdot article with a misleading headline...
The read from theinquirer.net is all wrong. The slashdot story line is also wrong. It does not state at all what it implies. Here is the link to what Linus actually wrote:
http://www.ussg.iu.edu/hypermail/linux/kernel/0302.2/1909.html
Now, I agree with Linus on the PPC MMU issue. Can anyone tell me what he means by "baroque instruction encoding"? I have been doing x86 and 68k assembler for a long time, I have never heard of this.
Enjoy,
It's just the normal noises in here.
What you're missing is that x86 chips have a ginormous amount of internal rename registers (128 in a P4). The bump to 16 *visible* registers in the Athlon-64 is to allow the compiler optimizer to give more information to the CPU about variable usage. I'm guessing that AMD found that more than 16 visible GPRs really didn't help the compiler's allocation routines any.
A deep unwavering belief is a sure sign you're missing something...
MIPS is behind Itanium in performance. HP-PA is behind Itanium in performance. SPARC3 is behind Itanium in performance. SPARC64V is behind Itanium in performance. Alpha has higher specint but lower specfp. Power has higher specint but lower specfp. Both major current IA32 processors have higher specint, but they are slaughtered on specfp.
That's without even mentioning TPC or Java benchmarks which make Itanium look just as good or better.
While the systems can use 4GB of RAM, applications can't have the entire 4GB. RAM must be split into two segments - OS and Apps - usually a 2/2 or 1/3 split.
The SPEC benchmarks are real-world. That's the point of them, and they've been used over the last 10 years to judge the real performance of a processor.
Don't forget that Itaniums are clocked far lower than P4's. The difference is that Intel doesn't plan on marketing 64bit chips to the consumer for a couple years, while AMD has their sights set earlier due to the expected lifespan on the Athlon-family and that their future is bet on 64bits.
I guess the main thing to note is that the P4 will be around for at least two years longer, whereas you can't say the same thing about the Athlon family, at least at the high end.
Also coming into the picture is that Apple may have 64 bit workstations in ~ a year.
I suppose I'm not too threatening, presently, but wait till I start Nautilus
ia64 is in the mainline kernel. At least Debian and Red Hat have released, stable distributions for it. Red Hat even sells support for it.
ia64 is "in there" as much as alpha and sparc, even if it isn't quite as well tested.
NT was built on the i860 first, then ported to the i386 arch. More accurately, MS engineers emulated the i860 until the chip was ready.
MS did this to make their new OS more or less platform independent. They didn't want to get 'stuck' on the x86.
Slashdot story here . Article here
Huh?
The second highest rated TPC box in the world is running Itaniums...
http://www.tpc.org/tpcc/results/tpcc_perf_results.asp?resulttype=noncluster
Guess what? CAD users have been the driving force in high-end workstations for quite a while now. Current machines still aren't sufficient to do near-photorealistic design in real time, and they won't be anytime soon. Until then, this niche market (if I'm an anomaly, why is Autodesk such a huge developer and Microsoft's biggest supporter?) will continue demanding better.
I'm out of my mind right now, but feel free to leave a message.....
Hrm, I'm running at least 2 gigs on a workstation (4 512 meg sticks) and as much as I can cram into a server (PIII servers take 1 gig chips that are cheap enough). C'mon, a modern video card has 128 megs of RAM on it. With the exception of Rambus, RAM is comparatively cheap. Run under Linux, RAM can be one of the largest speedups out there. I run about a 1 gig used memory heap on my workstation and another 2 gigs that ends up being drive cache for a mid-size SCSI RAID, and that cache makes all the difference in the long run.
No sir I dont like it.
"lay waste" means precisely the opposite of "lie in ruins" -- it means "inflict devastation" / "inflict devastation upon".
The second rendering is because you can use the phrase transitively: "Caesar lay waste the village". (Not lay waste upon or lay waste to.)
I just thought I'd point that out. Your sentence as quoted doesn't make sense.
I know it's not very nerd-like to say that Linus is wrong and that AMD sucks, but in the case of the Itanium, that is exactly how I feel. Intel/HP's Itanium architecture is perhaps the most advanced processor to hit the market and has tremendous potential (from a Computer Architecture point of view). Because it's so new, its performance will be awful, but it shall improve with time. Anyone remember the SuperSparc? It performed horribly and was soon replaced by the UltraSparc. So will the Itanium II replace the Itanium.
As for the emulation/legacy code argument, I say screw it. gcc is already ported to IA-64. And as a Linux user, most of my favorite open source programs can be ported with little difficulty.
They have so many pins because it is not a single cpu. It is an MCM (multi-chip-module). Each Processor "Brick" contains up to 8 CPU cores.
Current draw is around 250-300A for an MCM. A lot? Hell yeah, but your average Athlon XP pulls about 35A. 8 x 35 = 280.
So, not so big a difference.
Early on the chief advantage of the approach was that you could use the freed silicon for things like extra registers, and that's exactly the approach taken by Acorn (now ARM) and the PowerPC range. Would you prefer to have eight registers and a single byte copy-block instruction, or 64 registers and have to replace that copy-block instruction with (*gasp*) three simpler instructions?
(Actually, I guess that depends on how good your cache is. There's no such thing as a free lunch)
You are not alone. This is not normal. None of this is normal.
64-bit code needs 8 bytes to hold every pointer. This will serve to eat up more cache and memory bandwidth, which are already major bottlenecks for any CPU.
The only thing this eats up is cache; because the system has a correspondingly wider data bus, there isn't a hit in memory bandwidth (unless the designers are trying to be cheap bastards and give a 64-bit CPU the same data bus width you'd use for a 32-bit CPU). And most 64-bit CPUs have a lot of cache.
And as for what kind of applications you potentially need several gigabytes worth of memory for, there's scientific processing and the like.
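The cache cost of wider pointers discussed above can be sketched with a little arithmetic. This is only an illustration under stated assumptions: a hypothetical linked-list node holding one 4-byte integer and one "next" pointer, padded to the pointer's alignment, and a made-up 16 KiB data cache. None of these figures come from a real chip.

```python
def node_size(pointer_bytes, payload_bytes=4):
    """Size of a hypothetical list node: payload plus one pointer,
    rounded up to a multiple of the pointer size (assumed alignment)."""
    raw = payload_bytes + pointer_bytes
    return (raw + pointer_bytes - 1) // pointer_bytes * pointer_bytes


def nodes_per_cache(cache_bytes, pointer_bytes):
    """How many such nodes fit in a cache of the given size."""
    return cache_bytes // node_size(pointer_bytes)


CACHE_BYTES = 16 * 1024  # assumed cache size, for illustration only

for pointer_bytes in (4, 8):
    print(pointer_bytes, node_size(pointer_bytes),
          nodes_per_cache(CACHE_BYTES, pointer_bytes))
```

For this particular node layout the 64-bit version is twice as large, so the same cache holds half as many nodes. Data that is mostly integers or floats rather than pointers would see a much smaller effect.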
Ever since the 8086/8088 duo, the bus width of a CPU has been decoupled from its word size. For a long time, the external bus width of (non Rambus) 32-bit CPUs has been wider than 32 bits. This works because the memory unit fetches entire cache lines. The CPU designers could be less cheap bastards today and bring out 32-bit CPUs with 256-bit wide busses if they wanted to.
And most 64-bit CPUs have a lot of cache.
You could put a lot of cache in a 32-bit CPU. You could put a small cache in a 64-bit CPU. In fact, the biggest difference between high-end and low-end CPUs is just the size of their caches.
To be fair, the current Itanium has an enormous cache that uses the vast majority of the die size and dictates its price and power consumption. Its logic core really isn't that big. If you embedded an X86 core in all of that cache, you'd get a very fast chip. If you teamed up an Itanium core with a Celeron cache, you'd get Celeron-level performance. 64 bits has little to do with it; you're mostly paying for cache and bandwidth when you buy high end CPUs.
Torvalds wrote that Intel had made the same mistakes "that everybody else did 15 years ago"
when RISC architecture was first appearing.
RISC first showed up on the commercial radar screen almost twenty years ago when MIPS Computer Systems
was formed. But people at Stanford (and Berkeley, IIRC) had been publishing papers about
RISC for four or five years before that, and people at IBM were working on it even before that.
And the CDC 6600 was a RISC machine in the 1960s. If you don't believe me, ask Cray's Chief Scientist Burton Smith.
In seeking the unattainable, simplicity only gets in the way. -- Alan Perlis
No electrons were harmed creating this post, though some may have been subjected to electrical and/or magnetic fields.
I attended an information session by someone from AMD at UCB. It was my understanding from his presentation that the tricks they were using to get up to 16 registers without compromising the ability to run existing 32-bit code made it impossible to get past 16 registers.
They would've liked to have 32 registers, but it simply couldn't be done in a backward-compatible way.
If you want more information on this, and more than a guess, AMD has much information up on its website.
Who is Itanium good for? Who is G4 or Power4 good for? What is X86 good for?
That's like asking what is a saw, hammer and screwdriver good for...they each have an application.
All these architectures have their good points and bad points. I've written sparc and x86 assembler and I can't say that they are better or worse than each other....just different.
At this point the hardware is MOOT. Unless algorithms get significantly better soon, the hardware won't matter. Sure, we'll get mega memory address space with any 64-bit architecture, but what does that get you? More memory address space? Big deal...so you've got big memory space...that won't make NP=P any time soon.
-ted
The pentium architecture has been loading 64 bits of memory at a time since the PII. They have to because that is the only way the RAM has a chance in hell of keeping up with the processor. Basically they load 2 instructions at once, and have them execute at double the speed of the RAM. (That's also part of why you get such a kick in the pants when you optimize with the -mcpu=i686 flag in gcc.)
"Learning is not compulsory... neither is survival."
--Dr.W.Edwards Deming
It looks like all the big Linux distributions have gotten together to support IA-64 Linux development. This was the first hit for the search "Linux IA-64" on Google.
:)
http://www.linuxia64.org/
Working distributions date back from 03/2000
Straight from their page:
IA-64 Linux Distributions
# Caldera Systems (initial release 8/4/00) Download at ftp.caldera.com/pub/OpenLinux64
# Debian (initial release 8/10/01) Download at www.debian.org/ports/ia64
# Red Hat (initial release 5/17/00) Download at ftp.redhat.com/pub/redhat/ia64
# SuSE (initial release 6/13/00) Download at ftp.suse.com/pub/suse/ia64
# TurboLinux (initial release 3/13/00) Download at www.turbolinux.com/ia64.html
Their short list of representative companies include: Caldera Systems, CERN, Debian, Hewlett Packard, IBM, Intel, Linuxcare, NEC, Red Hat, SGI, SuSE, TurboLinux, and VA Linux Systems.
If you search their site, you'll see a few emails from Linus in their mailing list archives, so he's obviously involved at least to a degree (I couldn't imagine him not being involved). I dare say he's educated in the matter, and would know all the ins and outs of, say, putting together an OS.
I'm sure support will be included eventually. Well, maybe not. I know Linux will run on SGI, DEC Alpha, ARM (I'm running Linux on a Compaq iPaq with an ARM CPU), so maybe they'll leave it as a patch and let folks do separate distributions.
I guess it's all in how widely used a processor is. The average Joe doesn't have an SGI, Alpha, or Itanium at their house. (I'll keep quiet about the 150MHz SGI Indy that we use as a doorstop).
Serious? Seriousness is well above my pay grade.
If you want to allow fast syscalls (trust me, you do) you need to keep the kernel mapped permanently to cheapen the context switch from app to kernel. You also probably want to separate physical memory (mapped into the kernel space directly) and virtual space, so that you can have swap and mmap'ed files. You also probably need to keep some address space to map I/O devices into, and some for DMA buffers (unless you really want to give up DMA to get that last memory). What with all the memory on modern video cards, mapping them (to say nothing of the AGP window) is pretty huge too.
So, unless you want to rewrite a lot of stuff, and throw performance completely down the toilet, you need most of that 4GB address space for things other than app VM space. The current Linux split is 1G/3G: 1 gig to map physical RAM into the kernel and store kernel data structures, and 3G of total app address space into which devices, files, swap, or physical RAM pages can get mapped. You can also set Linux up for 2/2, I think (which gives you more physical RAM at the expense of what each app can use), or 4G/PAE (which takes the performance hit and separates the app and kernel, so the apps get all 4G themselves and the kernel uses PAE to map up to 32G in a separate way). But the performance hit is very significant unless your app uses almost no system calls or I/O (device I/O has to get copied around into lowmem for this case).
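For anyone who wants the splits above as plain numbers, here is a small sketch. It assumes nothing beyond a 32-bit (4 GiB) virtual address space divided between kernel and application, and the helper name is invented for the example.

```python
GIB = 2 ** 30
ADDRESS_SPACE = 2 ** 32  # total 32-bit virtual address space in bytes


def user_space_gib(kernel_gib):
    """GiB of virtual address space left for an app after the kernel's share."""
    return (ADDRESS_SPACE - kernel_gib * GIB) // GIB


# Kernel share of 1 GiB and 2 GiB, matching the 1G/3G and 2/2 splits.
splits = {f"{k}G/{user_space_gib(k)}G": user_space_gib(k) for k in (1, 2)}
print(splits)
```

Whatever the kernel keeps mapped comes straight out of what a single process can address, which is why the split is a trade-off and not a free setting.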
The Matrix is going down for reboot now! Stopping reality: OK. The system is halted.
The main point about the VAX architecture is that there was very close liaison between the OS architects and the hardware developers, the result being a secure operating system that worked well with few resources.
Interestingly enough, VMS did get ported to the Alpha, and some of the OS level MACRO-32 assembler code ended up being compiled for the Alpha. Some of the biggest apps still run on OpenVMS Alpha, and I await with trepidation the port to Itanium.
See my journal, I write things there
There are two issues here:
1. There is no difference in the speed it takes to transfer data, because the bus is wider. There is also no difference in the time it takes to process data, because registers are also wider. There is a decrease in cache performance (because addresses take up more space). All other things (CPU design, clock speed, etc.) being equal, this hit would be of about 5%. It would only apply to programs running in 64-bit mode, though (the Hammer can still run in 32-bit mode, and can use 8, 16 and 32-bit pointers even in 64-bit mode, in certain instructions).
2. AMD's x86-64 Hammer doesn't just increase the register size to 64 bits. It adds several new registers, that can (with minor adjustments in the compilers) give a pretty good speed improvement (I'd say about 10% for the same clock speed, although this will depend a lot on the specific program). It also improves the prefetch and adds SSE2 support (one of the few areas where the P4 has an edge). This should give the Hammer approximately a 20-25% improvement over an Athlon XP at the same clock speed (more, if SSE2 is used).
RMN
~~~
Go here for a really good summary of current CPUs.
The Internet is full. Go Away!!!
Ask anyone who has done assembly language programming on x86 and a decent CISC and x86 will always lose out too.
But the x86 has evolved a lot since the bad old days. You could regard the ugly stuff as vestiges of a primitive form and stick to saner modes.
A larger code size can be a significant disadvantage nowadays. Imagine CISC as compressed RISC opcodes. The current situation is the CPU is VERY much faster than the RAM or even the 2nd level cache. So it's not a big deal to have to decompress (decode/expand to RISC) instructions in the CPU. You gain overall processing throughput that way.
As long as that situation remains, larger code size is a significant issue. It means fewer programs in memory.
True RISC processors you talk about are declining. Most are becoming more pragmatic. Which is what Linus is talking about.
L. O. L.
The "Pentium architecture" is 3 completely separate implementations of the IA32 (32-bit x86) architecture:
- P5 (Pentium, Pentium MMX)
- P6 (PentiumPro, Pentium II, Pentium III)
- P7 (Pentium IV)
Each generation is as different from the others as 386 was from 486. One thing all the "Pentium" implementations share in common (aside from the catchy trademarkable name) is a 64 bit data bus. "i686" = P6, and optimizing for it only gives you a "kick in the pants" on P6 CPUs. It has little or nothing to do with the bus width of the Pentium chips; it's all about instruction selection and scheduling optimized for the particular (P6) implementation. That crap about loading "2 instructions at once" and "double the speed of RAM" is nonsense. You have to remember that all data come into the CPU through the caches, which are loaded 32 or more BYTES at a time from memory -- the wider bus just makes cache fills take fewer bus cycles. Alphas (64-bit) similarly had wide busses (128 or even 256 bit) for faster cache fills.
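The point about cache fills can be made concrete with a quick calculation. This sketch assumes a 32-byte cache line (the figure mentioned above) and counts one bus transfer per bus-width chunk; it ignores burst timing, latency, and everything else a real memory system does.

```python
def bus_transfers_per_line(line_bytes, bus_bits):
    """Number of bus-width transfers needed to fill one cache line."""
    bus_bytes = bus_bits // 8
    return -(-line_bytes // bus_bytes)  # ceiling division


LINE_BYTES = 32  # cache line size taken from the post above

for bus_bits in (32, 64, 128, 256):
    print(bus_bits, bus_transfers_per_line(LINE_BYTES, bus_bits))
```

A 64-bit bus halves the number of transfers per line fill compared with a 32-bit bus, regardless of whether the CPU's registers are 32 or 64 bits wide.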
Itanium's problems were visible from the moment the architecture appeared. It is, and was, an architecture that should excel at running Fortran programs, which are much more easily optimized than code written in C, C++, or Java. Compilers written ten years ago should be able to do a decent job compiling Fortran to Itanium with only a modest amount of porting work. Problem is, people aren't just running Fortran on Itanium.
The apparently-dynamic nature of current programs (that is, the intractability of statically analyzing them) has been coming for years. Ten years ago I spent my time studying the inner loops of SPEC benchmarks, and even then the typical inner loop of a C program was the instructions:
compare X with a value
branch out if equal
load indirect through Y to get Y'
load indirect through Y' to get X
branch to top of loop.
If Y (and Y', and Y'', etc) don't address memory in cache, you're hosed. Static prediction algorithms used in some of the first RISC chips (HP-PA, e.g.) work as well as any other on this loop, but you don't know that you're done until you load all the data and compare it. The loop cannot run any faster per iteration than the latency of the memory that happens to hold the data (Cache is King).
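A loose rendering of that inner loop as a linked-list search, in Python for readability. The Node class and find function are hypothetical names for this sketch, and Python obviously hides the actual loads; the counter just marks where each dependent load would occur.

```python
class Node:
    """A list cell: a value plus a reference to the next cell."""
    __slots__ = ("value", "next")

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def find(head, target):
    """Walk the list until target is found; count the dependent loads."""
    node = head
    dependent_loads = 0
    while node is not None and node.value != target:
        # The address of the next cell is unknown until this load finishes,
        # so the iterations cannot be overlapped.
        node = node.next
        dependent_loads += 1
    return node, dependent_loads


# Build 10 -> 20 -> 30 -> 40
head = Node(10, Node(20, Node(30, Node(40))))
found, loads = find(head, 30)
print(found.value, loads)
```

Each trip around the loop has to wait for the previous load, so no amount of compile-time scheduling shortens the chain. Only a cache hit does.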
Object oriented programming, whether accomplished with an OO™ programming language, or just a structure full of function pointers, is about the same can of worms (internally, the processor is caching the last location of the indirect branch, so it is not substantially different from prediction of conditional branches).
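The "caching the last location of the indirect branch" idea can be modelled in a few lines. This is a deliberately simplified toy, assuming a predictor that always guesses the previous target of a call site; real hardware is more elaborate, and the function name is invented for this sketch.

```python
def last_target_mispredictions(targets):
    """Count misses for a predictor that guesses the previous call target."""
    predicted = None
    misses = 0
    for target in targets:
        if target != predicted:
            misses += 1
        predicted = target  # remember the most recent target
    return misses


# One call site that always dispatches to the same method.
same_type = ["circle.draw"] * 8
# One call site whose receiver type alternates on every call.
alternating = ["circle.draw", "square.draw"] * 4

print(last_target_mispredictions(same_type),
      last_target_mispredictions(alternating))
```

In this model a call site that always hits the same method misses only once (the cold start), while one that alternates between two methods misses every time, much like a conditional branch that flips direction on every iteration.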