Linus Has Harsh Words For Itanium
Anonymous Coward writes "As a follow up to the earlier story "Intel: No Rush to 64-bit Desktop"... In words that Intel are likely to be far from happy with, the Finnish luminary has stuck the boot into Itanium. His responses to some questions on processor architecture are sure to be music to AMD's ears. Linus, in an Inquirer interview concludes: "Code size matters. Price matters. Real world matters. And ia-64... falls flat on its face on ALL of these."" Of course, Linus works for a chip maker ;)
Now, we all know that the Itanium isn't everything it's cracked up to be, and I think none of us at are wrong in blaming intel for coming out with a lousy product....
But, isn't one of those situations he mentions in the interview (namely, running a large database server) what this chip is designed to be doing?
As I recall, the IA64 isn't designed for the desktop user... In fact, desktop users probably don't even need 64 processing for a number of years still....
Yet we're attacking Intel for making the chip to fit it's niche?
Perhaps we need to be more fair in the context of the usefulness of the chip, instead of considering it in all contexts and criticizing it based on that?
Linus being opinionated and brash? Never!
"but speaking strictly from the technological point of view"
I think that was his point. It's great technically but it sucks in the real world. If its not practical its a shitty architecture IMHO.
I also think the x86-64 is a more viable solution as well.
----
Go canucks, habs, and sens!
Now I'm no programming guru, but it seems to me that the x86-64 architecture is a great one. In fact, the only thing that I could see being done to improve it would be to add more general purpose registers. I believe that the new registers are all GP (IIRC), but I think that makeing them ALL GP (even the older ones) would be good, and maybe bring up the number of registers to a good round 32 or something. Am I missing something glaring wrong? If you're going to toss out all of the x86 stuff (like ia-64), I think you should be able to emulate it in hardware about as fast as current x86 processors can. When Apple switched to PPC, couldn't they emulate 68k code about as fast (or at least faster than 1/2 the speed of) the fastest 68k chips?
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
The best architecture is still VAX. Clearly string operations at the processor levels is what any procesor needs to be the best and fastest ;}
Worse is better
although the original essay talks about Unix and the LISP machines, it just keeps being true. Linus talks about the "charming oddities", well there you go: worse is better. Try for perfection, and the real world will eat you alive.
I also think he's right about the masses being what matter; I think Intel is still thinking about the data centre, not Joe Sixpack, with Itanium.
ZOMG I WOULD LOVE TO KNOW ABOUT YOUR FEELINGS ON MACINTOSH VERSUS WINDOWS, VI VERSUS EMACS, AND HOW YOU'RE NOT A DORK
Of course, Linus works for a chip maker
And if trends continue, it could be Old Dutch.
So he is more likely to know what he's talking about.
Personally, I'm getting a bit tired of all the inane cynicism that passes for reflective commentary in modern society. While it's true that the world has its villians, it is more true that people often just hold opinions irrespective of their economic interest. I for one, trust that Linus is among these favored many.
(Not joking this time)
Sun has an interesting( biased) peace on Itanium. If I were buying a server I would avoid Itanium like the plauge. It is possible that Intel could even cancel the whole project and leave customers high and dry. Not to mention software availability is a problem.
I prefer the risc architecture. I like the idea of keeping things simple and efficient which is alot like structured programming. VLIW does not follow this ethic.
http://saveie6.com/
Netcraft confirms it: Itanium is dying.
One more crippling bombshell hit the already beleagured Itanium community when Slashdot confirmed that Linus thinks Intel dropped the ball with Itanium. Itanium now powers 0.00% or all servers. Coming on the heels of a Netcraft survey which plainly states that Itanium has gained absolutely NO market share. This reenforces what we've known all along: Itanium is collapsing in complete disarray.
You don't need to be a Kreskin to predict Itanium's future. The writing is on the wall: Itanium faces a bleak future, in fact there won't be any future at all because Itanium is dying. Intel has dumped millions into Itanium, red ink flows like a river of blood.
All major surveys show that Itanium has steadily held its ground at 0.00% use while millions of other processors are produced daily. If Itanium is to survive at all it will be among CPU dilettante dabblers and hangers-on. Nothing short of a miracle could save Itaniu, at this point in time. For all practical purposes, Itanium is dead.
Trolling is a art,
Mickey-mouse == poor quality, inconsistent
Outfit == organization, company.
Free Java games for your phone: Tontie, Sokoban
i have used several PCs/Sparcs and in all of them, before discarding, I have upgraded memory to 2x to 4x times originally it came with. We don't need more than 4 GB now; but if you were to buy a computer now, you need to make sure that you would be able to upgrade ram to 4x. This means if your need is greater than 1 GB, then 32 bit system is not suitable for you. At present, all my home and work machines (new ones) are ordered with 1 GB. This is at the border of 4x memory expansion possible with 32 bit system. So in a year or so, I will definitely not buy another 32 bit machine.
look here:
pricewatch
almost $3000 for the chip. wow, and for so many mhz, too...
--
"It is now safe to switch off your computer."
Actually that's a good question. I think chipmakers should slow down a bit and enjoy life. Perhaps meet halfway with a 48 bit chip...
Trolling is a art,
Code size matters because *cache* isn't cheap. Worse, you can't make L1 cache arbitrarily fast without slowing down your chip big time.
Check the latest SPEC CPU benchmarks. The Itanium2 has the fastest floating-point score and is no slouch in the integer tests either. It will improve. Linus will eat his words in a few years.
"Did anybody notice a 2X performance boost moving from Windows 3.1 on 16bit MS-DOS to a nominally 32bit Windows 95 OS?"
I did. There was so much less time in between crashes that I learned to move quickly!
Well, that fud filled anti-MS joke should earn me a Karma point or two.
The read from theinquirer.net is all wrong. The slashdot story line is also wrong. It does not state at all what it implies. Here is the link to what Linus actually wrote:
3 02 .2/1909.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0
Now, I agree with Linus on the PPC MMU issue. Can anyone tell me what he means by "baroque instruction encoding"? I have been doing x86 and 68k assembler for a long time, I have never heard of this.
Enjoy,
It's just the normal noises in here.
Linus isn't saying he won't let it in. He's simply saying that the thinks it's not a good arch based on technical merit. He'll let it in. He never said he wouldn't. He's just saying he doesn't like the way the chip was designed (what choices they made, etc).
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
What are YOU smoking?
Optimizations done at compile time are far better than optimizations done at runtime. At compile time, more is known about the structure of the program, where the flow of the program will be going, and more time intensive optimizations can be done than ones done in realtime in the cpu.
Itanium is slower right now, but as compilers with optimizations tailored to it come out, it has the potential to kick every RISC processor's ass. The reason for this is that RISC processors are bogged down by doing the optimizations at runtime that Itanium doesn't have to care about. This means the Itanium will have the same or less stalls and more efficient use of the processor.
Go read up on compiler optimizations to see why cutting out the middleman of instruction sets is a good thing.
I think the problems with the Itanium boils down to this:
1. The CPU's are insanely expensive. They make the majority of x86-architecture Intel Xeon CPU's look like a bargain.
2. Where are the server applications that take advantage of the Itanium CPU? They're not exactly widely available, to say the least.
3. Programming for Itanium is still a somewhat iffy proposition.
Meanwhile, AMD's Athlon 64/Opteron offers these advantages:
1. The CPU will definitely NOT be insanely expensive to purchase.
2. Programming for the AMD x86-64 architecture is not going to require kiboshing a bunch of legacy programming tools and starting from scratch--it is a straightforward process to convert today's programming tools to take full advanratge of the x86-64 native mode.
3. Because the programming tools are so readily available, both operating systems and applications for the Athlon 64/Opteron will be available widely by the time the new AMD CPU's are finally released for sale. Already, UnitedLinux is porting Linux to run in x86-64 native mode, and Microsoft is very likely readying versions of Windows XP Home/Professional and Windows 2003 Server that will run in x86-64 native mode.
Meanwhile, Intel supposedly has a 64-bit x86-architecture CPU codenamed Yamhill that has developed. However, given we don't know how Yamhill implements 64-bit x86 instructions Intel will have to do some VERY serious convincing to Linux kernel programmers and to Microsoft to write Yamhill-native code--and Intel is far behind the AMD efforts.
Not only does he work for a chip maker, he's like totally obsessed with the i386 architecture. I guess it's what he cut his teeth and and he's going to stick with it. But to think that no-one else has a use for it is very short-sighted.
He works for a company that doesn't build chips with the i386 architecture. Its emulated in firmwear, "code morphing" is what they call it. Its slightly slower than hardware but its worth the trade for power consumption.
I am betting he has worked with plenty of morph code, creating virtual cpus, subsets of the i386 chip, or different completely. This is akin to designing hardware, in software.
I can't see how him working for Transmeta hurt his understanding of processors. Seems like it would actually enhance his understanding.
Tequila: It's not just for breakfast anymore!
AMD is the wildcard. If x86-64 is the bomb and takes off like AMD is betting on it. Intel lost the 64bit war for many years. IBM and maybe even Sun will quietly (well sun doesn't do jack shit quietly) push x86-64 for the low end while IBM POWER4 and POWER5 and POWER6 down the road run the big end.
Basically Intel needs something like Sun to jump on it IA64 to really give it some credibility and they don't sound real eager to. IBM sounds like they are down for the fight. Alpha, MIPS, PARISC are all pretty dead; long term and relatively speaking. Meanwhile, if Intel doesn't get on the shit quick then they'll have to support x86-64 too and that's the real death blow to IA64.
As time goes by, computer languages are trending towards more dynamic behavior. This tends to favor things like JIT compiling and linking into already running programs. Fewer people are going to be able to afford the luxury of spending hours to preprocess their code to fit into an extremely static ("explicitly parallel") hardware model. This will be especially true when chip makers treat their rocket science static compilers as a separate profit center.
Not to mention, the CPU is the one that is actually in the position to know what optimization is needed right now based on the currently running data set. Given that there is usually a several year lag between the latest CPU developments and widespread compiler support, I'd go for a CPU that knows how to do its own tricks. (Hasn't the Itanium architecture been nailed down for almost a decade now? And we're still waiting on better compilers for it?)
That doesn't mean it's the best solution. Merely the one that's going to win. Architecturally speaking, x86 is one of the biggest loads of crap to come along since...well...hmmm...I can't think of anything crappier off the top of my head.
Extreme register pressure. Segmentation models that make you want to retch. Hacks (PAE, anyone?) that leave any sane designer gibbering incoherently.
If you read the thead, Linus' main argument seems to be "to get good performance, all the other architectures have had to do complex things in hardware, so there's no real hardware simplification in going with a 'better' architectural design. Plus variable length opcodes are a natural cache optimization!"
I respect Linus a great deal, but he's talking out of his ass here. I agree that IA-64 may be best relegated to some academic's wet dream, but just about any of the major RISC architectures are big wins over x86. Intel and AMD have worked miracles with x86 to get it to run fast, but at a staggering engineering cost. The teams working on RISC chips tend to be a fraction of the size to come out with a high-performance chip. If the RISC houses had an engineering team of comperable size (and access to the same bleeding edge lithography processes) it would easily be worth an extra 25% in performance, minimum.
If you look in the embedded world, just about anything that requires serious embedded performance is RISC based (MIPS/ARM, mostly), simply because it decreases the engineering work involved by an order of magnitude. Plus, writing low level software for just about any RISC chip is loads easier than for x86.
Unfortunately, x86 is here to stay for the foreseeable future. Intel killed Alpha, not by buying it, but by doing a great job of pushing cheap x86 performance to the same level as Alpha, often surpassing it in later years. The same thing is happening to the other workstation-class RISC vendors, and, honestly, to Itanium, too. I don't see any reason to believe the march to x86 hemogeny outside the embedded world is likely to slow anytime soon.
Most home users are going to see a performance drop from 64 bits. 64-bit code needs 8 bytes to hold every pointer. This will serve to eat up more cache and memory bandwidth, which are already major bottlenecks for any CPU.
Unless you have a program that actually needs to work on more than 2G of data at one time, 64 bits buys you nothing but extra time waiting to move around millions extra of zeroed out upper bytes.
Some people need that much data in memory at once, but the average home user today doesn't. People typically mention video, but if there's one thing that's easy to stream in and out of a smaller memory space, it's multimedia data.
Linux made him ... oh wait nevermind.
Transmetta makes a lot of ... oops there I go again.
Intel is a company that time and time again proves it knows how to make money. It may not always support the crowds it should (like /. readers and superusers) but they are still making money.
Sure there are lots of difficulties going to a new ISA. Especially at the server level. And yes Itanium has had some performance problems, especially in its first revision, but then again when was the last time you saw a company produce a 1st generation microprocessor and have it do well?
IA64 offers tons of advanced ILP concepts and OS concepts that, when correctly implemented, can increase performance drastically. (if your looking for examples, data speculation, control speculation, predication, registers with kernel access only, rotating register files, a much larger register set, etc).
The problem may be, it puts a lot of complexity into the Compilers, and compiler technology isn't good enough for Itanium yet.
But then again, what do I know, Linus has made more money than I have. I just like arguing the other side while everyone else screams about how the Itanium will die.
First of all it is not very smart to try to reduce code size by putting complicated instructions in the processor architecture.
A succesfull architecture may be used for 20 years, and there is no way you can know which complex instructions will be most usefull/popular in several years. And when you start making upgraded chips for a design, these complex instructions will be a real pain in the ass.
The x86 architecture is a perfect example - it is a mess and many of its instructions are not used at all. The x86 is succesful because the way history played out - it was put on the first pcs, and the incredible numbers of precessors sold allowed intel to put more development money into that architecture than any body else was able to put into theirs. And large initial investments, and large sales numbers mean that individual chip prices can be lower.
Nevertheless, the alpha and some of sun's chips can still compete with intel in the server environment, with much smaller investments and worse production technology. That basicly shows the weakness of the x86 architecture.
When you have multiple pipelines and multiple stages per pipeline the size of your chip will grow exponentially to the number and complexity of your instuctions. Eventually adding more pipelines will be pointless and then you are reduced to adding cache as the only way you can improve your architecture.
For a Risc architecture, multiple pipelines will cost less overhead and more can be used. Processor performance can be increased by adding more pipelines without having to increase speed.
Intel has the money and the clout to make a succesful risc architecture. It is brave of them to do it, but from an engineering point of view it is the only right thing to do.
AMD will support x86 because they do not have the clout to force a new architecture on the world. It is a completely understandable policy, but then again will result in worse performance (unless their engineers are somehow much more brilliant than intel's).
Of course the real world matters and in the real world almost everyone uses x86. But if someone can change that it is intel.
Early on the chief advantage of the approach was that you could use the freed silicon for things like extra registers, and that's exactly the approach taken by Acorn (now ARM) and the PowerPC range. Would you prefer to have eight registers and a single byte copy-block instruction, or 64 registers and have to replace that copy-block instruction with (*gasp*) three simpler instructions?
(Actually, I guess that depends on how good your cache is. There's no such thing as a free lunch)
You are not alone. This is not normal. None of this is normal.
64-bit code needs 8 bytes to hold every pointer. This will serve to eat up more cache and memory bandwidth, which are already major bottlenecks for any CPU.
The only thing this eats up is cache; because the system has a correspondingly wider data bus, there isn't a hit in memory bandwidth (unless the designers are trying to be cheap bastards and give a 64-bit CPU the same data bus width you'd use for a 32-bit CPU). And most 64-bit CPUs have a lot of cache.
And as for what kind of applications you potentially need several gigabytes worth of memory for, there's scientific processing and the like.
Ever since the 8086/8088 duo, the bus width of a CPU has been decoupled from its word size. For a long time, the external bus width of (non Rambus) 32-bit CPUs has been wider than 32 bits. This works because the memory unit fetches entire cache lines. The CPU designers could be less cheap bastards today and bring out 32-bit CPUs with 256-bit wide busses if they wanted to.
And most 64-bit CPUs have a lot of cache.
You could put a lot of cache in a 32-bit CPU. You could put a small cache in a 64-bit CPU. In fact, the biggest difference between high-end and low-end CPUs is just the size of their caches.
To be fair, the current Itanium has an enormous cache that uses the vast majority of the die size and dicates its price and power consumption. It's logic core really isn't that big. If you embedded an X86 core in all of that cache, you'd get a very fast chip. If you teamed up an Itanium core in a Celeron cache, you'd get Celeron-level performance. 64 bits has little to do with it; you're mostly paying for cache and bandwidth when you buy high end CPUs.
Without Windows for x86-64, AMD is dead. No, Linux will not save it. However, the moment Microsoft releases Windows for x86-64, Itanic is history. The market will overwhelmingly favour x86-64 because of the much lower price (I expect at least 3-4 times lower, cosidering that the Itanic CPU alone sells for over $3000), and perfect backwards compatibility. Itanic's ia32 support is so pathetically slow that it may as well not exist, so a move to Itanic requires you to replace _all_ your software, which ain't cheap, while x86-64 allows you to do incremental upgrades. So, taking simple economics into account, Itanic will go the way of that ship and AMD will emerge the winner... provided there is a version of Windows for x86-64. Without that there is no point of talking about "64 bit desktop" market because it just won't exist. So what is Microsoft doing?
___
If you think big enough, you'll never have to do it.
The pentium architecture has been loading 64 bits of memory at a time since the PII. They have to because that is the only way the RAM has a chance in hell of keeping up with the processor. Basically they load 2 instructions at once, and have them execute at double the speed of the RAM. (That's also part of why you get such a kick in the pants when you optimize with the -mcpu=i686 flag in gcc.)
"Learning is not compulsory... neither is survival."
--Dr.W.Edwards Deming
Sorry while I rant, but you just stomped on one of my nerves. (Unless your comment about neededing that much RAM was a complaint about Adobe or their direct *cough* compeitors -- sucks to be you.)
<Old Geezer Mode> In one case, not long ago, a fellow lab-rat Eric Mortenson had sold his research and tools to Adobe, but part of the poorly-written agreement said that he couldn't upgrade his work station. So he finished his Ph.D on a 386 with 32-MB of RAM, while the rest of us in the lab were using Pentium 3's, DEC Alpha's, and various SGI boxs. Eric's algorithms ran great on the newer PC's even though he couldn't develop them on the new boxes. Other with Adobe (NOT on that web site interestingly enough) needed the DEC Alphas (64-bit machines) with scads of memory and much more running time to do a similar implementation of Eric's algorithms. </Old Geezer Mode>
3D rendering doesn't take that much RAM. As a 3D graphics researcher and developer, I have worked with models where individual objects were multi-gigabytes (meshes+textures and volumes) but even then, having 1GB of RAM was more than enough for us to reach 20-30 FPS realtime on a box with NT4 and first- and second-generation 3D cards. Software rendering with very realistic detail was a little slower (3-5 fps) but was fine for writing movies. Progressive geometry & texture transmission, continuously calculated view-dependant detail levels, and other current and not-so-current research would solve the memory problems in 3D. Don't believe me? Go to Visualization 2003 and see if the leading researchers are finding RAM as their primary bottleneck. It is a bottleneck of course, but processing speed, caches, and the system BUS limitations are far more troubling.
As for video editing, you only need enough memory for the tools, a few frames, and whatever operations you are performing. In every case that I've had to do video editing, I've seen two classes of tools -- those that take gobs of memory and try to copy the entire video clip into RAM and end up thrashing for memory -- and those that intellegently figure out what is needed and use only the memory needed for the app.
An example of the first, an Adobe AfterEffects rendering a simple math function over time was only able to render 30-seconds because it wanted to buffer the AVI file in memory and ran out of RAM (2GB) after a several-hour rendering. An example of the second, a simple home-brew compositor that used the Windows multimedia API to write the AVI to disk -- the same machine and the same set of images required about 45 minutes to render the entire clip.
So instead of saying:
I would suggest you say " I need to buy tools that are properly designed and implemented for my class of computer. "
Frob.
//TODO: Think of witty sig statement
It looks like all the big Linux distributions have gotten together to support the IA-64 Linux development.. This was the first hit on search "Linux IA-64" on google.
:)
http://www.linuxia64.org/
Working distributions date back from 03/2000
Straight from their page:
IA-64 Linux Distributions
# Caldera Systems (initial release 8/4/00) Download at ftp.caldera.com/pub/OpenLinux64
# Debian (initial release 8/10/01) Download at www.debian.org/ports/ia64
# Red Hat (initial release 5/17/00) Download at ftp.redhat.com/pub/redhat/ia64
# SuSE (initial release 6/13/00) Download at ftp.suse.com/pub/suse/ia64
# TurboLinux (initial release 3/13/00) Download at www.turbolinux.com/ia64.html
Their short list of representative companies include: Caldera Systems, CERN, Debian, Hewlett Packard, IBM, Intel, Linuxcare, NEC, Red Hat, SGI, SuSE, TurboLinux, and VA Linux Systems.
If you search their site, you'll see a few emails from Linus in their mailing list archives, so he's obviously involved at least to a degree (I couldn't imagine him not being involved). I dare say he's educated in the matter, and would know all the in's and out's of say putting together an OS.
I'm sure support will be included eventually.. Well, maybe not.. I know Linux will run on SGI, DEC Alpha, ARM (I'm running Linux on a Compaq iPaq with an ARM CPU), so maybe they'll leave it as a patch and let folks do seperate distributions.
I guess it's all in how widely used a processor is.. Not the average Joe has an SGI, Alpha, or Itanium at their house. (I'll keep quiet about the 150Mhz SGI Indy that we use as a doorstop).
Serious? Seriousness is well above my pay grade.
The key point about Itanium is that it is a horrible general purpose processor but it is a serious contender to be very good processor for supercomputing. It has very good floating point performance and the EPIC architecture is designed to be very good on Fortran, especially vectorizable Fortran which is very prevelent in HPC applications. What Linus said is correct in the context of Itanium as a general purpose processor, but its doesn't give Itanium the credit its due as a floating point supercomputer which is the only place its going to sell and is what it was designed for.
It will probably never be very good for most C and C++ apps. Pointer aliasing in particular will give the Itanium compiler fits. Unless you manually tell the compiler there are no two pointers accessing the same memory the compiler can't safely or effectively pack the parallel instructions in the VLIW and that is the essential to good performance in VLIW.
You do have to really question the sanity of some execs at Intel and HP for spending the staggering sums they've spent on Itanium. Supercomputing just isn't big enough a market for them to have any chance to recoup their investment in our lifetime and they aren't going to sell it in to the mass market as Linus said.
For a general purpose 64 bit processor to run existing C and C++ applications AMD is going to win hands down. But as many have noted its not likely most people are going to really need a 64 bit processor anytime soon so Intel will probably do just fine selling 32 bit x86 processors for a while.
@de_machina
Yes, RISC programs tend to be longer (sometimes considerably) than CISC programs. There are actually two (main) reasons for this. First, as you mentioned, you need to replace single instructions (usually ones that do memory-to-register operations) with multiple operations (such as a load followed by an math operation). Second, the instructions themselves tend to be longer, since most (all?) RISC archetectures have exclusively 4-byte opcodes - usually something like 1 byte for the opcode, a few bits for flags, and the remaining for up to three arguments. (register numbers or immediate values). CISC archetectures have varying length opcodes, some a signle byte and some several bytes.
There are a couple of mitigating factors here. First, compilers are usually not very good at using some of the more complicated combined instructions, so they go unused, inflating CISC code to match RISC code. Second, careful optimization of RISC code can identify repeated or unecessary memory operations, and eliminate them. When the memory operations are tied to the arithmetic (or other) operation, that is not possible. Finally, since RISC archetectures generally have large register files and all registers are equivelent, fewer operations are needed to shuffle bits around to where they can be worked on, whereas i386 has a lot of operations that only work on certain registers (though far less than earlier incarnations).
I used to be a big fan of Intel cutting life support on the 386 archetecture, because RISC is so obviously cleaner and nicer. However, I have started to believe the AMD hype about x86-64, which is basically along the lines Linus talks about here. RISC vs. CISC doesn't really matter any more, and the i386 archetecture is not so bad. If you A) add some more general purpose registers, B) eliminate most of the remaining register usage restrictions, and C) Ditch the worst (looking and performing) FPU on the market in favor of almost anything else, you have yourself a very servicable archetecture. Extend the registers and addressing to 64 bits, and you have something that has a lot more room to grow. That is what the x86-64 is, and despite all the rumors that Intel has their own 64 bit extension to x86, if they don't actually release soon people will start to adopt x86-64 and they will be in the unenviable posistion of having AMD dictate the future of their product line.
I have heard frequently that something like only 5% of the transistors on the PPro core were tied to the "i386ness" of the core. I assume with the P4 that number is even less. It seems then that the instruction set is not as big of a deal as we would like to think.
The thing that puzzles me about ia64 is: if the whole point is to "make the compiler do it", and none of the fancy instruction reordering is done in silicon, why is it so expensive?