Intel Reveals Itanium 2 Glitch
NeoChichiri writes "News.com is running an article about glitches in Intel's Itanium 2 chips. Even though it doesn't affect all chips, they have still stopped shipments of the new 450 Servers until the problem is resolved. Apparently it has to be 'a specific set of operations in a specific sequence with specific data.' Intel is saying that it affects the 900 MHz and 1 GHz Itanium 2 chips and that it will not affect the upcoming 1.5 GHz Itanium 2 6M chips." Until the next iteration of the chip arrives, though, Oliver Wendell Jones writes, "they recommend working around the problem by underclocking the processor to run at 800 MHz instead of its default 900 MHz or 1 GHz."
Is it a glitch or did they sell chips that can't run at the rated speed?
Is this something that could be addressed by a microcode update? I've always wondered exactly what can be done with the kernel support for microcode updates.
On a side note -- who exactly didn't expect something like this? Intel has a history of this sort of thing -- from the 80486DX not being able to add properly, and IBM having to halt shipments of PS/2 machines; to the Pentium F00F bug and others. Buying first run Intel chips is like playing dice with your business. Give them a few production runs to work out the bugs...
Learning HOW to think is more important than learning WHAT to think.
Apparently it has to be 'a specific set of operations in a specific sequence with specific data.'
This sounds similar to the way they described the floating point divide error in the original Pentium. How long until they start giving odds on the chances of someone seeing the problem in normal use?
Jason
ProfQuotes
"they recommend working around the problem by underclocking the processor to run at 800 MHz instead of its default 900 MHz or 1 GHz."
Why not just buy the lower-clocked CPUs then? Will Intel replace the crap chips when a revision with a fix comes around?
"If the customer feels it's the right solution, we'll exchange processors with ones that aren't affected," she said. Intel has developed a simple software test that can determine whether a chip is affected.
Meaning what? Lower-end chips that aren't affected, or a fixed version of the same chip? If it's the same chip, who wouldn't think it is the right solution? The article doesn't say whether the problem has actually been solved either, only that it seems to be something of an anomaly that doesn't affect all chips.
Not a good day for Intel, and probably another reason why you don't immediately need that "Newest on the shelf" CPU, whether for your home machine or a server. Besides, by the time this chip is assuredly fixed, a faster revision will probably be out at a comparable price.
When you consider all the bugs that come through in higher-level programming, where everything is object oriented and human readable, it really comes as a surprise that you don't see more bugs in hardware, given the complexity and low-level nature of the problem.
He who knows not and knows he knows not is a wise man. He who knows not and knows not he knows not is a fool.
Does anyone else find it ironic that when Intel makes one mistake in a processor, everyone jumps on them for making a bad product, but software companies can sell products with thousands of bugs in them and people accept this as normal? Sure, we complain about buggy software, but I don't think anyone here expects any software to be completely bug-free. Why are Intel and other chip manufacturers held to such a high standard? Or, more importantly, why are software companies not held to the same high standards? If Intel and AMD can make incredibly complex processors that are (usually) completely bug-free, why can't any software company in the world make any product that even comes close to being free of defects?
Disclaimer: The opinions expressed are not necessarily my own, as I've not yet had my medication today.
who exactly didn't expect something like this? Intel has a history of this sort of thing
Of course when it happens to Intel, then EVERYBODY knows about it. My question is, how prevalent is this sort of thing throughout the CPU industry? Anyone know of other "mistakes" by the other major players? It's hard to imagine that only Intel makes these kinds of goofs, esp. with the complexity of today's chips. As an example, wouldn't Mot's failure to scale up the G4 PPC chips be considered an "error"? They just caught it early enough not to ship any chips and say "oh, we're sorry, our G4s won't go as fast as we originally stated, wait another year and a half or so and we'll get it all sorted out". Didn't they also do a similar thing with the 68040?
Detected data corruption can possibly be fixed, such as a single-bit error on an ECC-protected memory bus. And if the ECC error is a multibit error and unfixable, the OS is notified that the read got corrupted, and the OS is free to try again, shut down, or do whatever it thinks is reasonable. That's still better than, say, getting the wrong stock buy order for a financial transaction.
So, to a certain extent, detected data corruption is fine, because server class chipsets are designed to be able to correct or handle some types of mistakes.
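To make the correct-one-bit, detect-two-bits behaviour concrete, here is a toy sketch in Python. It is a textbook Hamming(7,4) code with one extra overall parity bit (SECDED), protecting a single 4-bit value. Real server memory uses wider codes over 64-bit words, and nothing here is taken from any particular chipset; the names `encode` and `decode` and the status strings are made up for illustration.

```python
def encode(nibble):
    """Encode a 4-bit value as an 8-bit SECDED codeword (list of 0/1 bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with check bits at 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits: 0 for a valid Hamming codeword,
    # otherwise the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Exactly one bit flipped; syndrome 0 means it was the parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        # The hardware equivalent is to report the error rather than guess.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` and passing the result to `decode` gives back `("corrected", n)`, while flipping any two bits gives `("uncorrectable", None)`. That second case is the "OS is notified" path: the data is known to be bad, so nothing silently wrong gets used.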
Itanium is a very new architecture. It has the potential for kicking i386 chips in the butt once it has a chance to grow up. With anything as radically new as the Itanium, there is a high probability of unexpected problems. AMD has not had this sort of problem recently because they don't have any balls. All they ever do basically amounts to minor tweaks of a stable design. Even their 64-bit extensions fall into this category.
The type of problem Intel is dealing with could very well be in a new class. I have a hunch that it has to do with either unexpected capacitive coupling (possibly related to an in-spec extreme of the process variation) or thermal transients causing timing skew. These types of phenomena are nearly impossible to model, especially if they are tied to a particular set of process deviations. That is why manufacturers do such extensive qualification testing. Unfortunately this testing cannot be done until there are enough units to test (like in the 1000s). This does not happen until the device is ready for production. Technically, this is the Pilot phase of development.
One needs to give Intel some credit for learning a lesson from the Pentium fiascos (not just the math error, but also the original (5V) 90 MHz burn-up issue). At least they are doing the right thing now. Corporations, like people, sometimes need to learn the hard way. Unfortunately, though people usually retain their lessons, corporations sometimes need to relearn them, especially when being run by greedy BODs (or board members with hidden agendas). AMD has yet to learn this particular lesson. One of these days, they will try to cover up a problem and it's not going to work. They have gotten away with some stuff already because everyone loves to hate Intel (me included, 68000 and PowerPC for me!)
Unless you're familiar with LSI semiconductor manufacturing, you should not be commenting, because you don't have a clue as to what is going on. The posts I've read so far remind me of what a class of 10-year-olds would write in critiquing Joseph Conrad's "Heart of Darkness".