Intel Reveals Itanium 2 Glitch
NeoChichiri writes "News.com is running an article about glitches in Intel's Itanium 2 chips. Even though it doesn't affect all chips, they have still stopped shipments of the new 450 Servers until the problem is resolved. Apparently it has to be 'a specific set of operations in a specific sequence with specific data.' Intel is saying that it affects the 900MHz and 1 GHz Itanium 2 chips and that it will not affect the upcoming 1.5 GHz Itanium 2 6M chips." Until the next iteration of chip arrives though, Oliver Wendell Jones writes, "they recommend working around the problem by underclocking the processor to run at 800 MHz instead of its default 900 MHz or 1 GHz."
The Itanic 2 appears to be going down like the first...
Is it a glitch or did they sell chips that can't run at the rated speed?
*in Homer Simpsons voice* 'mmmmmm.....Itanium 2 chops.......glazhzhzhz'
Is this something that could be addressed by a microcode update? I've always wondered about exactly what can be done with the Kernel support for microcode updates.
On a side note -- who exactly didn't expect something like this? Intel has a history of this sort of thing -- from the 80486DX not being able to add properly, and IBM having to halt shipments of PS/2 machines; to the Pentium F00F bug and others. Buying first run Intel chips is like playing dice with your business. Give them a few production runs to work out the bugs...
Learning HOW to think is more important than learning WHAT to think.
Perhaps they should put some silver stuff over the serial number. Welcome to the Intel Itanium scratchcard lotto, those with bad chips win a new one :)
Underclocking is typically necessary if a part needs more voltage than is allowed for with the default configuration. This is why when you overclock, the converse is generally required; you can get better overclocks by increasing voltage.
Obviously, Intel are not going to encourage people to increase the voltage of their processors in order to run them at the default speeds, as this can run the risk of thermal damage to the chip with insufficient cooling, or overly high voltages. It may however still represent an option for system administrators who are keen to retain the performance of the chip.
When you consider all the bugs that come through in higher level programming where everything is object oriented and human readable, it really comes as a surprise that you don't see more bugs in hardware considering the complexity of the problem and low level nature.
He who knows not and knows he knows not is a wise man. He who knows not and knows not he knows not is a fool.
Does anyone else find it ironic that when Intel makes one mistake in a processor, everyone jumps on them for making a bad product, but software companies can sell products with thousands of bugs in them and people accept this as normal? Sure, we complain about buggy software, but I don't think anyone here expects any software to be completely bug-free. Why are Intel and other chip manufacturers held to such a high standard? Or, more importantly, why are software companies not held to the same high standards? If Intel and AMD can make incredibly complex processors that are (usually) completely bug-free, why can't any software company in the world make any product that even comes close to being free of defects?
Disclaimer: The opinions expressed are not necessarily my own, as I've not yet had my medication today.
who exactly didn't expect something like this? Intel has a history of this sort of thing
Of course when it happens to Intel, then EVERYBODY knows about it. My question is, how prevalent is this sort of thing throughout the CPU industry? Anyone know of other "mistakes" by the other major players? It's hard to imagine that only Intel makes these kinds of goofs, esp. with the complexity of today's chips. As an example, wouldn't Mot's failure to scale up the G4 PPC chips be considered an "error"? They just caught it early enough not to ship any chips and say "oh, we're sorry, our G4's won't go as fast as we originally stated, wait another year and a half or so and we'll get it all sorted out". Didn't they also do a similar thing with the 68040?
I have about 6 years experience in Quality Assurance, with emphasis on electronics, manufacturing processes and attention to detail.
You know...if you're looking for anyone that is.
Your geek membership has been revoked. Hand in your pocket protector at the door. OutOutOut!
"History doesn't repeat itself, but it does rhyme." Mark Twain
There isn't much detailed information about the exact conditions that bring out the bug, but they do state that the bug is electrical, that some unspecified combination of instructions and data pattern are needed, and that reducing the clock frequency avoids the problem. I can think of several things that might cause the bug. These are just guesses.
One possibility is that there is a slow timing path in the logic that is marginally meeting the 900MHz or 1GHz clock speed. Going to 800 MHz gives the slow path more margin. This is the easy answer.
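To put rough numbers on the slow-path guess, here is a minimal Python sketch. The path delay of 1.15 ns is a made-up figure chosen only for illustration, not anything Intel has published, and the helper names are hypothetical. It just shows how dropping to 800 MHz lengthens the clock period and turns negative slack into positive slack.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(freq_mhz, path_delay_ns):
    """Timing slack: positive means the path settles before the next clock edge."""
    return clock_period_ns(freq_mhz) - path_delay_ns


# Hypothetical worst-case path delay; NOT a figure from Intel.
HYPOTHETICAL_PATH_DELAY_NS = 1.15

for freq in (1000, 900, 800):
    s = slack_ns(freq, HYPOTHETICAL_PATH_DELAY_NS)
    status = "meets timing" if s >= 0 else "fails timing"
    print(f"{freq} MHz: period {clock_period_ns(freq):.3f} ns, "
          f"slack {s:+.3f} ns ({status})")
```

With that assumed delay, the path misses timing at both 1 GHz and 900 MHz and only has margin at 800 MHz, which is the pattern the workaround would suggest if this guess were right.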
Another possibility is that they have some part of the chip that has insufficient metal to deliver power to the logic gates. The right combination of activity might cause enough voltage droop to cause logic errors. Slowing the clock reduces the power consumption in CMOS chips.
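The claim that slowing the clock reduces power follows from the usual first-order model of CMOS switching power, P = a * C * V^2 * f. A small sketch of that scaling, assuming the activity factor and capacitance stay the same and ignoring leakage (the function name is hypothetical):

```python
def dynamic_power_ratio(f_new_mhz, f_old_mhz, v_new=1.0, v_old=1.0):
    """Ratio of new to old dynamic power under P = a * C * V^2 * f.

    Assumes activity factor and switched capacitance are unchanged,
    and ignores leakage current entirely.
    """
    return (v_new / v_old) ** 2 * (f_new_mhz / f_old_mhz)


# Underclocking at constant voltage: power scales linearly with frequency.
for old in (1000, 900):
    ratio = dynamic_power_ratio(800, old)
    print(f"{old} MHz -> 800 MHz: about {ratio:.0%} of the original switching power")
```

Under this model, lower switching power means less current drawn through the power grid, so any voltage droop should shrink as well. Whether that is what actually happens on the Itanium 2 is, like the rest of this, a guess.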
They might have a crosstalk problem between some signals that could flip bits when the right activity and frequency are combined. Slowing the clock can shift the relative positions of signal transitions.
Eventually more details might surface, but Intel is probably keeping it quiet so that people don't write code to maliciously crash servers.
"Open the Itanium register sets, HAL."
"I'm sorry, Dave. I can't do that...."
-kgj
The problem is a sequence of 1s and 0s. Avoid those two numbers, and you'll be fine.
When all you have is an axe, everything looks like a grindstone.
Finally, The electrical engineers are to blame. I knew my code was correct!
"If you are a dreamer, a wisher, a liar, A hope-er, a pray-er, a magic bean buyer
"Until we're sure the issues are 100 percent resolved, we're going to keep holding back shipments with the 450," IBM spokeswoman Lisa Lanspery said. "We have a policy of zero tolerance for undetected data corruption" at a customer site, she said.
:-)
so detected data corruption is just fine, then...?
How about Motorola leaving out critical instructions in the PPC603 and crippling every machine with one compared to the PPC601?
That's a very, very big reinterpretation of the facts. PPC603 machines were designed for low cost and low heat. One of the ways to do this was to further remove instructions that were not needed: legacy instructions from pre-PPC601 days that were never designed to be in the 601. They were not 'critical' and did not cripple anything. PPC603 CPUs ended up working just fine for the purpose they were designed for: cheaper and less energy-hungry CPUs.
the G3 floating point debacle where excel spreadsheets would show up errors consistently
You made a typo there. "Pentium" is not spelled "G3"
Itanium is a very new architecture. It has the potential for kicking i386 chips in the butt once it has a chance to grow up. With anything as radically new as the Itanium, there is a high probability of unexpected problems. AMD has not had this sort of problem recently because they don't have any balls. All they ever do basically amounts to minor tweaks of a stable design. Even their 64 bit extensions fall into this category.
The type of problem Intel is dealing with could very well be in a new class. I have a hunch that it has to do with either unexpected capacitive coupling (possibly related to an in-spec extreme of the process variation) or thermal transients causing timing skew. These types of phenomena are nearly impossible to model, especially if it's tied to a particular set of process deviations. That is why manufacturers do such extensive qualification testing. Unfortunately this testing cannot be done until there are enough units to test (like in the 1000s). This does not happen until the device is ready for production. Technically, this is the Pilot phase of development.
One needs to give Intel some credit for learning a lesson from the Pentium fiascos (not just the math error, but also the original (5V) 90 MHz burn-up issue). At least they are doing the right thing now. Corporations, like people, sometimes need to learn the hard way. Unfortunately, though people usually retain their lessons, corporations sometimes need to relearn them, especially when being run by greedy BODs (or board members with hidden agendas). AMD has yet to learn this particular lesson. One of these days, they will try to cover up a problem and it's not going to work. They have gotten away with some stuff already because everyone loves to hate Intel (me included, 68000 and PowerPC for me!)
Unless you're familiar with LSI semiconductor manufacturing, you should not be commenting, because you don't have a clue as to what is going on. The posts I've read so far remind me of what a class of 10 year olds would write in critiquing Joseph Conrad's "Heart of Darkness".
You really are a troll, tonight!
Please read "Sun suffers UltraSparc II cache crash headache [theregister.co.uk]"
This was a problem with the cache RAM and not the CPU itself. It was traced to a supplier (IBM), who was selling a defective product.
In terms of reliability, the Itanium II is no worse than the UltraSPARC series of chips.
There is no data to back this up. I know you don't have it, and I certainly don't have it. The only people who really have it (Intel and Sun) probably won't give it to us, so this ends here.
However, since so many people pay attention to the flaws in Intel chips, they are likely to have fewer bugs than other chips.
This is not true. Intel is pressured by time-to-market more than other suppliers, especially with respect to the Pentium line. Sun has obviously decided to delay product launches to work out issues (e.g., UltraSPARC IIIi), because their customers expect reliability over other concerns. Hardware doesn't really follow the "all bugs are shallow" mantra of the Open Source movement; we mainly have to have faith in the manufacturer's simulation and test labs.
In any event, the performance of the Itanium II is at least 1 order of magnitude greater than the UltraSPARC III and (soon) IV.
Do you even know what "order of magnitude" means? You are claiming that, if the UltraSPARC III scores 975 on something that the Itanium II would score 9750??? For a given clock, it is true that the Itanium II is faster than the US III, but by a fraction--not a factor of ten!
Also, the US IV, by definition, will be almost twice as fast as the US III for throughput, because it is two US III chips in one.
You really don't know what the facts are.
Vote in November. You won't regret it.