Intel Reveals Itanium 2 Glitch

Posted by timothy on Monday May 12, 2003 @10:21AM from the you-think-you-got-problems dept.

NeoChichiri writes "News.com is running an article about glitches in Intel's Itanium 2 chips. Even though it doesn't affect all chips, they have still stopped shipments of the new 450 Servers until the problem is resolved. Apparently it has to be 'a specific set of operations in a specific sequence with specific data.' Intel is saying that it affects the 900 MHz and 1 GHz Itanium 2 chips and that it will not affect the upcoming 1.5 GHz Itanium 2 6M chips." Until the next iteration of the chip arrives though, Oliver Wendell Jones writes, "they recommend working around the problem by underclocking the processor to run at 800 MHz instead of its default 900 MHz or 1 GHz."

23 of 249 comments

Aptly named... by hobbesmaster · 2003-05-12 10:22 · Score: 4, Funny

The Itanic 2 appears to be going down like the first...
Glitch? by Ramjet350 · 2003-05-12 10:23 · Score: 4, Interesting

Is it a glitch or did they sell chips that can't run at the rated speed?
Mmmm by thebatlab · 2003-05-12 10:25 · Score: 4, Funny

*in Homer Simpson's voice* 'mmmmmm.....Itanium 2 chops.......glazhzhzhz'
Microcode? by chill · 2003-05-12 10:28 · Score: 4, Interesting

Is this something that could be addressed by a microcode update? I've always wondered about exactly what can be done with the Kernel support for microcode updates.

On a side note -- who exactly didn't expect something like this? Intel has a history of this sort of thing -- from the 80486DX not being able to add properly, and IBM having to halt shipments of PS/2 machines; to the Pentium F00F bug and others. Buying first run Intel chips is like playing dice with your business. Give them a few production runs to work out the bugs...

--
Learning HOW to think is more important than learning WHAT to think.
Doesn't affect all chips... by gilesjuk · 2003-05-12 10:28 · Score: 4, Funny

Perhaps they should put some silver stuff over the serial number. Welcome to the Intel Itanium scratchcard lotto, those with bad chips win a new one :)
Alternative to underclocking by ethnocidal · 2003-05-12 10:32 · Score: 5, Informative

Underclocking is typically necessary if a part needs more voltage than is allowed for with the default configuration. This is why when you overclock, the converse is generally required; you can get better overclocks by increasing voltage.

Obviously, Intel are not going to encourage people to increase the voltage of their processors in order to run them at the default speeds, as this can run the risk of thermal damage to the chip with insufficient cooling, or overly high voltages. It may however still represent an option for system administrators who are keen to retain the performance of the chip.
I'm actually pretty impressed by Photar · 2003-05-12 10:33 · Score: 4, Interesting

When you consider all the bugs that come through in higher level programming where everything is object oriented and human readable, it really comes as a surprise that you don't see more bugs in hardware considering the complexity of the problem and low level nature.

--
He who knows not and knows he knows not is a wise man. He who knows not and knows not he knows not is a fool.
1. Re:I'm actually pretty impressed by AxelTorvalds · 2003-05-12 12:42 · Score: 4, Interesting
  
  All due respect, but I know how they make chips. They use software to do it and that's why they are so reliable; a human doesn't put each gate into place. It's also designed with test in mind and there are whole industries and standards surrounding that. Try to name something remotely close to a JTAG interface for software. I believe it's more reliable than software but that's really because once you etch a piece of silicon it's pretty damn hard to fix it. Don't get me wrong though, I trust the chip a lot more than the software in most cases; I expect a compiler bug long before I expect to have stumbled onto the magic code stream that doesn't compute correctly, and I expect my own errors before that.
  This kind of bug is a little different though. We're not talking about a stuck gate that only gets tickled during a single ALU operation, or retiring an instruction too early, or bigfooting a register too early, or anything like that. We're talking about clocking issues and fundamental timing issues in Intel's "server grade" platform. There are accepted standards and practices for how aggressive to be; some vendors can tell you with amazing detail how reliable their chips are, in what conditions, etc. With clocks in particular some vendors can be picky. I've seen hard hitters scope up boxes and refuse to support hardware they sold because it was clocked out of spec (think about the edge of a clock and clock quality: a 1.2 GHz clock isn't enough, it has to actually achieve the level of the clock before it switches back, and it takes time for the clock to transition). It sounds like Intel is either ignoring them or trying to write their own book, or the IA64 is a bigger disaster than anyone there wants to even hint at. There is a fairly limited class of errors where underclocking the chip fixes the problem, and most of those errors are related to the chip being aggressively clocked to begin with. It's ironic: on IBM's POWER4 line of processors they added extra cache room for parity (at the expense of potential performance) and made the leads more beefy (again at the expense of higher clock speeds) because the platform is a server platform that places reliability at a premium. It sounds like Intel has been making PC chips too long and isn't ready for server grade chips.
  Their party line has been that they will keep working at it until it's ready, they aren't expecting it to move a lot of chips, etc. Right now they have walked down a road where they have invested billions? (at least hundreds of millions) in an unproven technology. They have crossed the line to the point that there won't be $1500 IA64 products for years and years. They have piped it as a server grade platform. And it underachieves in every area and hasn't taken the world by storm nearly as much as they said. So bad is it that HP, their blood brother in that mess, has continued the PA-RISC and Alpha lines past the point they claimed when they originally adopted the IA64. The only reason I could imagine for them to clock it that aggressively would be that it's the only way to make it perform remotely like they have claimed it would. I'm not going to guess about Intel's dirty laundry but I'd guess the stakes are a little higher than it would look on the surface for the IA64, either that or there are some incompetents running the show.
Ironic? by Jonathan+the+Nerd · 2003-05-12 10:34 · Score: 5, Interesting

Does anyone else find it ironic that when Intel makes one mistake in a processor, everyone jumps on them for making a bad product, but software companies can sell products with thousands of bugs in them and people accept this as normal? Sure, we complain about buggy software, but I don't think anyone here expects any software to be completely bug-free. Why are Intel and other chip manufacturers held to such a high standard? Or, more importantly, why are software companies not held to the same high standards? If Intel and AMD can make incredibly complex processors that are (usually) completely bug-free, why can't any software company in the world make any product that even comes close to being free of defects?

--
Disclaimer: The opinions expressed are not necessarily my own, as I've not yet had my medication today.
1. Re:Ironic? by cgori · 2003-05-12 11:14 · Score: 4, Informative
  
  I love posts that are COMPLETELY TOTALLY WRONG.
  
  The number of states is 2 to the power of the numbers you were talking about. Even if I take the lowest number ("a couple dozen Kbytes") that you mentioned, it's 2^(2*12*1024*8) = 2^196608.
  
  Guess what?
  
  That's a HUGE number -- way bigger than the "billions of petabytes" you were saying is impossible to recreate for software testing. It's roughly equivalent to 10^59000 (if that somehow makes things easier for you). Of course, the "couple dozen Kbytes" is a massive underestimation of the total state of a modern CPU (100 million transistors, even just making flip-flops will give 2.5M bits of state, and for 6T SRAM more like 16M bits).
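The conversion from a power of two to a power of ten in the comment above is just a multiplication by log10(2). A minimal sketch in Python, assuming "a couple dozen Kbytes" means 2*12*1024 bytes as written in the comment; the function name is made up for illustration:

```python
import math


def decimal_exponent(power_of_two: int) -> float:
    """Return x such that 2**power_of_two == 10**x."""
    return power_of_two * math.log10(2)


# Assumed size taken from the comment: "a couple dozen Kbytes".
state_bytes = 2 * 12 * 1024
state_bits = state_bytes * 8  # every bit doubles the number of states

print("bits of state:", state_bits)
print("2^bits is about 10^%d" % round(decimal_exponent(state_bits)))
# For comparison, an exponent equal to the approximate byte count.
print("2^24000 is about 10^%d" % round(decimal_exponent(24000)))
```

An exponent of 24000 lands near 10^7225, and counting all 196608 bits gives roughly 10^59185. Either way the state space is far too large to enumerate, which is the point being made.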
  
  And then you have the nice problem that physics and electrical phenomena play havoc with hardware testing simulations, as opposed to software, which only has to worry about bad boolean logic.
  
  Come talk to me next time you have to worry about alpha-particle hits changing the state of any of your code or when you care about any event with picosecond granularity (which is just about every day in hardware).
  
  Yes, software testing has even more states to worry about, but trust me when I tell you that the hardware problem is plenty big enough to prevent exhaustive testing from being applicable. Hardware testing uses a lot of brute-force regression and detailed test planning to find and remove bugs. Software folks would do well to use such methodologies.
2. Re:Ironic? by Bombcar · 2003-05-12 11:30 · Score: 4, Insightful
  
  The big problem is when something fails SILENTLY! That's what the BSOD and the Kernel oopsies are for! If the system has corrupt data, it is very very bad, worse than losing data. So if the hardware has a bug, then it will pass corrupt data around, and then things fail. Google around for what happens with bad RAM, and learn about Happy Fun Bugs!
  
  --
  Fellowship 9/11
How about others (AMD, Mot, IBM) by binaryDigit · 2003-05-12 10:35 · Score: 5, Interesting

who exactly didn't expect something like this? Intel has a history of this sort of thing

Of course when it happens to Intel, then EVERYBODY knows about it. My question is, how prevalent is this sort of thing throughout the CPU industry? Anyone know of other "mistakes" by the other major players? It's hard to imagine that only Intel makes these kinds of goofs, esp. with the complexity of today's chips. As an example, wouldn't Mot's failure to scale up the G4 PPC chips be considered an "error"? They just caught it early enough not to ship any chips and say "oh, we're sorry, our G4's won't go as fast as we originally stated, wait another year and a half or so and we'll get it all sorted out". Didn't they also do a similar thing with the 68040?
1. Re:How about others (AMD, Mot, IBM) by vadim_t · 2003-05-12 10:59 · Score: 5, Informative
  
  Not very uncommon, really. Here are some AMD bugs, for example. I think the deal is that the Itanium has a rather serious problem that's been undetected for a long time. Itanium based computers can cost about $20000, which is why it's a big deal. If you have such a system you probably are running something important on it.
Hey Intel! by craenor · 2003-05-12 10:43 · Score: 4, Funny

I have about 6 years experience in Quality Assurance, with emphasis on electronics, manufacturing processes and attention to detail.

You know...if you're looking for anyone that is.
Re:Underclock? by Dr+Caleb · 2003-05-12 10:53 · Score: 4, Funny

Why not just buy the lower-clocked CPU's then?
Your geek membership has been revoked. Hand in your pocket protector at the door. OutOutOut!

--
"History doesn't repeat itself, but it does rhyme." Mark Twain
Possibly timing or power related by dprice · 2003-05-12 10:56 · Score: 4, Informative

There isn't much detailed information about the exact conditions that bring out the bug, but they do state that the bug is electrical, that some unspecified combination of instructions and data pattern is needed, and that reducing the clock frequency avoids the problem. I can think of several things that might cause the bug. These are just guesses.

One possibility is that there is a slow timing path in the logic that is marginally meeting the 900MHz or 1GHz clock speed. Going to 800 MHz gives the slow path more margin. This is the easy answer.
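The slow-path guess can be put into numbers. This is only a sketch: the 1.05 ns path delay is a made-up value chosen for illustration, not anything Intel has published, and the function names are hypothetical.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(freq_mhz: float, path_delay_ns: float) -> float:
    """Timing margin: positive means the path settles within one cycle."""
    return clock_period_ns(freq_mhz) - path_delay_ns


# Hypothetical worst-case delay of one marginal logic path (assumption).
PATH_DELAY_NS = 1.05

for freq in (1000, 900, 800):
    margin = slack_ns(freq, PATH_DELAY_NS)
    status = "ok" if margin > 0 else "FAILS"
    print(f"{freq} MHz: period {clock_period_ns(freq):.3f} ns, "
          f"slack {margin:+.3f} ns ({status})")
```

With these assumed numbers the path misses timing at 1 GHz, barely makes it at 900 MHz, and has comfortable margin at 800 MHz, which is the shape of failure the workaround would fit.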

Another possibility is that they have some part of the chip that has insufficient metal to deliver power to the logic gates. The right combination of activity might cause enough voltage droop to cause logic errors. Slowing the clock reduces the power consumption in CMOS chips.

They might have a crosstalk problem between some signals that could flip bits when the right activity and frequency are combined. Slowing the clock can shift the relative positions of signal transitions.
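The remark that slowing the clock reduces CMOS power follows from the usual first-order relation for switching power, P = a * C * V^2 * f. A rough sketch, with the activity factor, capacitance and voltage picked arbitrarily for illustration (they are not Itanium 2 figures):

```python
def dynamic_power_w(activity: float, capacitance_f: float,
                    voltage_v: float, freq_hz: float) -> float:
    """First-order CMOS switching power: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


# All three values below are assumptions for the sake of the example.
ACTIVITY = 0.2          # fraction of capacitance switched per cycle
CAPACITANCE_F = 400e-9  # total switched capacitance in farads
VOLTAGE_V = 1.5         # supply voltage in volts

p_1000 = dynamic_power_w(ACTIVITY, CAPACITANCE_F, VOLTAGE_V, 1000e6)
p_800 = dynamic_power_w(ACTIVITY, CAPACITANCE_F, VOLTAGE_V, 800e6)

print(f"1 GHz:   {p_1000:.1f} W")
print(f"800 MHz: {p_800:.1f} W ({p_800 / p_1000:.0%} of the 1 GHz figure)")
```

At a fixed voltage the switching power scales linearly with frequency, so dropping from 1 GHz to 800 MHz cuts it by a fifth and draws correspondingly less current through any undersized power rails.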

Eventually more details might surface, but Intel is probably keeping it quiet so that people don't write code to maliciously crash servers.
HAL-9000 on Itanium by handy_vandal · 2003-05-12 10:57 · Score: 4, Funny

"Open the Itanium register sets, HAL."

"I'm sorry, Dave. I can't do that ...."

--
-kgj
Re:Deja Vu by Michael_Burton · 2003-05-12 11:06 · Score: 5, Funny

The problem is a sequence of 1s and 0s. Avoid those two numbers, and you'll be fine.

--
When all you have is an axe, everything looks like a grindstone.
Mwhahaha by nate+nice · 2003-05-12 11:09 · Score: 5, Funny

Finally, The electrical engineers are to blame. I knew my code was correct!

--
"If you are a dreamer, a wisher, a liar, A hope-er, a pray-er, a magic bean buyer ..."
zero tolerance for undetected corruption...? by bani · 2003-05-12 11:11 · Score: 4, Funny

"Until we're sure the issues are 100 percent resolved, we're going to keep holding back shipments with the 450," IBM spokeswoman Lisa Lanspery said. "We have a policy of zero tolerance for undetected data corruption" at a customer site, she said.

so detected data corruption is just fine, then...? :-)
Not ppc603s by questamor · 2003-05-12 11:49 · Score: 4, Informative

How about Motorola leaving out critical instructions in the PPC603 and crippling every machine with one compared to the PPC601?

That's a very very big reinterpretation of the facts. ppc603 machines were designed for low cost and low heat. One of the ways to do this was to further remove instructions that were not needed: legacy instructions from pre-PPC601 that were never designed to be in the 601. They were not 'critical' and did not cripple anything. ppc603 CPUs ended up working just for the purpose they were designed for: cheaper and less energy-hungry CPUs.

the G3 floating point debacle where excel spreadsheets would show up errors consistently

You made a typo there. "Pentium" is not spelled "G3"
Geesh, Give Intel a Break by Mooncaller · 2003-05-12 11:51 · Score: 4, Interesting

Itanium is a very new architecture. It has the potential for kicking i386 chips in the butt once it has a chance to grow up. With anything as radically new as the Itanium, there is a high probability of unexpected problems. AMD has not had this sort of problem recently because they don't have any balls. All they ever do basically amounts to minor tweaks of a stable design. Even their 64 bit extensions fall into this category.

The type of problem Intel is dealing with could very well be in a new class. I have a hunch that it has to do with either unexpected capacitive coupling (possibly related to an in-spec extreme of the process variation) or thermal transients causing timing skew. These types of phenomena are nearly impossible to model, especially if it's tied to a particular set of process deviations. That is why manufacturers do such extensive qualification testing. Unfortunately this testing cannot be done until there are enough units to test (like in the 1000s). This does not happen until the device is ready for production. Technically, this is the Pilot phase of development.

One needs to give Intel some credit for learning a lesson from the Pentium fiascos (not just the math error, but also the original (5V) 90 MHz burn-up issue). At least they are doing the right thing now. Corporations, like people, sometimes need to learn the hard way. Unfortunately, though people usually retain their lessons, corporations sometimes need to relearn them, especially when being run by greedy BODs (or board members with hidden agendas). AMD has yet to learn this particular lesson. One of these days, they will try to cover up a problem and it's not going to work. They have gotten away with some stuff already because everyone loves to hate Intel (me included, 68000 and PowerPC for me!)

Unless you're familiar with LSI semiconductor manufacturing, you should not be commenting, because you don't have a clue as to what is going on. The posts I've read so far remind me of what a class of 10 year olds would write in critiquing Joseph Conrad's "Heart of Darkness".
Re:Problem is the Hardware (re:Microcode?) by Mr.+Piddle · 2003-05-12 13:01 · Score: 5, Insightful

You really are a troll, tonight!

Please read "Sun suffers UltraSparc II cache crash headache [theregister.co.uk]"

This was a problem with the cache RAM and not the CPU itself. It was traced to a supplier (IBM), who was selling a defective product.

In terms of reliability, the Itanium II is no worse than the UltraSPARC series of chips.

There is no data to back this up. I know you don't have it, and I certainly don't have it. The only people who really have it (Intel and Sun) probably won't give it to us, so this ends here.

However, since so many people pay attention to the flaws in Intel chips, they are likely to have less bugs than other chips.

This is not true. Intel is pressured by a time-to-market more than other suppliers, especially with respect to the Pentium line. Sun has obviously decided to delay product launches to work out issues (e.g., UltraSPARC IIIi), because their customers expect reliability over other concerns. Hardware doesn't really follow the "all bugs are shallow" mantra of the Open Source movement, we mainly have to have faith in the manufacturer's simulation and test labs.

In any event, the performance of the Itanium II is at least 1 order of magnitude greater than the UltraSPARC III and (soon) IV.

Do you even know what "order of magnitude" means? You are claiming that, if the UltraSPARC III scores 975 on something, the Itanium II would score 9750??? For a given clock, it is true that the Itanium II is faster than the US III, but by a fraction--not a factor of ten!

Also, the US IV, by definition, will be almost twice as fast as the US III for throughput, because it is two US III chips in one.

You really don't know what the facts are.

--
Vote in November. You won't regret it.