Slashdot Mirror


Intel Reveals Itanium 2 Glitch

NeoChichiri writes "News.com is running an article about glitches in Intel's Itanium 2 chips. Even though it doesn't affect all chips, they have still stopped shipments of the new 450 Servers until the problem is resolved. Apparently it has to be 'a specific set of operations in a specific sequence with specific data.' Intel is saying that it affects the 900 MHz and 1 GHz Itanium 2 chips and that it will not affect the upcoming 1.5 GHz Itanium 2 6M chips." Until the next iteration of the chip arrives, though, Oliver Wendell Jones writes, "they recommend working around the problem by underclocking the processor to run at 800 MHz instead of its default 900 MHz or 1 GHz."

12 of 249 comments

  1. Glitch? by Ramjet350 · · Score: 4, Interesting

    Is it a glitch or did they sell chips that can't run at the rated speed?

  2. Microcode? by chill · · Score: 4, Interesting

    Is this something that could be addressed by a microcode update? I've always wondered about exactly what can be done with the Kernel support for microcode updates.

    On a side note -- who exactly didn't expect something like this? Intel has a history of this sort of thing -- from the 80486DX not being able to add properly, and IBM having to halt shipments of PS/2 machines; to the Pentium F00F bug and others. Buying first run Intel chips is like playing dice with your business. Give them a few production runs to work out the bugs...

    --
    Learning HOW to think is more important than learning WHAT to think.
  3. Deja Vu by Jason1729 · · Score: 2, Interesting

    Apparently it has to be 'a specific set of operations in a specific sequence with specific data.'

    This sounds similar to the way they described the floating point divide error in the original Pentium. How long until they start giving odds on the chances of someone seeing the problem in normal use?

    Jason
    ProfQuotes

  4. Underclock? by phorm · · Score: 3, Interesting

    "they recommend working around the problem by underclocking the processor to run at 800 MHz instead of its default 900 MHz or 1 GHz."

    Why not just buy the lower-clocked CPUs then? Will Intel replace the crap chips when a revision with a fix comes around?

    "If the customer feels it's the right solution, we'll exchange processors with ones that aren't affected," she said. Intel has developed a simple software test that can determine whether a chip is affected.

    Meaning what? Lower-end chips that aren't affected, or a fixed version of the same chip? If it's the same chip, who wouldn't think it is the right solution? The article doesn't indicate whether the problem is actually solved either, only that it seems to be somewhat of an anomaly that doesn't affect all chips.

    Not a good day for Intel, and probably another reason why you don't immediately need that "Newest on the shelf" CPU, whether for your home machine or a server. Besides, by the time this chip is assuredly fixed, a faster revision will probably be out at a comparable price.

  5. I'm actually pretty impressed by Photar · · Score: 4, Interesting

    When you consider all the bugs that come through in higher level programming where everything is object oriented and human readable, it really comes as a surprise that you don't see more bugs in hardware considering the complexity of the problem and low level nature.

    --
    He who knows not and knows he knows not is a wise man. He who knows not and knows not he knows not is a fool.
    1. Re:I'm actually pretty impressed by AxelTorvalds · · Score: 4, Interesting
      With all due respect, I know how they make chips. They use software to do it, and that's why they are so reliable; a human doesn't put each gate into place. Chips are also designed with test in mind, and there are whole industries and standards surrounding that. Try to name something remotely close to a JTAG interface for software. I believe hardware is more reliable than software, but that's really because once you etch a piece of silicon it's pretty damn hard to fix it. Don't get me wrong, though: I trust the chip a lot more than the software in most cases. I expect a compiler bug long before I expect to have stumbled onto the magic code stream that doesn't compute correctly, and I expect my own errors before that.

      This kind of bug is a little different, though. We're not talking about a stuck gate that only gets tickled during a single ALU operation, or retiring an instruction too early, or bigfooting a register too early, or anything like that. We're talking about clocking issues and fundamental timing issues in Intel's "server grade" platform. There are accepted standards and practices for how aggressive to be; some vendors can tell you in amazing detail how reliable their chips are, in what conditions, etc. With clocks in particular some vendors can be picky. I've seen hard hitters scope up boxes and refuse to support hardware they sold because it was clocked out of spec. (Think about the edge of a clock and clock quality: a 1.2 GHz clock isn't enough; it has to actually achieve the level of the clock before it switches back, and it takes time for the clock to transition.) It sounds like Intel is either ignoring those standards or trying to write their own book, or the IA64 is a bigger disaster than anyone there wants to even hint at. There is a fairly limited class of errors where underclocking the chip fixes the problem, and most of those errors are related to the chip being aggressively clocked to begin with. It's ironic: on IBM's POWER4 line of processors they added extra cache room for parity (at the expense of potential performance) and made the leads more beefy (again at the expense of higher clock speeds) because the platform is a server platform that places reliability at a premium. It sounds like Intel has been making PC chips too long and isn't ready for server grade chips.

      Their party line has been that they will keep working at it until it's ready, that they aren't expecting it to move a lot of chips, etc. Right now they have walked down a road where they have invested billions? (at least hundreds of millions) in an unproven technology. They have crossed the line to the point that there won't be $1500 IA64 products for years and years. They have piped it as a server grade platform. And it underachieves in every area and hasn't taken the world by storm nearly as much as they said. It is so bad that HP, their blood brother in that mess, has continued the PA-RISC and Alpha lines past the point they claimed when they originally adopted the IA64. The only reason I could imagine for them to clock it that aggressively is that it's the only way to make it perform remotely like they have claimed it would. I'm not going to guess about Intel's dirty laundry, but I'd guess the stakes are a little higher for the IA64 than they look on the surface; either that or there are some incompetents running the show.

  6. Ironic? by Jonathan+the+Nerd · · Score: 5, Interesting

    Does anyone else find it ironic that when Intel makes one mistake in a processor, everyone jumps on them for making a bad product, but software companies can sell products with thousands of bugs in them and people accept this as normal? Sure, we complain about buggy software, but I don't think anyone here expects any software to be completely bug-free. Why are Intel and other chip manufacturers held to such a high standard? Or, more importantly, why are software companies not held to the same high standards? If Intel and AMD can make incredibly complex processors that are (usually) completely bug-free, why can't any software company in the world make any product that even comes close to being free of defects?

    --
    Disclaimer: The opinions expressed are not necessarily my own, as I've not yet had my medication today.
    1. Re:Ironic? by Entropy_ah · · Score: 2, Interesting

      Spot on.
      I'd like to add another problem with testing. How do you know if the processor is giving the correct answer??? Work it out by hand??? Test it on another processor that may or may not have the same design flaws??

      --
      my other penis is a vagina
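
One common answer to the "how do you know the answer is correct?" question is to compare the hardware result against an independent reference computed a different way. The sketch below is only an illustration, not Intel's actual test: it checks a floating point quotient against exact rational arithmetic from Python's `fractions` module. The function name `quotient_is_plausible` is made up for this example, and the operands in the usage lines are the well-known pair that exposed the original Pentium divide error. It assumes finite, normal inputs and Python 3.9+ (for `math.ulp`).

```python
from fractions import Fraction
import math


def quotient_is_plausible(x: float, y: float, q: float) -> bool:
    """Return True if q is within half an ulp of the exact value of x / y.

    The exact quotient is computed with rational arithmetic, which does not
    use the floating point divider being checked. A correctly rounded
    quotient always passes; a grossly wrong one fails.
    """
    exact = Fraction(x) / Fraction(y)          # exact rational reference
    error = abs(Fraction(q) - exact)           # exact distance from reference
    return error <= Fraction(math.ulp(q)) / 2  # allow normal rounding error


# Hypothetical usage: check this machine's own divider on one operand pair.
x, y = 4195835.0, 3145727.0
print(quotient_is_plausible(x, y, x / y))

# A result that is off in the sixth decimal place is rejected.
print(quotient_is_plausible(x, y, (x / y) * (1 + 1e-6)))
```

This only moves the trust problem: the reference path still runs on the same processor, just through different (integer) operations, so it cannot rule out a flaw shared by both paths.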
  7. How about others (AMD, Mot, IBM) by binaryDigit · · Score: 5, Interesting

    who exactly didn't expect something like this? Intel has a history of this sort of thing

    Of course when it happens to Intel, then EVERYBODY knows about it. My question is, how prevalent is this sort of thing throughout the CPU industry? Anyone know of other "mistakes" by the other major players? It's hard to imagine that only Intel makes these kinds of goofs, esp. with the complexity of today's chips. As an example, wouldn't Mot's failure to scale up the G4 PPC chips be considered an "error"? They just caught it early enough not to ship any chips and say "oh, we're sorry, our G4s won't go as fast as we originally stated, wait another year and a half or so and we'll get it all sorted out". Didn't they also do a similar thing with the 68040?

    1. Re:How about others (AMD, Mot, IBM) by Photar · · Score: 2, Interesting

      I agree, I'd like to see some stats on "bugs" in hardware.

      I think Moto's problem is they're too busy making cell phones to worry about PPC.

      --
      He who knows not and knows he knows not is a wise man. He who knows not and knows not he knows not is a fool.
  8. Re:zero tolerance for undetected corruption...? by Anonymous Coward · · Score: 1, Interesting

    Detected data corruption can possibly be fixed, such as a single-bit error on an ECC-protected memory bus. And if the ECC error is a multibit error and unfixable, the OS is notified that the read got corrupted, and the OS is free to try again or shut down or do whatever it thinks is reasonable. That's better than, say, getting the wrong stock buy order for a financial transaction.

    So, to a certain extent, detected data corruption is fine, because server class chipsets are designed to be able to correct or handle some types of mistakes.

  9. Geesh, Give Intel a Break by Mooncaller · · Score: 4, Interesting

    Itanium is a very new architecture. It has the potential for kicking i386 chips in the butt once it has a chance to grow up. With anything as radically new as the Itanium, there is a high probability of unexpected problems. AMD has not had this sort of problem recently because they don't have any balls. All they ever do basically amounts to minor tweaks of a stable design. Even their 64-bit extensions fall into this category.

    The type of problem Intel is dealing with could very well be in a new class. I have a hunch that it has to do with either unexpected capacitive coupling (possibly related to an in-spec extreme of the process variation) or thermal transients causing timing skew. These types of phenomena are nearly impossible to model, especially if the problem is tied to a particular set of process deviations. That is why manufacturers do such extensive qualification testing. Unfortunately this testing cannot be done until there are enough units to test (like in the 1000s). This does not happen until the device is ready for production. Technically, this is the Pilot phase of development.

    One needs to give Intel some credit for learning a lesson from the Pentium fiascos (not just the math error, but also the original (5V) 90 MHz burn-up issue). At least they are doing the right thing now. Corporations, like people, sometimes need to learn the hard way. Unfortunately, though people usually retain their lessons, corporations sometimes need to relearn them, especially when being run by greedy BODs (or board members with hidden agendas). AMD has yet to learn this particular lesson. One of these days, they will try to cover up a problem and it's not going to work. They have gotten away with some stuff already because everyone loves to hate Intel (me included, 68000 and PowerPC for me!)

    Unless you're familiar with LSI semiconductor manufacturing, you should not be commenting, because you don't have a clue as to what is going on. The posts I've read so far remind me of what a class of 10-year-olds would write in critiquing Joseph Conrad's "Heart of Darkness".