Intel Reveals Itanium 2 Glitch
NeoChichiri writes "News.com is running on an article about glitches in Intel's Itanium 2 chips. Even though it doesn't affect all chips, they have still stopped shipments of the new 450 Servers until the problem is resolved. Apparently it has to be 'a specific set of operations in a specific sequence with specific data.' Intel is saying that affects the 900MHz and 1 GHz Itanium 2 chips and that it will not affect the upcoming 1.5 GHz Itanium 2 6M chips." Until the next iteration of chip arrives though, Oliver Wendell Jones writes, "they recommend working around the problem by underclocking the processor to run at 800 MHz instead of its default 900 MHz or 1 GHz."
Underclocking is typically necessary if a part needs more voltage than is allowed for with the default configuration. This is why when you overclock, the converse is generally required; you can get better overclocks by increasing voltage.
Obviously, Intel are not going to encourage people to increase the voltage of their processors in order to run them at the default speeds, as this can run the risk of thermal damage to the chip with insufficient cooling, or overly high voltages. It may however still represent an option for system administrators who are keen to retain the performance of the chip.
There isn't much detailed information about the exact conditions that bring out the bug, but they do state that the bug is electrical, that some unspecified combination of instructions and data pattern are needed, and that reducing the clock frequency avoids the problem. I can think of several things that might cause the bug. These are just guesses.
One possibility is that there is a slow timing path in the logic that is marginally meeting the 900MHz or 1GHz clock speed. Going to 800 MHz gives the slow path more margin. This is the easy answer.
Another possibility is that they have some part of the chip that has insufficient metal to deliver power to the logic gates. The right combination of activity might cause enough voltage droop to cause logic errors. Slowing the clock reduces the power consumption in CMOS chips.
They might have a crosstalk problem between some signals that could flip bits when the right activity and frequency are combined. Slowing the clock can shift the relative positions of signal transitions.
Eventually more details might surface, but Intel is probably keeping it quiet so that people don't write code to maliciously crash servers.
Not very uncommon, really. Here are some AMD bugs, for example. I think the deal is that the Itanium has a rather serious problem that's been undetected for a long time. Itanium based computers can cost about $20000, which is why it's a big deal. If you have such a system you probably are running something important on it.
I love posts that are COMPLETELY TOTALLY WRONG.
The number of states is 2 to the power of the numbers you were talking about. Even if I take the lowest number ("a couple dozen Kbytes") that you mentioned, it's 2^2*12*1024*8 = 2^24000.
Guess what?
That's a HUGE number -- way bigger than the "billions of petabytes" you were saying is impossible to recreate for software testing. It's roughly equivalent to 10^7200 (if that somehow makes things easier for you). Of course, the "couple dozen Kbytes" is a massive underestimation of the total state of a modern CPU (100 million transistors, even just making flip-flops will give 2.5M bits of state, and for 6T SRAM more like 16M bits).
And then you have the nice problem that physics and electrical phenomena play havoc with hardware testing simulations, as opposed to software, which only has to worry about bad boolean logic.
Come talk to me next time you have to worry about alpha-particle hits changing the state of any of your code or when you care about any event with picosecond granularity (which is just about every day in hardware).
Yes, software testing has even more states to worry about, but trust me when I tell you that the hardware problem is plenty big enough to prevent exhaustive testing from being applicable. Hardware testing uses a lot of brute-force regression and detailed test planning to find and remove bugs. Software folks would do well to use such methodologies.
ow about Motorola leaving out critical instructions in the PPC603 and crippling every machine with one compared to the PPC601?
That's a very very big reinterpretation of the facts. ppc603 machines were designed for low cost low heat. One of the ways to do this was to further remove instructions that were not needed, legacy instructions from pre-PPC601, and were never designed to be in the 601. They were not 'critical' and did not cripple anything. ppc603 cpus ended up working just for the purpose they were designed for. cheaper and less energy-hungry cpus.
the G3 floating point debacle where excel spreadsheets would show up errors consistently
You made a typo there. "Pentium" is not spelled "G3"