Intel Reveals Itanium 2 Glitch
NeoChichiri writes "News.com is running an article about glitches in Intel's Itanium 2 chips. Even though it doesn't affect all chips, they have still stopped shipments of the new 450 Servers until the problem is resolved. Apparently it has to be 'a specific set of operations in a specific sequence with specific data.' Intel is saying that it affects the 900 MHz and 1 GHz Itanium 2 chips and that it will not affect the upcoming 1.5 GHz Itanium 2 6M chips." Until the next iteration of the chip arrives though, Oliver Wendell Jones writes, "they recommend working around the problem by underclocking the processor to run at 800 MHz instead of its default 900 MHz or 1 GHz."
Underclocking is typically necessary if a part needs more voltage than the default configuration allows. This is why overclocking generally requires the converse: you can get better overclocks by increasing the voltage.
Obviously, Intel are not going to encourage people to increase the voltage of their processors in order to run them at the default speeds, as this can run the risk of thermal damage to the chip with insufficient cooling, or overly high voltages. It may however still represent an option for system administrators who are keen to retain the performance of the chip.
Agilent Technologies
ChevronTexaco
Cornell University
DreamWorks
Johns Hopkins University
Liberty Medical
National Crash Analysis Ctr.
NCSA
PNNL
Rice University
Sony Pictures ImageWorks
Wells Fargo
VeriSign, Inc.
Airbus
British Petroleum
CERN
Daimler-Chrysler
Daresbury Laboratory
Erickson Utvecklings AB
HLRS
Philips Semiconductor
Preussag
SecFinex
Triaton
University of Oslo
VTG-Lehnkering AG
Bio-Informatics Institute
Fujitsu ISOTEC, Ltd.
Ibaraki Hitachi Information Service Co., Ltd.
MarketBoomer
Mazda
Mitsubishi Heavy Industries
Mitsui Chemicals
Okazaki National Research Institute
Singapore-MIT Alliance
Subaru Research
Toyota Autobody Corp.
(and lots of others...)
because a processor executes a finite set of instructions and states which can be tested fully. the maximum number of states is equivalent to the on-board memory on the processor (8 megs of cache or so usually) and the address/data lines, which are also finite. the microcode is usually less than a couple of dozen Kbytes. ..testing everything 100% would require recreating every piece of data on the internet / keyboard / mouse / database / real world that may potentially feed into the software (usually billions of petabytes, if it's even possible to recreate it), all possible states of the files on the drive in hundreds of gigabytes, the gigabytes of memory the software has access to and the same goes for the OS it runs on, the database it talks to and everything else... support libraries etc.
on the other hand a piece of software has the equivalent number of states that fit into the compiled code, any data it accesses from the hard drive, the contents of its memory, any data inputted or outputted from the internet/real world / keyboard / mouse / database / wherever..
in other words, hardware is simple to test so it SHOULD NOT have bugs, software is hard to test so it WILL have bugs.
It's not just Intel. How about Motorola leaving out critical instructions in the PPC603 and crippling every machine with one compared to the PPC601? Or the G3 floating point debacle where Excel spreadsheets would show up errors consistently. What about AMD and their first run overheating problems? Running hot is one thing, burning up even with adequate cooling is another.
Best option is not to restrict yourself to certain "runs" but to just see the performance of a run yourself. The aforementioned PPC601 was a good example. A fantastic CPU in its first incarnation
Haha, I can't help but laugh at this. I'm assuming you haven't taken a superscalar computer architecture class. Testing CPUs is a bitch. It is just as hard as, if not harder than, testing software. Sure, there are only so many instructions in a CPU, but you have to deal with multiple combinations of instructions and the order they occur in. To fully test a modern CPU in every possible state with every possible set of inputs would take more than your lifetime. Testing the CPU is as important as designing it. People who don't know dick about it blow it off as some easy task, but it is very time consuming and can be mentally taxing to create the best set of test vectors.
There isn't much detailed information about the exact conditions that bring out the bug, but they do state that the bug is electrical, that some unspecified combination of instructions and data pattern are needed, and that reducing the clock frequency avoids the problem. I can think of several things that might cause the bug. These are just guesses.
One possibility is that there is a slow timing path in the logic that is marginally meeting the 900MHz or 1GHz clock speed. Going to 800 MHz gives the slow path more margin. This is the easy answer.
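To put rough numbers on the slow-path guess, here is a minimal sketch of timing slack at the three clock speeds mentioned in the story. The 1.10 ns path delay is a made-up figure for illustration only; Intel has not published anything like it.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(freq_mhz: float, path_delay_ns: float) -> float:
    """Timing slack: positive means the path settles before the next clock edge."""
    return clock_period_ns(freq_mhz) - path_delay_ns


# Hypothetical worst-case path delay, assumed purely for illustration.
PATH_DELAY_NS = 1.10

if __name__ == "__main__":
    for freq in (1000, 900, 800):
        period = clock_period_ns(freq)
        slack = slack_ns(freq, PATH_DELAY_NS)
        print(f"{freq:>4} MHz: period {period:.3f} ns, slack {slack:+.3f} ns")
```

Whatever the real delay is, dropping from 1 GHz to 800 MHz stretches the period from 1.00 ns to 1.25 ns, so any marginal path gets a quarter of a nanosecond of extra room.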
Another possibility is that they have some part of the chip that has insufficient metal to deliver power to the logic gates. The right combination of activity might cause enough voltage droop to cause logic errors. Slowing the clock reduces the power consumption in CMOS chips.
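The clock/power relationship here is the usual first-order CMOS dynamic power model, P = a * C * V^2 * f. A small sketch, with the activity factor, switched capacitance and voltage below being placeholder values rather than Itanium 2 figures:

```python
def dynamic_power_w(activity: float, capacitance_f: float,
                    voltage_v: float, freq_hz: float) -> float:
    """First-order CMOS dynamic power: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def relative_power(new_freq_hz: float, old_freq_hz: float) -> float:
    """Ratio of dynamic power after a clock change, all else held equal."""
    # Placeholder electrical values (assumed); they cancel out in the ratio.
    activity, capacitance_f, voltage_v = 0.2, 100e-9, 1.5
    new = dynamic_power_w(activity, capacitance_f, voltage_v, new_freq_hz)
    old = dynamic_power_w(activity, capacitance_f, voltage_v, old_freq_hz)
    return new / old


if __name__ == "__main__":
    print(f"1 GHz -> 800 MHz: {relative_power(800e6, 1000e6):.2f}x dynamic power")
    print(f"900 MHz -> 800 MHz: {relative_power(800e6, 900e6):.2f}x dynamic power")
```

Since power scales linearly with frequency in this model, the underclock cuts switching current by roughly 20% (from 1 GHz) or 11% (from 900 MHz), which would ease any supply droop. Leakage and voltage changes are ignored.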
They might have a crosstalk problem between some signals that could flip bits when the right activity and frequency are combined. Slowing the clock can shift the relative positions of signal transitions.
Eventually more details might surface, but Intel is probably keeping it quiet so that people don't write code to maliciously crash servers.
Not very uncommon, really. Here are some AMD bugs, for example. I think the deal is that the Itanium has a rather serious problem that's been undetected for a long time. Itanium based computers can cost about $20000, which is why it's a big deal. If you have such a system you probably are running something important on it.
In terms of reliability, the Itanium II is no worse than the UltraSPARC series of chips. Both Itanium and UltraSPARC face the daunting task of debugging 100+ million transistors. Ensuring that the fabricated chip is bug free is virtually impossible. So, both companies have substantial errata sheets.
The reason that Intel chips "appear" to be more error prone than other companies' chips is that Intel chips are extremely popular. So, people tend to pay far more attention to flaws in Intel chips than they do to flaws in other companies' chips. However, since so many people pay attention to the flaws in Intel chips, they are likely to have fewer bugs than other chips. The economies of scale that, say, the Pentium 4 enjoys mean that if the Pentium 4 does have a bug, then it will likely be found by someone among the gazillion users. Then, Intel will fix the problem. Economies of scale help to lower the cost of a product but also help to lower the number of bugs.
In any event, the performance of the Itanium II is at least 1 order of magnitude greater than the UltraSPARC III and (soon) IV. That performance difference is due to serious architectural mistakes in the UltraSPARC family of processors.
The 68040 bug affected quite a few LC040 machines, which made running FPU emulation on them horrid. Basically, trapping calls to the FPU in order to emulate them in software doesn't work as it should. It's b0rked, and most Apple 68LC040 machines just cannot fully emulate an FPU. That wasn't such a problem with the MacOS at the time, as it didn't need an FPU for any functions, nor did most apps.
Running a normal Linux or NetBSD on one of these machines is asking for pain, however.
I love posts that are COMPLETELY TOTALLY WRONG.
The number of states is 2 to the power of the numbers you were talking about. Even if I take the lowest number ("a couple dozen Kbytes") that you mentioned, it's 2^(2*12*1024*8) = 2^196608.
Guess what?
That's a HUGE number -- way bigger than the "billions of petabytes" you were saying is impossible to recreate for software testing. It's roughly equivalent to 10^59000 (if that somehow makes things easier for you). Of course, the "couple dozen Kbytes" is a massive underestimation of the total state of a modern CPU (100 million transistors, even just making flip-flops will give 2.5M bits of state, and for 6T SRAM more like 16M bits).
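For anyone who wants to check the arithmetic, here is a quick sketch that counts bits rather than bytes. It assumes "a couple dozen Kbytes" means 24 KiB, and that a flip-flop costs about 40 transistors (an assumed figure that reproduces the 2.5M estimate); a 6T SRAM cell is 6 transistors by definition.

```python
import math


def state_bits(num_bytes: int) -> int:
    """Number of bits of state held in num_bytes bytes of storage."""
    return num_bytes * 8


def decimal_exponent(bits: int) -> float:
    """x such that 2**bits == 10**x, computed with logs to avoid giant integers."""
    return bits * math.log10(2)


def bits_from_transistors(transistors: int, transistors_per_bit: int) -> int:
    """Bits of storage if every transistor were spent on storage cells."""
    return transistors // transistors_per_bit


if __name__ == "__main__":
    bits = state_bits(2 * 12 * 1024)  # "a couple dozen Kbytes", assumed 24 KiB
    print(f"{bits} bits of state -> 2^{bits} ~ 10^{decimal_exponent(bits):.0f} states")
    print("flip-flops (~40T each, assumed):", bits_from_transistors(100_000_000, 40))
    print("6T SRAM cells:", bits_from_transistors(100_000_000, 6))
```

Even the smallest figure in the parent post gives an exponent in the tens of thousands, so enumerating every state is out of the question long before electrical effects come into it.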
And then you have the nice problem that physics and electrical phenomena play havoc with hardware testing simulations, as opposed to software, which only has to worry about bad boolean logic.
Come talk to me next time you have to worry about alpha-particle hits changing the state of any of your code or when you care about any event with picosecond granularity (which is just about every day in hardware).
Yes, software testing has even more states to worry about, but trust me when I tell you that the hardware problem is plenty big enough to prevent exhaustive testing from being applicable. Hardware testing uses a lot of brute-force regression and detailed test planning to find and remove bugs. Software folks would do well to use such methodologies.
If it was a 7-series, it has an "I-Drive" computer, which runs Windows CE.
Nothing to see here; Move along.
How about Motorola leaving out critical instructions in the PPC603 and crippling every machine with one compared to the PPC601?
That's a very very big reinterpretation of the facts. ppc603 machines were designed for low cost low heat. One of the ways to do this was to further remove instructions that were not needed, legacy instructions from pre-PPC601, and were never designed to be in the 601. They were not 'critical' and did not cripple anything. ppc603 cpus ended up working just for the purpose they were designed for. cheaper and less energy-hungry cpus.
the G3 floating point debacle where excel spreadsheets would show up errors consistently
You made a typo there. "Pentium" is not spelled "G3"
_Every_ CPU design made by anyone has errata documents. AMD, Sun, DEC, HP, Intel and all other CPU and hardware vendors end up shipping products with a flaw that causes them to behave outside of spec. Even microcontroller chips, those that execute less than a tenth of the opcodes and have less than 1k RAM and 8k ROM, have had errata posted which must be worked around, fixed or replaced.
Also, true exhaustive testing is not just about testing all opcodes by running all of them, it is about testing all opcodes reading from all possible registers with all possible data permutations and all possible pipeline orderings.
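A back-of-the-envelope sketch of how fast that multiplies out. The opcode count, operand count and test rate below are illustrative assumptions, not figures for any real chip; 128 registers and 64-bit data are in the right ballpark for a 64-bit part like the Itanium 2.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def single_instruction_cases(opcodes: int, registers: int,
                             operands: int, data_bits: int) -> int:
    """Opcode x register choice x data pattern combinations for one instruction."""
    register_choices = registers ** operands
    data_patterns = 2 ** (data_bits * operands)
    return opcodes * register_choices * data_patterns


def sequence_cases(per_instruction: int, depth: int) -> int:
    """Ordered sequences of `depth` instructions in flight in the pipeline."""
    return per_instruction ** depth


def log10_years(cases: int, tests_per_second: float = 1e9) -> float:
    """log10 of the years needed to run every case (logs avoid float overflow)."""
    return math.log10(cases) - math.log10(tests_per_second * SECONDS_PER_YEAR)


if __name__ == "__main__":
    # Assumed toy parameters: 100 opcodes, 128 registers, 2 operands, 64-bit data.
    one = single_instruction_cases(100, 128, 2, 64)
    for depth in (1, 2, 3):
        years = log10_years(sequence_cases(one, depth))
        print(f"pipeline depth {depth}: about 10^{years:.0f} years at 1e9 tests/s")
```

The exponent grows with every extra instruction in the sequence, which is why real verification relies on directed and random test vectors instead of enumeration.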
I think circuitry on a new CPU has long passed the complexity of a city the size of Alaska. Complete exhaustive testing of hardware has long been impossible to do on new computer chips.
Remember Sun's ECC cache bug?
Fortunately, that was just a supplier issue, where IBM was giving Sun bad cache RAM. This problem certainly caused a lot of unhappy customers, but it was a straightforward resolution compared to fixing or patching the CPU itself.
I've read that the UltraSPARC CPUs themselves tend to have very low errata rates, like a half dozen or so for the UltraSPARC II compared to dozens for Intel's Pentium chips. This is probably the result of Sun's long development and testing cycles, which, in turn, cause Sun's apparent lag in recent benchmarks. Everything's a compromise, I guess.
Vote in November. You won't regret it.
While I have no particular animosity toward Intel, other than it is important for there always to be competition to push them, I do not think they need to be let off the hook. Itanium has been around a very long time. You may think of it as new technology, but that is more because of the lack of acceptance in the marketplace, not because it has only recently been released. What was happening all of these years since Itanium was initially launched?
Additionally, while the Itanium instruction set takes a different approach to those of Intel's competitors, they are not the only company introducing new CPUs. I do not remember such problems when other 64-bit CPUs with their own, new, unique instruction sets were launched by Digital, HP, IBM or Sun to name just a few. These days, the competitive landscape has been radically reduced. Digital no longer exists and its Alpha architecture is owned by Intel. HP, while it still owns its PA-RISC architecture, is trying to migrate its customers to Itanium, though it is hard to say what will really happen to PA-RISC since no one seems anxious to adopt Itanium. IBM also has picked up Itanium, so who knows what will happen to their RISC architecture? That leaves Sun, and while SPARC has always been the weaker of the RISC architectures, it seems to be the primary remaining competitor to Itanium and Intel. Of course, who knows how much longer Sun will survive as an independent company? Maybe they are the next to be gobbled up by IBM or HP, both already committed to Itanium, so what happens to SPARC?
Finally, it is hard to say exactly where AMD fits in all of this. Its 32-bit line provides excellent competition to Intel's 32-bit Pentium family (now at P4), and the AMD 64-bit architecture looks like a nice increment beyond the now very old x86 32-bit architecture. But, in terms of major pressure on future CPU architectures? I just don't know where that competition will be coming from... Maybe China, Russia, Japan, India? Places not noted for their hi-tech prowess, but with lots of experience in fabrication and lots of affordable talent?
Er, you do seem to be trolling just a bit. The US-III @ 1.2GHz achieves a base SPECint of 637, and the 1.0GHz Itanium-2 is 807. Yeah, it beats it, but trounces it? err, well, not really.
And it's a far cry from the "order of magnitude" better performance that the grandparent post claims.
What's really funny about this post is that normally I am the one bashing Sun's CPUs... *boggle*
Obligatory AMD note: the new SPEC update today shows that a 1.8GHz Opteron SPECint base is 1081.
On a price/performance basis, I would consider that to be the trouncing chip -- maybe even in the order-of-magnitude range.
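For the record, the ratios work out as below, using only the three SPECint base scores quoted in this comment. No prices are given in the thread, so this sketch covers raw scores only and says nothing about price/performance.

```python
# SPECint base scores as quoted in the comment above.
SPECINT_BASE = {
    "UltraSPARC III 1.2 GHz": 637,
    "Itanium 2 1.0 GHz": 807,
    "Opteron 1.8 GHz": 1081,
}


def speedup(chip: str, baseline: str) -> float:
    """Ratio of quoted SPECint base scores: chip relative to baseline."""
    return SPECINT_BASE[chip] / SPECINT_BASE[baseline]


if __name__ == "__main__":
    pairs = [
        ("Itanium 2 1.0 GHz", "UltraSPARC III 1.2 GHz"),
        ("Opteron 1.8 GHz", "Itanium 2 1.0 GHz"),
        ("Opteron 1.8 GHz", "UltraSPARC III 1.2 GHz"),
    ]
    for chip, baseline in pairs:
        print(f"{chip} vs {baseline}: {speedup(chip, baseline):.2f}x")
```

On these numbers the Itanium 2 leads the UltraSPARC III by roughly 27%, which is a long way short of the 10x an order of magnitude would need.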
I know that the SPARC processor in my Ultra 1 has some sort of 64-bit instruction bug that's bad enough that Sun defaulted the firmware in Ultra 1s to 32-bit mode. You have to change a jumper on the motherboard (hard to get to after opening the case) in order to reflash the firmware and run 64 bits. I believe the bug is an instruction you can call that crashes the system. Someone else can add more details; I just run NetBSD/sparc64 on the machine and it's not publicly accessible.