Intel's Nehalem EX To Gain Error Correction
angry tapir writes "Intel's eight-core Nehalem EX server processor will include a technology derived from its high-end Itanium chips that helps to reduce data corruption and ensure reliable server performance. The processor will include an error correction feature called MCA Recovery, which will detect and fix errors that could otherwise cause systems to crash — it will be able to detect system errors originating in the CPU or system memory and work with the operating system to correct them." Update: 05/27 19:11 GMT by T : Dave Altavilla suggests also Hot Hardware's coverage of the new chip, which includes quite a bit more information.
This is nice and all; but I, for one, will not be satisfied until Intel releases a CPU that does what I mean, not what I say.
Error correction on an x86 chip?
Sweet. Now all those high-end server applications running on x86s that need great uptime can finally join the big boys. [rolls eyes].
I'm just not sure of the utility here -- I RTFA, but I'm still not clear on why Intel would cannibalize Itanium sales (new release delayed again) by offering error correction on Nehalem chips. Is the demand for x86 Server chips that high? I thought anyone requiring 5 nines (or anything close to it) would never consider using x86?
Can someone with more knowledge of the high-end server market please clarify?
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
This will fix many errors affecting the processor itself (new manufacturing processes make transistors quite vulnerable to interference and aging). ECC will still be needed for correcting errors affecting data while it is stored in main memory.
Parity will be needed for protecting caches (possibly ECC will be used in the future). Checksums for data on the hard drive. CRCs for packets on the network. And so on...
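For anyone fuzzy on the difference between the parity and ECC mentioned above, here is a minimal sketch. It uses a textbook Hamming(7,4) code as a stand-in; real memory ECC uses wider SECDED codes, and nothing here reflects how Nehalem EX actually implements it. The function names are made up for illustration.

```python
def parity_bit(bits):
    """Even-parity bit for a list of 0/1 values (assumed helper name)."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome is the 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]

# Parity: a single flipped bit is noticed, but nothing says which bit it was.
stored_parity = parity_bit(data)
damaged = [1, 0, 0, 1]
parity_detects_error = parity_bit(damaged) != stored_parity

# Hamming: the same kind of single-bit flip is located and repaired.
codeword = hamming74_encode(data)
corrupted = list(codeword)
corrupted[4] ^= 1  # flip codeword position 5
recovered, error_position = hamming74_decode(corrupted)
```

Parity can only say "something is wrong"; the Hamming syndrome also says where, which is what makes correction possible. Two flipped bits would fool this small code, which is why real designs add an extra bit for double-error detection.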
No. ECC only corrects certain issues in the memory. It cannot help with memory controller errors, nor with register or TLB errors.
The article seemed pretty light on details of what MCA Recovery actually does. I found this presentation in PDF format that seems to go into some more useful detail about what this is. It's not just ECC to repair single-bit errors (although that is part of it, apparently). It also includes features to recover from errors that cannot simply be corrected. For example it includes a mechanism to notify the OS of the details of an uncorrectable error, so that it could presumably re-load a page full of program code from disk, or terminate an application if its data has been corrupted, instead of shutting down the whole machine.
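To make that concrete, here is a rough sketch of the kind of decision an OS could make once the hardware tells it which page is bad. This is my own simplification, not anything taken from the presentation or from a real kernel; `PoisonedPage`, `Action`, and `choose_recovery` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    RELOAD_FROM_DISK = auto()  # drop the page and fault it back in
    KILL_PROCESS = auto()      # lose one application, keep the machine up
    PANIC = auto()             # nothing safe left to do


@dataclass
class PoisonedPage:
    """What the OS is assumed to know about the page hit by the error."""
    kernel_owned: bool
    clean_copy_on_disk: bool  # e.g. unmodified program code


def choose_recovery(page: PoisonedPage) -> Action:
    """Pick the least disruptive response to an uncorrectable memory error."""
    if page.clean_copy_on_disk:
        # An identical copy exists elsewhere, so the corrupted one can be discarded.
        return Action.RELOAD_FROM_DISK
    if not page.kernel_owned:
        # The data is gone, but only one process depended on it.
        return Action.KILL_PROCESS
    # Corrupted kernel state with no backup: stopping is safer than continuing.
    return Action.PANIC
```

The point is that only the last case needs to take the whole box down, whereas without recovery support every uncorrectable error does.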
I'm a bit surprised this is only seeing the light now: as we get smaller and faster, the number of errors observed goes up amazingly.
Back in the stone age, Cray computers didn't even have parity memory, partly because they were willing to re-run programs but mostly because errors were unlikely. Cray himself famously said "parity is for farmers".
These days, errors are very common, and I'm literally amazed that x86s don't have better-than-ECC error detection and correction. All the commercial Unix vendors have them.
--dave
davecb@spamcop.net
These days, errors are very common, and I'm literally amazed that x86s don't have better-than-ECC error detection and correction. All the commercial Unix vendors have them.
Intel's been trying to 'protect' the market for Itanium - those CPUs have had it for years, probably from day 1. HP definitely markets MCA as a big feature of their Itanium-based systems.
If AMD were smart, they would have incorporated it into their Opteron line just like they did x64 to cut Intel off at the knees.
When information is power, privacy is freedom.
State of the non-mainframe art with regard to RAS right now is ECC RAM with mirroring, parity cache, ECC e-cache, hashes that detect and fix multiple-bit errors for storage end to end, CRC (Ethernet) and cksum (TCP, UDP) (but can you trust the NIC offloading engine?), instruction retry, and fp scrubbing, in addition to what has been around for the last five years or so.
*sniff* x86 is getting to be so grown up *sniff* I remember when it was just a little 16 bit chip.
And Nehalem is an all in-order design, so they can scale out to very large numbers of cores or register-and-decoder sets on a single chip. That helps offset the huge bottleneck of trying to go to molasses-slow main memory on every cache miss, by allowing another thread to run. Mind you, I'd want enough cores to host 128 threads in order to at least match the new SPARCs, but that can come along later (;-))
You must be thinking of Atom, because Nehalem is definitely an out-of-order processor and not particularly small either. It does use SMT (and a big instruction window) to hide memory latency (and to keep its 4-wide execution engine busy), but that's having multiple threads running on the same core.
Frankly, while Niagara is a very interesting approach that I think will only become more popular in the future (and Atom is theoretically capable of doing the same thing, though right now it's just embedded stuff), for now there are many server apps where single-thread performance still matters greatly, and for that out-of-order is the way to go (as Intel found out the hard way by trying every trick in the book to make an in-order machine fast enough).
The enemies of Democracy are
Imagine a Beowulf cluster of these!
Chas - The one, the only.
THANK GOD!!!
The original Opteron had L1 ECC, it just wasn't correctable if encountered on a read or write (there was a scrubber that would find and correct ECC errors, but if it didn't reach the line in question before the program accessed the cache line, then it would detect the error and machine check fault). The ill-fated Barcelona (Phenom) added on-the-fly correctability. Phenom 2 of course has it too.
I was pretty sure Intel had it in their L1s too. Kinda surprised to hear SPARC doesn't.
P.S. I know The Inquirer decided it was the K10, but it isn't. They're still all K8s.
The enemies of Democracy are
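A toy model of the two behaviors described above, in case the distinction isn't obvious: detect-only reads that rely on a background scrubber, versus correct-on-read. It does no real ECC math (a shadow copy stands in for what ECC could reconstruct), and `ToyCache` and `MachineCheck` are invented names, not anything from AMD's documentation.

```python
class MachineCheck(Exception):
    """Raised when a read hits an error the read path cannot fix."""


class ToyCache:
    def __init__(self, lines, correct_on_read):
        self.good = list(lines)    # stand-in for what ECC could reconstruct
        self.stored = list(lines)  # what is actually sitting in the array
        self.correct_on_read = correct_on_read

    def flip_bit(self, index, bit):
        """Simulate a soft error in one stored line."""
        self.stored[index] ^= 1 << bit

    def scrub(self):
        """Background pass: repair every damaged line, return how many were fixed."""
        fixed = 0
        for i, value in enumerate(self.stored):
            if value != self.good[i]:
                self.stored[i] = self.good[i]
                fixed += 1
        return fixed

    def read(self, index):
        if self.stored[index] != self.good[index]:
            if not self.correct_on_read:
                # Detect-only design: the scrubber did not get here in time.
                raise MachineCheck(f"uncorrected error in line {index}")
            self.stored[index] = self.good[index]  # on-the-fly correction
        return self.stored[index]
```

In the detect-only configuration a read is only safe if `scrub()` got to the line first; with `correct_on_read=True` the read path fixes the line itself.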
Sure enough it is in the Phenom datasheet, thank you.
As far as I know T1, T2, and T2+ all have only parity for the I$ and D$. All the Fujitsu SPARCs that I know of only have parity for I$ and D$ as well. ECC e-cache is the norm though.
SPARC was odd. They had all sorts of strange caches from one model to the next. Sometimes there was an I$ and D$, sometimes it was unified. Sometimes some caches were virtually tagged. There was an UltraSPARC that had the e-cache data ECC protected while the tags were on chip and only had parity checking. Also there were bad modules with flaky E$ at one point. For customers whose problems were not solved by the enhanced RAS in Solaris 9, Sun provided replacement modules with mirrored SRAM for the e-cache.
I did quite a bit of work on MCA for Itanium on Linux and it's a lot harder to do than you might think. The Itanium MCA event can occur at any time, no matter what the OS is currently doing. Locks, preempt disable, interrupt disable etc., none of those will stop an Itanium MCA event from occurring.
When an MCA occurs, the OS can be in any state; it may not even have a valid stack at that point. I have seen MCAs being raised right in the middle of the code that switches the CPU from one process to another, or in the middle of saving the user process's state and before switching to kernel state. The only way to handle this was to define a special MCA stack frame to do the error checking and recovery on. For some scary code, see the Linux kernel, arch/ia64/mca.c and arch/ia64/mca_asm.S.
Even after handling the stack switch problems, on Itanium you have no real idea what state the OS is in. The OS could have locks on critical code which prevent the MCA recovery from doing any useful work. MCA recovery is a nice idea but implementation is a bitch.