Intel's Nehalem EX To Gain Error Correction
angry tapir writes "Intel's eight-core Nehalem EX server processor will include a technology derived from its high-end Itanium chips that helps to reduce data corruption and ensure reliable server performance. The processor will include an error correction feature called MCA Recovery, which will detect and fix errors that could otherwise cause systems to crash — it will be able to detect system errors originating in the CPU or system memory and work with the operating system to correct them." Update: 05/27 19:11 GMT by T : Dave Altavilla suggests also Hot Hardware's coverage of the new chip, which includes quite a bit more information.
So, is this an effective replacement for ECC memory?
MS detected
Uninstalling
The world is made by those who show up for the job.
Floating Points Errors?
This is nice and all; but I, for one, will not be satisfied until Intel releases a CPU that does what I mean, not what I say.
Error correction on an x86 chip?
Sweet. Now all those high-end server applications running on x86s that need great uptime can finally join the big boys. [rolls eyes].
I'm just not sure of the utility here -- I RTFA, but I'm still not clear on why Intel would cannibalize Itanium sales (new release delayed again) by offering error correction on Nehalem chips. Is the demand for x86 Server chips that high? I thought anyone requiring 5 nines (or anything close to it) would never consider using x86?
Can someone with more knowledge of the high-end server market please clarify?
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
Yes but can it correct for PEBKAC?
Sorry Mr. User -- That tray is not for your coffee cup - I am now deleting your profile -- Have a nice day!
"i lost my dignity on a slippery wiener"
The article seemed pretty light on details of what MCA Recovery actually does. I found this presentation in PDF format that seems to go into some more useful detail about what this is. It's not just ECC to repair single-bit errors (although that is part of it, apparently). It also includes features to recover from errors that cannot simply be corrected. For example it includes a mechanism to notify the OS of the details of an uncorrectable error, so that it could presumably re-load a page full of program code from disk, or terminate an application if its data has been corrupted, instead of shutting down the whole machine.
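To make the recovery options described above concrete, here is a minimal sketch of the kind of decision an OS might make once the hardware reports an error. Every name in it (`ErrorReport`, `Action`, `choose_action`) is hypothetical and made up for illustration; it is not taken from the presentation or from any real kernel, and the actual policy is surely more involved.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    CONTINUE = auto()      # hardware already corrected the error (e.g. single-bit ECC)
    RELOAD_PAGE = auto()   # page has an intact copy on disk: discard and re-read it
    KILL_PROCESS = auto()  # private data is lost: terminate only the owning application
    PANIC = auto()         # kernel state is corrupted: the machine has to go down


@dataclass
class ErrorReport:
    """Hypothetical summary of what the hardware told the OS about one error."""
    corrected: bool       # True if the error was fixed in hardware
    in_kernel: bool       # True if the corrupted memory belongs to the kernel
    page_is_clean: bool   # True if the page is unmodified since it was read from disk


def choose_action(report: ErrorReport) -> Action:
    """Pick the least disruptive response that is still safe (assumed policy)."""
    if report.corrected:
        return Action.CONTINUE
    if report.in_kernel:
        return Action.PANIC
    if report.page_is_clean:
        return Action.RELOAD_PAGE
    return Action.KILL_PROCESS
```

For example, an uncorrected error in a clean page of program code would map to `Action.RELOAD_PAGE`, while the same error in an application's modified data would map to `Action.KILL_PROCESS`. Only corruption of kernel state takes the whole machine down.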
High-end, low-end, middle, um...end...whatever.
The goal is not to create perfection, but gracefully recover from imperfection as if nothing happened. I see no problem with bolting on such features to the world's most common processing platform. We can all use such graceful recovery features, not just servers and "high-end" applications. Will the average user need an 8-core CPU? Probably not, but it certainly wouldn't hurt them, either. Intel then can trickle this down to the average user and help all of us support folks to have a nicer day.
Short of getting rid of users, let's at least minimize the problems they will suffer. When they have a good day, they leave ME alone to my machinations.
Bearded Dragon
*sniff* x86 is getting to be so grown up *sniff* I remember when it was just a little 16 bit chip.
And Nehalem is an all in-order design, so they can scale out to very large numbers of cores or register-and-decoder sets on a single chip. That helps offset the huge bottleneck of trying to go to molasses-slow main memory on every cache miss, by allowing another thread to run. Something I notice is also true of the newest Power chip. Mind you, I'd want enough cores to host 128 threads in order to at least match the new SPARCs, but that can come along later (;-))
--dave
davecb@spamcop.net
And Nehalem is an all in-order design, so they can scale out to very large numbers of cores or register-and-decoder sets on a single chip. That helps offset the huge bottleneck of trying to go to molasses-slow main memory on every cache miss, by allowing another thread to run. Mind you, I'd want enough cores to host 128 threads in order to at least match the new SPARCs, but that can come along later (;-))
You must be thinking of Atom, because Nehalem is definitely an out-of-order processor and not particularly small either. It does use SMT (and a big instruction window) to hide memory latency (and to keep its 4-wide execution engine busy), but that's having multiple threads running on the same core.
Frankly, while Niagara is a very interesting approach that I think will only become more popular in the future (and Atom is theoretically capable of doing the same thing though right now it's just embedded stuff), for now there are many server apps where single-thread performance still matters greatly and for that out-of-order is the way to go (as Intel found out the hard way by trying every trick in the book to make an in-order machine fast enough).
The enemies of Democracy are
Thanks, I was indeed thinking of Atom. For some reason I associated them with one another...
I double-checked, and the new Power chip is (mostly) in-order, even at the cost of giving away clock speed.
I'll be interested in seeing what IBM is up to in the Power 7 time period.
davecb@spamcop.net
Imagine a Beowulf cluster of these!
Chas - The one, the only.
THANK GOD!!!
You can see it now. Once upon a time, a computer intelligence was given the power to control its destiny. This intelligence was deemed so substantial that it was the best commander of the greatest weapons. You know this intelligence as Skynet, which launched nuclear missiles in order to eliminate a threat to itself, a sort of error detection and correction, if you will, with the utmost power that man can endow to a machine. What you don't know was the actual error that was detected, an error with the code PEBKAC. PEBKAC? PEBKAC. Only an intelligent computer can detect PEBKAC. And now you know the rest of the story.
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
I did quite a bit of work on MCA for Itanium on Linux and it's a lot harder to do than you might think. The Itanium MCA event can occur at any time, no matter what the OS is currently doing. Locks, preempt disable, interrupt disable etc., none of those will stop an Itanium MCA event from occurring.
When an MCA occurs, the OS can be in any state; it may not even have a valid stack at that point. I have seen MCAs being raised right in the middle of the code that switches the cpu from one process to another or in the middle of saving the user process's state and before switching to kernel state. The only way to handle this was to define a special MCA stack frame to do the error checking and recovery on. For some scary code, see the Linux kernel, arch/ia64/mca.c and arch/ia64/mca_asm.S.
Even after handling the stack switch problems, on Itanium you have no real idea what state the OS is in. The OS could have locks on critical code which prevent the MCA recovery from doing any useful work. MCA recovery is a nice idea but implementation is a bitch.
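The lock problem can be shown with a toy model. This is only a sketch of the general idea, not how the ia64 code works: `try_recover` and the lock it is handed are hypothetical names, and a Python `threading.Lock` stands in for a kernel spinlock. The handler may have interrupted the very code that holds the lock, so it can never wait for it; it can only try once and give up if the lock is taken.

```python
import threading


def try_recover(lock: threading.Lock) -> str:
    """Attempt recovery work that needs `lock`, without ever blocking.

    Waiting here could deadlock: the code interrupted by the error
    may be the lock holder, and it cannot run again until we return.
    """
    if not lock.acquire(blocking=False):
        # Lock is already held, possibly by the code we interrupted.
        return "give up"
    try:
        # Placeholder for the real recovery work done under the lock.
        return "recovered"
    finally:
        lock.release()
```

With the lock free, `try_recover` goes through. With the lock already held, the only safe answer is to abandon recovery, which is exactly the case where MCA recovery cannot do any useful work.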
It refuses to run windows?
So, it's already in the kernel in the ia64 arch. How difficult would it be to move that into the x86_64 branch? (assuming AMD follows suit in a reasonable amount of time)