Intel's Nehalem EX To Gain Error Correction
angry tapir writes "Intel's eight-core Nehalem EX server processor will include a technology derived from its high-end Itanium chips that helps to reduce data corruption and ensure reliable server performance. The processor will include an error correction feature called MCA Recovery, which will detect and fix errors that could otherwise cause systems to crash — it will be able to detect system errors originating in the CPU or system memory and work with the operating system to correct them." Update: 05/27 19:11 GMT by T : Dave Altavilla suggests also Hot Hardware's coverage of the new chip, which includes quite a bit more information.
This is nice and all; but I, for one, will not be satisfied until Intel releases a CPU that does what I mean, not what I say.
The article seemed pretty light on details of what MCA Recovery actually does. I found this presentation in PDF format that seems to go into some more useful detail about what this is. It's not just ECC to repair single-bit errors (although that is part of it, apparently). It also includes features to recover from errors that cannot simply be corrected. For example it includes a mechanism to notify the OS of the details of an uncorrectable error, so that it could presumably re-load a page full of program code from disk, or terminate an application if its data has been corrupted, instead of shutting down the whole machine.
Error correction on an x86 chip?
Sweet. Now all those high-end server applications running on x86s that need great uptime can finally join the big boys. [rolls eyes].
Is the demand for x86 Server chips that high? I thought anyone requiring 5 nines (or anything close to it) would never consider using x86?
The story of the server market for the last 10+ years is simple: x86 has been eating everyone else's market share from the bottom up. Commodity pricing > perceived advantages of the proprietary RISC vendors. To the extent that there are real necessary features x86 lacked, it has acquired them as necessary.
There's been correctable ECC on x86 server chips for years. x86 has long since moved up-market past the point where basic RAS features (like ECC) are mandatory. Intel's Xeon has had these features for a long time. AMD Barcelona core was the first to have correctable ECC in the L1 caches -- before it could detect errors but couldn't fix them.
Basically the only new feature here is the ability to notify the OS about uncorrectable errors so that the OS can try to fix the problem by nuking the affected app, reloading a code page from disk or whatever else is appropriate so that a system reboot isn't always necessary on uncorrectable errors.
Yeah this is something the "big boys" already had, fat consolation that will be now that x86 is poised to eat their lunch. Not even Intel themselves could reverse the trend when they tried. They could use features like this to differentiate Itanium all they want, at the end of the day the customer says "yeah that's great, but can you do it in an x86 chip?" This is just them bowing to the demands of the market (in order to make mega $$).
The enemies of Democracy are
x86 is slow and under performing architecture, and I am surprise that Intel is bolting error correction on top of it.
Hogwash. There's nothing inherently slow about x86. The ISA is nothing but an interface. Internally, the CISC instructions are decoded into simple micro-ops, so all the predictions about how x86 would fall behind because it wouldn't be able to have out of order execution etc were proven wrong. It's not easy to make x86 chips, but the difficult performance problems have been solved.
So don't be surprised, it's just another step in the plain obvious trend that has been going on for over a decade now. With no performance disadvantage, and a big price advantage, x86 has been moving into the server market in a big way. The only thing holding it back is the lack of RAS features, which are just as easy to "bolt on" to x86 as any other instruction set. It's just there was no reason to add these features for desktop or low-end servers.
The Intel instruction set is so complicated that often times a single bit being flipped means it is still a very much valid opcode which when executed will do something completely different from what you expect it to do.
The same is true of RISC, flip a bit in the opcode field and there's a good chance it's still a valid opcode. Not that it matters one whit; flipped bits in the instruction stream are detected via ECC in the instruction cache, not by praying the decoders see it as an invalid instruction.
This seems to be nothing short of a stopgap measure for not losing more customers to the big iron manufacturers like Sun and IBM who both have their own CPU's that were built with stability in mind.
FUD like this is nothing but a stopgap measure for the RISC vendors to lose customers a little more slowly to x86 than they already are. Of course rather than just losing customers, Sun and IBM (and other former RISC vendors) sell solutions that use x86. It's only a matter of time before this trend hits even the "big iron". As x86 erodes their margins from beneath, for how long will it make sense to spend the money to develop the RISC chips for an ever-decreasing slice of the pie? Eventually it makes more sense to just demand that Intel add whatever RAS features it lacks compared to the RISC chip it'll be replacing, which is exactly what is happening here (only in this case it's EPIC that's on the chopping block).
Apple being able to transition from PowerPC to x86 is quite a feat, but x86 transitioning to the next big thing is going to be impossible without at least backwards compatibility in the form of x86 emulation, and boy is the x86 instruction set fun to emulate!
Well you certainly got that right. The only real disadvantage of x86 itself is that it is a huge pain in the ass to make work properly, and a lot of the magic isn't in the ISA docs but rather in the institutional knowledge of the two remaining firms that make the chips. x86 raises the already incredibly high barrier to entry for new chip manufacturers. That, not performance or (potential) reliability, is the reason x86 sucks.
The enemies of Democracy are