Slashdot Mirror


Intel's Nehalem EX To Gain Error Correction

angry tapir writes "Intel's eight-core Nehalem EX server processor will include a technology derived from its high-end Itanium chips that helps to reduce data corruption and ensure reliable server performance. The processor will include an error correction feature called MCA Recovery, which will detect and fix errors that could otherwise cause systems to crash — it will be able to detect system errors originating in the CPU or system memory and work with the operating system to correct them." Update: 05/27 19:11 GMT by T : Dave Altavilla suggests also Hot Hardware's coverage of the new chip, which includes quite a bit more information.

24 of 80 comments

  1. Not nearly good enough... by fuzzyfuzzyfungus · · Score: 5, Funny

    This is nice and all; but I, for one, will not be satisfied until Intel releases a CPU that does what I mean, not what I say.

    1. Re:Not nearly good enough... by causality · · Score: 2, Funny

      Why don't you just say what you mean, for the time being?

      That's too easy. We'll never advance the state of the art with that kind of thinking!

      --
      It is a miracle that curiosity survives formal education. - Einstein
    2. Re:Not nearly good enough... by dzfoo · · Score: 2, Funny

      So does your computer currently do what you say?

      Mine, I can barely get it to do what I type!

              -dZ.

      --
      Carol vs. Ghost
      ...Can you save Christmas?
  2. x86 by Red+Flayer · · Score: 2, Insightful

    Error correction on an x86 chip?

    Sweet. Now all those high-end server applications running on x86s that need great uptime can finally join the big boys. [rolls eyes].

    I'm just not sure of the utility here -- I RTFA, but I'm still not clear on why Intel would cannibalize Itanium sales (new release delayed again) by offering error correction on Nehalem chips. Is the demand for x86 Server chips that high? I thought anyone requiring 5 nines (or anything close to it) would never consider using x86?

    Can someone with more knowledge of the high-end server market please clarify?

    --
    "Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
    1. Re:x86 by Amouth · · Score: 3, Insightful

      that's it.. i don't think this is aimed at the "high end" but rather at the middle ground..

      people running farms or VMs or even large DBs but not exactly in need of mainframe or HPC.

      while i agree there are a lot of options other than x86.. x86 is growing and isn't going to go away.. and EM64T has just solidified it.. adding something like this is a welcome evolution of the area.

      and they aren't cannibalizing the Itanium sales - while yes the Itanium is selling better than before.. there is nowhere near the market for it as for x86 chips.

      --
      '...if only "Jumping to a Conclusion" was an event in the Olympics.'
    2. Re:x86 by 0x000000 · · Score: 2, Insightful

      The more interesting thing is to see how this technology is going to work and whether other manufacturers will be able to implement this in their chips.

      x86 is a slow and underperforming architecture, and I am surprised that Intel is bolting error correction on top of it. The Intel instruction set is so complicated that often times a single bit being flipped means it is still a very much valid opcode which when executed will do something completely different from what you expect it to do.

      This seems to be nothing short of a stopgap measure for not losing more customers to the big iron manufacturers like Sun and IBM who both have their own CPUs that were built with stability in mind.

      x86 has moved into areas where it simply is not going to shine as brilliantly as it did on the desktop. The only issue is that moving to a new platform is going to be catastrophic in that too many people rely on it. Apple being able to transition from PowerPC to x86 is quite a feat, but x86 transitioning to the next big thing is going to be impossible without at least backwards compatibility in the form of x86 emulation, and boy is the x86 instruction set fun to emulate!

      --
      cat /dev/null > .signature
    3. Re:x86 by Lally+Singh · · Score: 4, Insightful

      They're not, nobody buys Itanium. They're going after SPARC and POWER. Lots of people are looking at the speed and throughput of modern x86 and noticing the price difference. Especially in this economy.

      And with Ellison in control of SPARC, it's the best way to go.

      --
      Care about electronic freedom? Consider donating to the EFF!
    4. Re:x86 by Chris+Burke · · Score: 5, Insightful

      Error correction on an x86 chip?

      Sweet. Now all those high-end server applications running on x86s that need great uptime can finally join the big boys. [rolls eyes].

      Is the demand for x86 Server chips that high? I thought anyone requiring 5 nines (or anything close to it) would never consider using x86?

      The story of the server market for the last 10+ years is simple: x86 has been eating everyone else's market share from the bottom up. Commodity pricing > perceived advantages of the proprietary RISC vendors. To the extent that there are real necessary features x86 lacked, it has acquired them as necessary.

      There's been correctable ECC on x86 server chips for years. x86 has long since moved up-market past the point where basic RAS features (like ECC) are mandatory. Intel's Xeon has had these features for a long time. AMD Barcelona core was the first to have correctable ECC in the L1 caches -- before it could detect errors but couldn't fix them.

      Basically the only new feature here is the ability to notify the OS about uncorrectable errors so that the OS can try to fix the problem by nuking the affected app, reloading a code page from disk or whatever else is appropriate so that a system reboot isn't always necessary on uncorrectable errors.

      Yeah this is something the "big boys" already had, fat consolation that will be now that x86 is poised to eat their lunch. Not even Intel themselves could reverse the trend when they tried. They could use features like this to differentiate Itanium all they want, at the end of the day the customer says "yeah that's great, but can you do it in an x86 chip?" This is just them bowing to the demands of the market (in order to make mega $$).

      --

      The enemies of Democracy are
    5. Re:x86 by Chris+Burke · · Score: 5, Informative

      x86 is a slow and underperforming architecture, and I am surprised that Intel is bolting error correction on top of it.

      Hogwash. There's nothing inherently slow about x86. The ISA is nothing but an interface. Internally, the CISC instructions are decoded into simple micro-ops, so all the predictions about how x86 would fall behind because it wouldn't be able to have out of order execution etc were proven wrong. It's not easy to make x86 chips, but the difficult performance problems have been solved.

      So don't be surprised, it's just another step in the plain obvious trend that has been going on for over a decade now. With no performance disadvantage, and a big price advantage, x86 has been moving into the server market in a big way. The only thing holding it back is the lack of RAS features, which are just as easy to "bolt on" to x86 as any other instruction set. It's just there was no reason to add these features for desktop or low-end servers.

      The Intel instruction set is so complicated that often times a single bit being flipped means it is still a very much valid opcode which when executed will do something completely different from what you expect it to do.

      The same is true of RISC, flip a bit in the opcode field and there's a good chance it's still a valid opcode. Not that it matters one whit; flipped bits in the instruction stream are detected via ECC in the instruction cache, not by praying the decoders see it as an invalid instruction.

      This seems to be nothing short of a stopgap measure for not losing more customers to the big iron manufacturers like Sun and IBM who both have their own CPUs that were built with stability in mind.

      FUD like this is nothing but a stopgap measure for the RISC vendors to lose customers a little more slowly to x86 than they already are. Of course rather than just losing customers, Sun and IBM (and other former RISC vendors) sell solutions that use x86. It's only a matter of time before this trend hits even the "big iron". As x86 erodes their margins from beneath, for how long will it make sense to spend the money to develop the RISC chips for an ever-decreasing slice of the pie? Eventually it makes more sense to just demand that Intel add whatever RAS features it lacks compared to the RISC chip it'll be replacing, which is exactly what is happening here (only in this case it's EPIC that's on the chopping block).

      Apple being able to transition from PowerPC to x86 is quite a feat, but x86 transitioning to the next big thing is going to be impossible without at least backwards compatibility in the form of x86 emulation, and boy is the x86 instruction set fun to emulate!

      Well you certainly got that right. The only real disadvantage of x86 itself is that it is a huge pain in the ass to make work properly, and a lot of the magic isn't in the ISA docs but rather in the institutional knowledge of the two remaining firms that make the chips. x86 raises the already incredibly high barrier to entry for new chip manufacturers. That, not performance or (potential) reliability, is the reason x86 sucks.

      --

      The enemies of Democracy are
    6. Re:x86 by Anonymous Coward · · Score: 4, Insightful

      x86 is a slow and underperforming architecture

      So right there you've destroyed your credibility. You couldn't be any more wrong if your name was W. Wrongy Wrongenstein.

      Right now, x86 processors are the highest performance in the world.

      and I am surprised that Intel is bolting error correction on top of it

      Well, that just shows you aren't paying attention to the trends of where x86 is going any more than you've been paying attention to its performance. x86 has been gradually moving up market into higher and higher tiers of servers for well over a decade now.

      The Intel instruction set is so complicated that often times a single bit being flipped means it is still a very much valid opcode which when executed will do something completely different from what you expect it to do.

      And now we see that you don't have much clue about instruction set encoding, either.

      There is literally no commercially viable instruction set for which the above is NOT true. Look at a traditional RISC instruction set with 3 operands and 32 GPRs. Almost half of the bits (15 of them) in every 32-bit ALU instruction for such a processor are register addresses. Flip any of those bits and the register address is still valid -- there are no invalid addresses, so the processor can't tell the difference between the wrong address and the right one. The remainder of the bits in such an instruction are typically instruction format select, opcode select, and miscellaneous control bits. Flip an opcode bit and you'll get the wrong ALU op, more often than not... processor designers leave some room for adding opcodes, but typically not a lot.
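      A rough way to see the point above in numbers. This is only a sketch: the 32-bit layout (6-bit opcode, three 5-bit register fields, 11 miscellaneous bits) and the choice of 48 defined opcodes out of 64 are made-up assumptions for illustration, not any real instruction set.

```python
# Toy 32-bit RISC-style format (hypothetical, for illustration only):
#   [31:26] opcode | [25:21] rd | [20:16] rs1 | [15:11] rs2 | [10:0] misc
VALID_OPCODES = set(range(48))  # assumption: 48 of the 64 opcode slots are defined


def encode(opcode, rd, rs1, rs2, misc=0):
    """Pack the fields into one 32-bit instruction word."""
    return (opcode << 26) | (rd << 21) | (rs1 << 16) | (rs2 << 11) | misc


def is_valid(word):
    """In this toy format only the opcode field can be invalid.

    Every 5-bit register number names a real register, and the misc bits
    are treated as always legal (a simplifying assumption).
    """
    return ((word >> 26) & 0x3F) in VALID_OPCODES


def surviving_flips(word):
    """Count the single-bit flips that still decode as a valid instruction."""
    return sum(is_valid(word ^ (1 << bit)) for bit in range(32))


if __name__ == "__main__":
    word = encode(opcode=20, rd=1, rs1=2, rs2=3)
    print(surviving_flips(word), "of 32 single-bit flips still decode")
```

      Even with a quarter of the opcode space left undefined, nearly every flip still looks like a legitimate instruction, which is why detection has to come from ECC or parity on the stored bits rather than from the decoder.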

      See, the only way an instruction set can guard against bit flips is not by simplicity (as you implicitly claim), it's by being horribly wasteful. When people design instruction encodings, they look at the width of all the bit fields in each instruction format and use the smallest they can get away with. Instruction sets which aren't efficiently packed aren't any good: they use more memory to store program code, have reduced effective icache size for the same number of bits in silicon, tend to have major clumsiness (such as too-small immediate operand sizes, or too-small relative branch windows),and so forth. Efficient packing always means there are very few invalid bit patterns for each field in the instruction; if you have a lot of invalid patterns you probably could be packing the instruction tighter. Few invalid patterns means that most bit flips still produce a valid instruction.

      This seems to be nothing short of a stopgap measure for not losing more customers to the big iron manufacturers like Sun and IBM who both have their own CPUs that were built with stability in mind.

      Idiot. Intel isn't losing big iron marketshare to IBM and Sun. It's taking big iron marketshare from them. Adding big iron RAS features to x86 is the next step in that trend.

      x86 has moved into areas where it simply is not going to shine as brilliantly as it did on the desktop. The only issue is that moving to a new platform is going to be catastrophic in that too many people rely on it. Apple being able to transition from PowerPC to x86 is quite a feat, but x86 transitioning to the next big thing is going to be impossible without at least backwards compatibility in the form of x86 emulation, and boy is the x86 instruction set fun to emulate!

      1990 called, and it wants its foolish predictions of where x86 cannot go back.

      Much better informed people than you thought, back then, that x86 could never be a workstation or server CPU in any capacity at all. It was just a personal computer processor, and a rather ugly and slow one at that.

      Instead, Intel proved they could make fast x86 processors, and steadily increased x86 presence in the workstation and low end server market throughout the 90s, with an assis

    7. Re:x86 by bloodhawk · · Score: 2, Insightful

      I am working in an organisation that was unlucky enough to fall for the HP bullshit about how wonderful Itanium was. We spent close to 2 million on high-end Itanium boxes over the past few years. We have now classified ALL of them as up for asset replacement so we can get rid of the bastards as early as possible (2 years before normal end of life for us). So many vendors simply don't have software that works on Itanium, or it only works in a more limited, cut-down way. Hell, even MS, which supposedly supports them, doesn't have most of their software Itanium-compatible.

  3. Re:ECC memory replacement? by Anonymous Coward · · Score: 4, Informative

    This will fix many errors affecting the processor itself (new manufacturing processes make transistors quite vulnerable to interference and aging). ECC will still be needed for correcting errors affecting data while it is stored in main memory.

    Parity will be needed for protecting caches (possibly ECC will be used in the future). Checksums for data on the hard drive. CRCs for packets on the network. And so on...
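    For anyone wondering what the practical difference between parity and ECC is, here is a small sketch. It uses the textbook Hamming(7,4) code purely as a stand-in; real ECC memory uses wider codes over whole memory words, so treat the sizes here as an assumption for illustration.

```python
def parity_bit(bits):
    """Even parity: the extra bit makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits, stored_parity):
    """Parity can only say 'something changed', never which bit."""
    return parity_bit(bits) == stored_parity


def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (repaired codeword, 1-based position of the flipped bit or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1
    return c, position


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    code = hamming74_encode(data)
    damaged = list(code)
    damaged[4] ^= 1  # flip one stored bit
    repaired, where = hamming74_correct(damaged)
    print("repaired correctly:", repaired == code, "at position", where)
```

    Parity notices a single flipped bit but cannot repair it, and it misses a double flip entirely; the Hamming code spends more check bits to locate and fix any single flipped bit.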

  4. Re:ECC memory replacement? by Anonymous Coward · · Score: 2, Informative

    No. ECC only corrects certain issues in the memory. It cannot help with memory controller errors, nor with register or TLB errors.

  5. More detail on MCA Recovery by FishBike · · Score: 5, Informative

    The article seemed pretty light on details of what MCA Recovery actually does. I found this presentation in PDF format that seems to go into some more useful detail about what this is. It's not just ECC to repair single-bit errors (although that is part of it, apparently). It also includes features to recover from errors that cannot simply be corrected. For example it includes a mechanism to notify the OS of the details of an uncorrectable error, so that it could presumably re-load a page full of program code from disk, or terminate an application if its data has been corrupted, instead of shutting down the whole machine.
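    To make the recovery idea concrete, here is a minimal sketch of the kind of decision an OS might make when it is told about an error. Everything in it (the `ErrorReport` fields, the `Action` names, the order of the checks) is a hypothetical illustration, not Intel's interface or any real kernel's policy.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    CONTINUE = auto()      # hardware already repaired it, e.g. a single-bit ECC fix
    RELOAD_PAGE = auto()   # an identical copy exists on disk, so re-read it
    KILL_PROCESS = auto()  # user data is lost, so terminate only the owner
    PANIC = auto()         # kernel state is suspect, so take the machine down


@dataclass
class ErrorReport:
    """Hypothetical summary of what the hardware tells the OS."""
    corrected: bool       # True if the hardware fixed the error itself
    kernel_owned: bool    # True if the affected page belongs to the kernel
    has_clean_copy: bool  # True if the page is unmodified and backed by disk


def choose_action(report: ErrorReport) -> Action:
    """Pick the least drastic response that is still safe (toy policy)."""
    if report.corrected:
        return Action.CONTINUE
    if report.kernel_owned:
        return Action.PANIC
    if report.has_clean_copy:
        return Action.RELOAD_PAGE
    return Action.KILL_PROCESS


if __name__ == "__main__":
    code_page = ErrorReport(corrected=False, kernel_owned=False, has_clean_copy=True)
    print(choose_action(code_page))
```

    The point is only that an uncorrectable error stops being an automatic reboot: with enough detail about where the error landed, the damage can often be confined to one page or one process.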

    1. Re:More detail on MCA Recovery by mzs · · Score: 3, Informative

      Read the fmd, fmadm, and fmstat man pages on Solaris. There is also at least one memory scrubber kthread and you can look at memscrub_scans_done to see how far it has gone along. Lots of hardware is being checked periodically, in fact on some hardware even the FP units of the processors are periodically checked for faults. Some sparcs even have instruction retry in the case of a detected error. There is even memory mirroring on M4000 and above servers, that is like RAID-1 for memory, say a chip on a DIMM fails, you still can run, then use fmadm and replace the faulty DIMM. There are also the sorts of things you outlined above where a page is reread if not modified and only causes a SIGSEGV if that page is ever used again. In ZFS there is end to end hashing to detect and correct errors.

      Of course all of this pales in comparison to what has been available on mainframes for a generation.

  6. Re:ECC memory replacement? by davecb · · Score: 4, Insightful

    I'm a bit surprised this is only seeing the light now: as we get smaller and faster, the number of errors observed goes up amazingly.

    Back in the stone age, Cray computers didn't even have parity memory, partly because they were willing to re-run programs but mostly because errors were unlikely. Cray himself famously said "parity is for farmers".

    These days, errors are very common, and I'm literally amazed that x86s don't have better-than-ECC error detection and correction. All the commercial Unix vendors have them.

    --dave

    --
    davecb@spamcop.net
  7. Re:ECC memory replacement? by Jah-Wren+Ryel · · Score: 4, Informative

    These days, errors are very common, and I'm literally amazed that x86s don't have better-than-ECC error detection and correction. All the commercial Unix vendors have them.

    Intel's been trying to 'protect' the market for Itanium - those CPUs have had it for years, probably from day 1. HP definitely markets MCA as a big feature of their Itanium-based systems.

    If AMD were smart, they would have incorporated it into their Opteron line just like they did x64 to cut Intel off at the knees.

    --
    When information is power, privacy is freedom.
  8. Re:ECC memory replacement? by mzs · · Score: 2, Informative

    State of the non-mainframe art with regards to RAS right now is ECC RAM with mirroring, parity cache, ECC e-cache, hashes that detect and fix multiple bit errors for storage end to end, CRC (ethernet) and cksum (TCP, UDP) (but can you trust the nic offloading engine?), instruction retry, and fp scrubbing, in addition to what has been around for the last five years or so.
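    On the CRC versus cksum distinction: the two are not equally strong. A small sketch, assuming the 16-bit ones' complement sum described in RFC 1071 for the TCP/UDP-style checksum and using zlib's CRC-32 as a stand-in for a link-layer CRC (the sample bytes are arbitrary):

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    original = b"\x12\x34\x56\x78\x9a\xbc"
    swapped = b"\x56\x78\x12\x34\x9a\xbc"  # first two 16-bit words exchanged

    same_sum = internet_checksum(original) == internet_checksum(swapped)
    same_crc = zlib.crc32(original) == zlib.crc32(swapped)
    print("checksum fooled by reordering:", same_sum)
    print("CRC-32 fooled by reordering:", same_crc)
```

    Because addition does not care about order, the simple sum cannot see reordered 16-bit words, while the CRC can. That is part of why the layers are stacked rather than any one of them being trusted alone.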

  9. x86 by confused+one · · Score: 2, Funny

    *sniff* x86 is getting to be so grown up *sniff* I remember when it was just a little 16 bit chip.

  10. Re:x86 coming up from below by Chris+Burke · · Score: 2, Insightful

    And Nehalem is an all in-order design, so they can scale out to very large numbers of cores or register-and-decoder sets on a single chip. That helps offset the huge bottleneck of trying to go to molasses-slow main memory on every cache miss, by allowing another thread to run. Mind you, I'd want enough cores to host 128 threads in order to at least match the new SPARCs, but that can come along later (;-))

    You must be thinking of Atom, because Nehalem is definitely an out-of-order processor and not particularly small either. It does use SMT (and a big instruction window) to hide memory latency (and to keep its 4-wide execution engine busy), but that's having multiple threads running on the same core.

    Frankly while Niagara is a very interesting approach that I think will only become more popular in the future (and Atom is theoretically capable of doing the same thing though right now it's just embedded stuff), for now there are many server apps where single-thread performance still matters greatly and for that out-of-order is the way to go (as Intel found out the hard way by trying every trick in the book to make an in-order machine fast enough).

    --

    The enemies of Democracy are
  11. Christ! Can't believe anyone hasn't used this yet by Chas · · Score: 2, Funny

    Imagine a Beowulf cluster of these!

    --


    Chas - The one, the only.
    THANK GOD!!!
  12. Re:ECC memory replacement? by Chris+Burke · · Score: 2, Informative

    The original Opteron had L1 ECC, it just wasn't correctable if encountered on a read or write (there was a scrubber that would find and correct ECC errors, but if it didn't reach the line in question before the program accessed the cache line, then it would detect the error and machine check fault). The ill-fated Barcelona (Phenom) added on-the-fly correctability. Phenom 2 of course has it too.
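    The scrubber race described above can be modelled in a few lines. This is a toy model only: the `ToyCache` class, its line count, and the one-line-per-step scrubber are invented for illustration and say nothing about how the real hardware is built.

```python
class MachineCheck(Exception):
    """Raised when a demand access hits an error that cannot be fixed on the fly."""


class ToyCache:
    def __init__(self, n_lines, correct_on_access):
        self.bad = [False] * n_lines            # True means the line holds a flipped bit
        self.correct_on_access = correct_on_access
        self.scrub_pos = 0

    def inject_error(self, line):
        self.bad[line] = True

    def scrub_step(self):
        """Background scrubber: visit one line, repair it if needed, move on."""
        self.bad[self.scrub_pos] = False
        self.scrub_pos = (self.scrub_pos + 1) % len(self.bad)

    def read(self, line):
        """Demand access by the running program."""
        if self.bad[line]:
            if not self.correct_on_access:
                raise MachineCheck(f"uncorrected ECC error in line {line}")
            self.bad[line] = False              # repaired on the fly
        return line


if __name__ == "__main__":
    scrub_only = ToyCache(8, correct_on_access=False)
    scrub_only.inject_error(5)
    for _ in range(3):                          # scrubber has only reached lines 0..2
        scrub_only.scrub_step()
    try:
        scrub_only.read(5)
    except MachineCheck as exc:
        print("scrub-only design:", exc)

    on_the_fly = ToyCache(8, correct_on_access=True)
    on_the_fly.inject_error(5)
    print("on-the-fly design read line", on_the_fly.read(5))
```

    With scrubbing alone, whether you survive depends on who reaches the bad line first; correcting on access removes that race.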

    I was pretty sure Intel had it in their L1s too. Kinda surprised to hear SPARC doesn't.

    P.S. I know The Inquirer decided it was the K10, but it isn't. They're still all K8s.

    --

    The enemies of Democracy are
  13. Re:ECC memory replacement? by mzs · · Score: 2, Interesting

    Sure enough it is in the Phenom datasheet, thank you.

    As far as I know T1, T2, and T2+ all have only parity for the I$ and D$. All the Fujitsu sparcs that I know of only have parity for I$ and D$ as well. ECC e-cache is the norm though.

    Sparc was odd. They had all sorts of strange caches from one model to the next. Sometimes there was an I$ and D$, sometimes it was unified. Sometimes some caches were virtually tagged. There was an ultrasparc that had the e-cache data ECC protected and the tags were on chip and only had parity checking. Also there were bad modules with flakey E$ at one point. Sun provided customers that had problems that the enhanced RAS in Solaris 9 did not solve with replacement modules with mirrored SRAM for the e-cache.

  14. Itanium MCA is a lot harder than you think by Anonymous Coward · · Score: 2, Informative

    I did quite a bit of work on MCA for Itanium on Linux and it's a lot harder to do than you might think. The Itanium MCA event can occur at any time, no matter what the OS is currently doing. Locks, preempt disable, interrupt disable etc., none of those will stop an Itanium MCA event from occurring.

    When an MCA occurs, the OS can be in any state, it may not even have a valid stack at that point. I have seen MCAs being raised right in the middle of the code that switches the cpu from one process to another or in the middle of saving the user process's state and before switching to kernel state. The only way to handle this was to define a special MCA stack frame to do the error checking and recovery on. For some scary code, see the Linux kernel, arch/ia64/mca.c and arch/ia64/mca_asm.S.

    Even after handling the stack switch problems, on Itanium you have no real idea what state the OS is in. The OS could have locks on critical code which prevent the MCA recovery from doing any useful work. MCA recovery is a nice idea but implementation is a bitch.