Slashdot Mirror


Errata Prompts Intel To Disable TSX In Haswell, Early Broadwell CPUs

Dr. Damage writes: The TSX instructions built into Intel's Haswell CPU cores haven't become widely used by everyday software just yet, but they promise to make certain types of multithreaded applications run much faster than they can today. Some of the savviest software developers are likely building TSX-enabled software right about now. Unfortunately, that work may have to come to a halt, thanks to a bug—or "errata," as Intel prefers to call them—in Haswell's TSX implementation that can cause critical software failures. To work around the problem, Intel will disable TSX via microcode in its current CPUs — and in early Broadwell processors, as well.

9 of 131 comments

  1. Not all that surprising... by K. S. Kyosuke · · Score: 5, Interesting

    So, basically, they've just been forced to get rid of the most complex (that's why it's not all that surprising) yet also most beneficial feature with regard to server loads? I'm sure there are some Opterons laughing right now.

    --
    Ezekiel 23:20
    1. Re:Not all that surprising... by gstoddart · · Score: 5, Funny

      What of the folks that purchased these chips for these specific instructions?

      Same as happens to all early adopters -- the feature may or may not work, and even if it does, there's no guarantee it will be supported (or the same) in the next version.

      This is a pretty big 'errata', which is awesome marketing speak for "really bad QA".

      Engineers Release Really Awful Tech. Awesome!

      --
      Lost at C:>. Found at C.
    2. Re:Not all that surprising... by gman003 · · Score: 4, Informative

      I'm sure there are some Opterons laughing right now.

      Yes, but some of them take a while to get the joke because their TLB had to be disabled.

      (Certain releases of the "Barcelona" Opterons had a bug that could lock up the system. A workaround would prevent it, but had a stiff performance penalty. Later steppings had it fixed.)

    3. Re:Not all that surprising... by ShanghaiBill · · Score: 4, Informative

      See also Pentium 5 and the FDIV bug. It falls under "too bad, so sad, try your luck with the next revision".

      No. Intel offered to replace any P5 with the FDIV bug upon request. Most customers did not request a replacement, but the option was available.

    4. Re:Not all that surprising... by CajunArson · · Score: 5, Insightful

      Nobody has been robbed.
      TSX today works exactly as well as TSX worked yesterday, and considering that Haswell has been on the market for over 1 year, I assure you that anybody who has been chomping at the bit to use TSX has been using TSX.

      If the TSX erratum were trivially easy to trigger, then this article would have been posted last spring before Haswell even launched.

      Intel has done the responsible thing by acknowledging the bug (trust me son, AMD & Nvidia often don't bother with that part of the process) and giving developers the OPTION to either use TSX as-is or disable it to ensure that it cannot cause instability no matter what weird operating conditions can occur.

      Tell ya what, why don't you take all your nerd-rage over to AMD or ARM where they won't rob you of all kinds of advanced features that they just don't bother to implement at all.

      --
      AntiFA: An abbreviation for Anti First Amendment.
    5. Re:Not all that surprising... by EvilJoker · · Score: 4, Informative

      I know this was a troll, but I feel compelled to reply in case someone doesn't know.

      ALL CPUs have errata. Some are more significant than others.

      A quick Google for "AMD errata" turned up the Revision Guide for AMD Family 16h Models 00h-0Fh, published June 2013, which applies to AMD's Mobile A, E, and G series and the Opteron X1100/X2100. (These are modern CPUs.)

      There are 21 entries, with descriptions, system impact, and suggested workaround (if any)

      Haswell's errata has 131 entries

    6. Re:Not all that surprising... by Anonymous Coward · · Score: 4, Informative

      See also Pentium 5 and the FDIV bug. It falls under "too bad, so sad, try your luck with the next revision".

      No. Intel offered to replace any P5 with the FDIV bug upon request. Most customers did not request a replacement, but the option was available.

      Not at first they didn't.

      My friend was doing his master's on neural networks (?) at the time and some of his algorithms were giving back hinky results, especially when he compared them to some of the SPARC systems.

      He had to actually provide documentation that it affected him, and I think sign an NDA, before Intel would give him anything. He jumped through their hoops to get a replacement, and then the very next week Intel announced their carte blanche replacement program.

      It took much screaming in the industry before Intel became "generous".

    7. Re:Not all that surprising... by rrohbeck · · Score: 4, Informative

      Singular: Erratum
      Plural: Errata

  2. Problem and possible alternatives by enriquevagu · · Score: 5, Informative

    This is a real pity for the TM community. This is not the first chip with transactional memory support in hardware: The Sun Rock was announced to have hardware TM support, and the IBM Blue Gene/Q Compute chip also supports it. Unlike other proposals for unbounded transactional memory, all these systems employ Hybrid Transactional Memory (ref, ref, ref), in which restricted hardware transactions are designed to correctly coexist with unbounded software transactions, so a software transaction can be started in case a hardware transaction fails for some unavoidable issue (such as lack of cache size or associativity to hold speculative data from the transaction, not because of a conflict). Note that, in any case, very large transactions should arguably be very uncommon, since they would significantly reduce performance (similar to very large critical sections protected by locks).
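
    The capacity-driven fallback described above can be sketched as a toy, single-threaded simulation. Everything here is an assumption made for illustration: `HW_CAPACITY`, `hw_tx`, `hw_write`, `hw_commit`, and `fill_transactionally` are hypothetical names, the "hardware" is just a small buffer, and the "software transaction" is a plain loop standing in for a real STM. No actual TSX instructions are used.

    ```c
    #include <stddef.h>

    /* Hypothetical limit: how many speculative writes the simulated
       "hardware" transaction can buffer before it must abort. */
    #define HW_CAPACITY 4

    typedef struct { int *addr; int value; } spec_write;

    typedef struct {
        spec_write buf[HW_CAPACITY]; /* speculative (not yet visible) writes */
        size_t n;                    /* number of buffered writes */
        int aborted;                 /* set once capacity is exceeded */
    } hw_tx;

    /* Buffer a write; abort the transaction if the buffer is full. */
    static void hw_write(hw_tx *tx, int *addr, int value) {
        if (tx->aborted) return;
        if (tx->n == HW_CAPACITY) { tx->aborted = 1; return; } /* capacity abort */
        tx->buf[tx->n].addr = addr;
        tx->buf[tx->n].value = value;
        tx->n++;
    }

    /* Make the buffered writes visible, or discard them all on abort.
       Returns 1 on commit, 0 on abort. */
    static int hw_commit(hw_tx *tx) {
        if (tx->aborted) return 0;
        for (size_t i = 0; i < tx->n; i++) *tx->buf[i].addr = tx->buf[i].value;
        return 1;
    }

    /* Set data[0..len) to value as one "transaction".
       Returns 1 if the simulated hardware path committed,
       0 if the unbounded software path had to run instead. */
    static int fill_transactionally(int *data, size_t len, int value) {
        hw_tx tx;
        tx.n = 0;
        tx.aborted = 0;
        for (size_t i = 0; i < len; i++) hw_write(&tx, &data[i], value);
        if (hw_commit(&tx)) return 1;
        /* Software fallback: a stand-in for an STM, which would really
           log its accesses and validate them before committing. */
        for (size_t i = 0; i < len; i++) data[i] = value;
        return 0;
    }
    ```

    With these made-up numbers, filling a 3-element array stays on the hardware path, while a 10-element array overflows the buffer, leaves memory untouched at the abort, and completes on the software path.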

    The problem with a hardware implementation of transactional memory is that it is not simply a new set of instructions which are independent from the rest of the processor. HTM implies multiple aspects, including multiversion caching for speculative data; allowing for the commit of speculative (transactional) instructions, which could be later rolled back (note that in any other speculative operation such as instructions after branch prediction, the speculation is always resolved before the instruction commits because the branch commits earlier); a tight integration with the coherence protocol (see LogTM-SE for an alternative to this very last issue, but still...); a mechanism to support atomic commits in the presence of coherence invalidations... From the point of view of processor verification, this is a complete nightmare because these new "extensions" basically impact the complete processor pipeline and coherence protocol, and verifying that every single instruction and data structure behaves as expected in isolation does not guarantee that they will operate correctly in the presence of multiple transactions (and non-transactional conflicting code) in multiple cores. There are some formal studies such as this or this, and the IBM people discuss the verification of their Blue Gene TM system in this paper (paywalled).

    As some others commented before, the nature of the "bug" has not been disclosed. However, since it seems to be easy to reproduce systematically, I would expect it to be related to incorrect speculative data handling in a single transaction (or something similar), rather than races between multiple transactions.

    Regarding the alternatives, Intel cannot simply remove these instruction opcodes because previous code would fail. I assume that the patch will make all hardware transactions fail on startup, with a specific error (EAX bit 1 indicates if the transaction can succeed on a retry; setting this flag to 0 should trigger a software transaction). In that case, execution continues at the fallback routine indicated in the XBEGIN instruction, which should begin a software transaction. Effectively, this will be similar to a software TM (STM) with additional overheads (starting the hardware transaction and aborting it; detecting conflicts with nonexistent hardware transactions) that would make it slower than a pure STM implementation.
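
    A minimal sketch of that retry-or-fallback decision, written as a simulation so it builds on any machine. The abort-status bit positions match the documented RTM layout (bit 1 is the "may succeed on retry" flag), but the rest is assumed for illustration: `run_hybrid`, `tm_result`, `HW_COMMITTED`, and the three stub attempt functions are hypothetical, and `patched_cpu_attempt` simply encodes the guess above that a patched CPU aborts every transaction with the retry bit clear.

    ```c
    #include <stdint.h>

    /* Hypothetical sentinel for "the hardware attempt committed". */
    #define HW_COMMITTED   0xFFFFFFFFu
    /* Abort-status bits as laid out in EAX after an RTM abort. */
    #define ABORT_RETRY    (1u << 1)  /* transaction may succeed on a retry */
    #define ABORT_CONFLICT (1u << 2)  /* another thread touched the same data */
    #define ABORT_CAPACITY (1u << 3)  /* speculative state did not fit */

    /* One simulated hardware attempt: returns HW_COMMITTED or an abort status. */
    typedef uint32_t (*hw_attempt_fn)(void);

    typedef struct {
        int hw_attempts;   /* hardware attempts made before finishing */
        int used_software; /* 1 if the software transaction path ran */
    } tm_result;

    /* Try the hardware path up to max_attempts times; give up early when the
       retry bit is clear, then fall back to the software path. */
    static tm_result run_hybrid(hw_attempt_fn attempt, int max_attempts) {
        tm_result r = {0, 0};
        while (r.hw_attempts < max_attempts) {
            uint32_t status = attempt();
            r.hw_attempts++;
            if (status == HW_COMMITTED) return r;  /* done in hardware */
            if (!(status & ABORT_RETRY)) break;    /* retrying is pointless */
        }
        r.used_software = 1;  /* a real system would start an STM transaction */
        return r;
    }

    /* Assumed behavior of a patched CPU: abort at once, retry bit clear. */
    static uint32_t patched_cpu_attempt(void) { return 0u; }
    /* A transient conflict that is worth retrying. */
    static uint32_t conflict_attempt(void) { return ABORT_RETRY | ABORT_CONFLICT; }
    /* A working HTM with no contention. */
    static uint32_t healthy_attempt(void) { return HW_COMMITTED; }
    ```

    Under these assumptions, `run_hybrid(patched_cpu_attempt, 3)` makes exactly one wasted hardware attempt before the software path runs, which is the extra overhead relative to a pure STM; `conflict_attempt` burns all three attempts first, and `healthy_attempt` never reaches the software path.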