Errata Prompts Intel To Disable TSX In Haswell, Early Broadwell CPUs

Posted by Soulskill on Tuesday August 12, 2014 @07:09AM from the somebody-is-getting-fired dept.

Dr. Damage writes: The TSX instructions built into Intel's Haswell CPU cores haven't become widely used by everyday software just yet, but they promise to make certain types of multithreaded applications run much faster than they can today. Some of the savviest software developers are likely building TSX-enabled software right about now. Unfortunately, that work may have to come to a halt, thanks to a bug—or "errata," as Intel prefers to call them—in Haswell's TSX implementation that can cause critical software failures. To work around the problem, Intel will disable TSX via microcode in its current CPUs — and in early Broadwell processors, as well.

Not all that surprising... by K.+S.+Kyosuke · 2014-08-12 07:18 · Score: 5, Interesting

So, basically, they've just been forced to get rid of the most complex (that's why it's not all that surprising) yet also most beneficial feature with regards to server loads? I'm sure there are some Opterons laughing right now.

--
Ezekiel 23:20
1. Re:Not all that surprising... by Rockoon · 2014-08-12 07:23 · Score: 2
  
  What of the folks that purchased these chips for these specific instructions? Surely many optimization experts (...assembler gurus) are going to feel quite burned...
  
  --
  "His name was James Damore."
2. Re:Not all that surprising... by gstoddart · 2014-08-12 07:28 · Score: 5, Funny
  
  What of the folks that purchased these chips for these specific instructions?
  Same as happens to all early adopters -- the feature may or may not work, and even if it does, there's no guarantee it will be supported (or the same) in the next version.
  This is a pretty big 'errata', which is an awesome marketing speak for "really bad QA".
  Engineers Release Really Awful Tech. Awesome!
  
  --
  Lost at C:>. Found at C.
3. Re:Not all that surprising... by K.+S.+Kyosuke · 2014-08-12 07:45 · Score: 2
  
  I almost became one of those people, that's why I'm mentioning it.
  
  --
  Ezekiel 23:20
4. Re:Not all that surprising... by gman003 · 2014-08-12 08:05 · Score: 4, Informative
  
  I'm sure there are some Opterons laughing right now.
  Yes, but some of them take a while to get the joke because their TLB had to be disabled.
  (Certain releases of the "Barcelona" Opterons had a bug that could lock up the system. A workaround would prevent it, but had a stiff performance penalty. Later steppings had it fixed.)
5. Re:Not all that surprising... by ShanghaiBill · 2014-08-12 08:09 · Score: 4, Informative
  
  See also Pentium 5 and the FDIV bug. It falls under "too bad, so sad, try your luck with the next revision".
  No. Intel offered to replace any P5 with the FDIV bug upon request. Most customers did not request a replacement, but the option was available.
6. Re:Not all that surprising... by CajunArson · 2014-08-12 08:52 · Score: 2
  
  Uh.. given that sort of standard, no Android application has ever been developed since the x86 PCs that are used to develop 100% of Android applications lack practically all features of the ARM SoCs that run those applications (the only exceptions being the newer Baytrail Android tablets that are also x86).
  Also: There's a space of about a million miles between "TSX ALWAYS FAILS EVERY SINGLE TIME NO EXCEPTIONS AND CAN NEVER BE USED EVAR!!" with "Oh, we found through extensive testing that under certain conditions TSX can cause issues. Don't use it for your nuclear power plant control system, but it's perfectly fine for non-critical testing. Oh, and just to be safe, we've made a microcode update to disable it."
  
  --
  AntiFA: An abbreviation for Anti First Amendment.
7. Re:Not all that surprising... by Anonymous Coward · 2014-08-12 09:25 · Score: 2, Insightful
  
  I, for one, would love to know how your 'safe' language manages to avoid deadlocks, priority inversion, race conditions or guarantee lock-free processes on anything more complex than a singly linked list. Please enlighten me, I'm clearly ignorant.
8. Re:Not all that surprising... by CajunArson · 2014-08-12 09:54 · Score: 5, Insightful
  
  Nobody has been robbed.
  TSX today works exactly as well as TSX worked yesterday, and considering that Haswell has been on the market for over 1 year, I assure you that anybody who has been chomping at the bit to use TSX has been using TSX.
  If the TSX erratum were trivially easy to trigger, then this article would have been posted last spring before Haswell even launched.
  Intel has done the responsible thing by acknowledging the bug (trust me son, AMD & Nvidia often don't bother with that part of the process) and giving developers the OPTION to either use TSX as-is or disable it to ensure that it cannot cause instability no matter what weird operating conditions can occur.
  Tell ya what, why don't you take all your nerd-rage over to AMD or ARM where they won't rob you of all kinds of advanced features that they just don't bother to implement at all.
  
  --
  AntiFA: An abbreviation for Anti First Amendment.
9. Re:Not all that surprising... by EvilJoker · 2014-08-12 09:58 · Score: 4, Informative
  
  I know this was a troll, but I feel compelled to reply in case someone doesn't know.
  ALL CPUs have errata. Some of it more significant than others.
  A quick Google for "AMD errata" revealed Revision Guide for AMD Family 16h Models 00h-0Fh, published June 2013, and applying to AMD's Mobile A,E, and G series, and Opteron X1100/X2100 (These are modern CPUs)
  There are 21 entries, with descriptions, system impact, and suggested workaround (if any)
  Haswell's errata has 131 entries
10. Re:Not all that surprising... by Anonymous Coward · 2014-08-12 09:59 · Score: 4, Informative
  
  See also Pentium 5 and the FDIV bug. It falls under "too bad, so sad, try your luck with the next revision".
  No. Intel offered to replace any P5 with the FDIV bug upon request. Most customers did not request a replacement, but the option was available.
  Not at first they didn't.
  My friend was doing his master's on neural networks (?) at the time and some of his algorithms were giving back hinky results, especially when he compared them to some of the SPARC systems.
  He had to actually provide documentation that it affected him, and I think sign an NDA, before Intel would give him anything. He jumped through their hoops to get a replacement, and then the very next week Intel announced their carte blanche replacement program.
  It took much screaming in the industry before Intel became "generous".
11. Re:Not all that surprising... by mwvdlee · 2014-08-12 10:09 · Score: 2
  
  CPUs with TSX were first released in June 2013. Not really "early adopter" terrain any more.
  
  --
  Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
12. Re:Not all that surprising... by ghettoimp · 2014-08-12 12:08 · Score: 2
  
  The FDIV bug was actually relatively limited in scope. Quoting Wikipedia, "Though rarely encountered by average users (Byte magazine estimated that 1 in 9 billion floating point divides with random parameters would produce inaccurate results),[3] both the flaw and Intel's initial handling of the matter were heavily criticized. Intel ultimately recalled the defective processors."
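For anyone who wants to see what an "inaccurate divide" looks like in practice, here is a minimal sketch of the widely cited check for the FDIV flaw. The operand pair 4195835 / 3145727 is the commonly quoted test case, not something taken from this thread, and the function name is made up for illustration. On a correct IEEE 754 double-precision divider the residual is exactly zero; affected Pentiums were reported to return 256 for the same expression.

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y.

    With a correct divider the quotient's rounding error is small enough
    that multiplying back by y rounds to x again, so the result is 0.0.
    A non-zero result means the division itself went wrong.
    """
    return x - (x / y) * y


print(fdiv_residual())
```

Random operands almost never hit the faulty table entries, which fits the "1 in 9 billion" estimate quoted above; this particular pair is popular because it does.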
13. Re:Not all that surprising... by Anonymous Coward · 2014-08-12 12:10 · Score: 2, Informative
  
  Huh? TSX shipped with Xeon-E3 v3 CPUs. I bought one LAST YEAR so I could play around with TSX.
  Note the RTM at the end of the flags. That signals support for the new TSX instructions. RTM means "Restricted Transactional Memory", as opposed to the other half of TSX, HLE, which is a backwards compatible change in semantics.
  $ cat /proc/cpuinfo | head -n25
  processor : 0
  vendor_id : GenuineIntel
  cpu family : 6
  model : 60
  model name : Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
  stepping : 3
  microcode : 0x10
  cpu MHz : 800.000
  cache size : 8192 KB
  physical id : 0
  siblings : 8
  core id : 0
  cpu cores : 4
  apicid : 0
  initial apicid : 0
  fpu : yes
  fpu_exception : yes
  cpuid level : 13
  wp : yes
  flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm
  bogomips : 6585.24
  clflush size : 64
  cache_alignment : 64
  address sizes : 39 bits physical, 48 bits virtual
  power management:
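The check described above (look for "hle" and "rtm" in the flags line) can be sketched in a few lines of Python. This is only an illustration: the function name tsx_flags is hypothetical, it assumes the Linux /proc/cpuinfo layout shown in the dump, and whether the flags disappear after a microcode update is not something this thread establishes.

```python
def tsx_flags(cpuinfo_text):
    """Report which TSX-related flags appear in /proc/cpuinfo-style text.

    Only lines whose key is exactly "flags" are inspected, so a model
    name that happens to contain "rtm" does not count.
    """
    found = {"hle": False, "rtm": False}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags = set(value.split())
            for name in found:
                found[name] = found[name] or name in flags
    return found


# Example with a shortened flags line; on a real Linux box you would pass
# open("/proc/cpuinfo").read() instead.
print(tsx_flags("flags : fpu sse2 hle avx2 rtm\n"))
```

Note there is no flag literally named "tsx"; the two halves of TSX are reported separately.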
14. Re:Not all that surprising... by Sun · 2014-08-12 15:56 · Score: 3, Informative
  
  I have a friend who came to me, eyes all glowing, about this new feature his shining new CPU has. I listened in and was skeptical.
  He then tried, for over a month, to get this feature to produce better results than traditional synchronization methods. This included a lot of dead ends due to simple misunderstandings (try to debug your transaction by adding prints: no good - a system call is guaranteed to cancel the transaction).
  We had, for example, a lot of hard times getting proper benchmarks for the feature. Most actual use cases include a relatively low contention rate. Producing a benchmark that will have low contention on the one hand, but allow you to actually test how efficient a synchronized algorithm is on the other is not an easy task.
  After a lot of going back and forth, as well as some nagging to people at Intel (who, surprisingly, answered him), he came to the following conclusion (shared with others):
  Many times a traditional mutex will, actually, be faster. Other times, it might be possible to gain a few extra nanoseconds using transactions, but the speed difference is, by no means, mind blowing. Either way, the amount you pay in code complexity (i.e. bugs) and reduced abstraction hardly seems worth it.
  At least as it is implemented right now (but I, personally, fail to see how this changes in the future. Then again, I have been known to miss things in the past), the speed difference isn't going to be mind blowing.
  Shachar
15. Re:Not all that surprising... by rrohbeck · 2014-08-12 20:06 · Score: 4, Informative
  
  Singular: Erratum
  Plural: Errata
  
  --
  thegodmovie.com - watch it
16. Re:Not all that surprising... by TheRaven64 · 2014-08-12 21:02 · Score: 3, Informative
  
  It depends a lot on the data structures. There were a number of papers using TSX at EuroSys this year. The main conclusion was that TSX lets you get similar performance from simple approaches as you can get already from complex approaches. For example, you can protect a long linked list in a single lock and use HLE to get a big speedup with lots of concurrent insertions and accesses, but you can achieve similar performance with a fine-grained locking scheme. There was a nice paper about Cuckoo hashing where they initially found that TSX gave them a performance win, but then were able to get a similar speedup without it.
  The big win with TSX is that it's pretty easy to reason about coarse-grained locking and much harder to reason about fine-grained locking. If you can make coarse-grained locking almost as fast as fine-grained, then that's a huge saving on testing and debugging time.
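To make the "single lock around a linked list" case concrete, here is a minimal sketch of the coarse-grained version. It is plain Python, so there is no lock elision happening here; the class name CoarseLockedList and its methods are made up for illustration. The point is only how little there is to reason about: every operation takes the same lock, and that one lock is what HLE would try to elide on hardware that supports it.

```python
import threading


class CoarseLockedList:
    """Sorted singly linked list guarded by one lock (coarse-grained)."""

    class _Node:
        __slots__ = ("value", "next")

        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def __init__(self):
        self._head = None
        self._lock = threading.Lock()  # the single lock for the whole list

    def insert(self, value):
        with self._lock:
            if self._head is None or value <= self._head.value:
                self._head = self._Node(value, self._head)
                return
            node = self._head
            while node.next is not None and node.next.value < value:
                node = node.next
            node.next = self._Node(value, node.next)

    def contains(self, value):
        with self._lock:
            node = self._head
            while node is not None and node.value < value:
                node = node.next
            return node is not None and node.value == value

    def to_list(self):
        with self._lock:
            out, node = [], self._head
            while node is not None:
                out.append(node.value)
                node = node.next
            return out
```

A fine-grained variant would need a lock per node and hand-over-hand acquisition, which is where the extra testing and debugging time goes.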
  
  --
  I am TheRaven on Soylent News
17. Re:Not all that surprising... by K.+S.+Kyosuke · 2014-08-12 23:08 · Score: 2
  
  It has always been my understanding that HTM may not necessarily increase execution performance (outright), but always offers one huge win in terms of operation composability, which is something that individual locks are never going to have. In other words, even if it doesn't make identical programs faster, it ought to make the programming process faster, which is what modern programming seems to be about. An interesting question is what percentage of performance increase can one expect from significant restructurings of complex programs. I have no answer to that, though. (But the things you were saying seem to indicate to me that you haven't explored that path, and I'm not really sure why you claim that this increases code complexity and decreases abstraction when it is really the purpose of this HW design to work in the opposite way, at least once compilers and application libraries will be able to deal with the feature. Have RDBMS people ever complained that dealing with transactions is less abstract than dealing with low-level locks manually?)
  
  --
  Ezekiel 23:20
18. Re:Not all that surprising... by AmiMoJo · 2014-08-13 00:43 · Score: 2
  
  The law in most European countries requires that defective products be replaced. If a feature was advertised but doesn't work the vendor (not the manufacturer) can either replace it with one that does work or give a refund. The refund can either be full or partial, negotiated with the buyer and depending on how useful the product is without that feature.
  If I had one of these chips I'd be looking for a full refund or replacement with a fixed version as soon as a fix was available.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Well, we call them... by ThatsDrDangerToYou · 2014-08-12 07:22 · Score: 2, Funny

"Featurata"
1. Re:Well, we call them... by wonkey_monkey · 2014-08-12 08:17 · Score: 3, Funny
  
  It's okay, Intel are setting up a new subdivision to undo these problems. And to maximise employee happiness, it's being built in the Canary Islands.
  I think I'd enjoy being a Featurata Reverter in Fuertaventura.
  
  --
  systemd is Roko's Basilisk.
Can I have a refund? by Anonymous Coward · 2014-08-12 07:23 · Score: 2, Informative

In some countries I would be entitled to get the product that was advertised or get a refund.
1. Re:Can I have a refund? by Rashdot · 2014-08-12 08:33 · Score: 2
  
  Of course. According to my Pentium you're entitled to $0.99989960954
  
  --
  This is not the sig you're looking for.
a bug != errata by Ecuador · 2014-08-12 07:25 · Score: 3, Insightful

You either say "bugs - or errata" or "a bug - or erratum", since bug is singular and errata plural. At least the error - or "erratum" (see what I did here) in this case was in TFA and not introduced in the /. summary.

--
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
So how does one find out /apply "fix" with linux? by Ungrounded+Lightning · 2014-08-12 08:08 · Score: 2

It would have been nice if TFA had told us what chips were affected, or how to determine that, rather than saying "haswell" and expecting everybody reading it to do their own research.
I just spent ten minutes looking around the web, trying to determine if the processor in my laptop is one of those affected - preparatory to perhaps trying to figure out, if it is, how to apply the "disable the broken feature" fix - without installing windows - to avoid the memory corruption bogeyman if somebody distributes software that uses, or abuses the feature.
No joy. The documentation seems to say that:
- Core i7 is Haswell
- TSX is NOT supported on versions up to something BEFORE the processor version in my laptop (i7-4700MQ)
- But the descriptions of that processor I've found so far don't say, one way or another, whether it does or doesn't have TSX. B-b
The "flags" field in /proc/cpuinfo doesn't include a "tsx". But would it?
Can anyone tell us a simple way to check?

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Re:So how does one find out /apply "fix" with linu by heezer7 · 2014-08-12 08:20 · Score: 2

Check the Intel ARK page for your model number Ex: http://ark.intel.com/products/...
Re:So how does one find out /apply "fix" with linu by cheese_boy · 2014-08-12 08:34 · Score: 2

Can anyone tell us a simple way to check?
Intel has on their website info on the processors.
For example, for yours (i7-4700mq) you would look at:
http://ark.intel.com/products/75117/Intel-Core-i7-4700MQ-Processor-6M-Cache-up-to-3_40-GHz
Or you can look for all products that were "formerly haswell":
http://ark.intel.com/products/codename/42174/Haswell#@All
how to apply the "disable the broken feature" fix - without installing windows
I would do some searches for updating BIOS from linux - ex:
https://wiki.archlinux.org/index.php/Flashing_BIOS_from_Linux
Or doing a microcode update:
https://wiki.archlinux.org/index.php/Microcode
Until there is a chip for sale that really supports TSX I wouldn't expect anyone to be distributing software that uses it. So I wouldn't be too worried about it yet.
Re:So how does one find out /apply "fix" with linu by Anonymous Coward · 2014-08-12 08:35 · Score: 3, Informative

Wikipedia has very detailed information on Intel processors. This page does not list TSX for your processor and does list it for others.
Most Linux distros automatically handle Intel microcode patches (which I assume is how this errata will be handled). See Debian wiki or Arch wiki for details.
Re:So how does one find out /apply "fix" with linu by BitZtream · 2014-08-12 08:36 · Score: 2

ARK is your friend if you don't have the CPU. dmesg, kernel boot showing feature flags, or CPU-id or whatever the windows app is will all tell you what your CPU supports.
Your Linux box will probably just have an update with new microcode for the issue and you'll never need to know anything about it, or it will fiddle with the cpu flags to show it as disabled anyway.
Basically 'if you don't know, it doesn't affect you'

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Re:Bought a 4770 instead of 4770K because of TSX by CajunArson · 2014-08-12 09:43 · Score: 3, Informative

You can still "play with this instruction" all you want.
What happened here is that a third party developer managed to uncover a corner case where certain interactions with TSX can lead to instability. In order to be safe, Intel acknowledged the bug (a refreshing response) and is now giving you the OPTION to disable TSX if you feel that it could impinge the stability of a production load.
So basically: Go ahead and play with TSX all you want, but be aware of the errata and that it's theoretically possible to hang your machine in some corner cases.

--
AntiFA: An abbreviation for Anti First Amendment.
Problem and possible alternatives by enriquevagu · 2014-08-12 12:28 · Score: 5, Informative

This is a real pity for the TM community. This is not the first chip with transactional memory support in hardware: The Sun Rock was announced to have hardware TM support, and the IBM Blue Gene/Q Compute chip also supports it. Unlike other proposals for unbounded transactional memory, all these systems employ Hybrid Transactional Memory (ref, ref, ref), in which restricted hardware transactions are designed to correctly coexist with unbounded software transactions, so a software transaction can be started in case a hardware transaction fails for some unavoidable issue (such as lack of cache size or associativity to hold speculative data from the transaction, not because of a conflict). Note that, in any case, very large transactions should arguably be very uncommon, since they would significantly reduce performance (similar to very large critical sections protected by locks).
The problem with the hardware implementation of transactional memory is that they are not simply a new set of instructions which are independent from the rest of the processor. HTM implies multiple aspects, including multiversioning caching for speculative data; allowing for the commit of speculative (transactional) instructions, which could be later rolled back (note that in any other speculative operation such as instructions after branch prediction, the speculation is always resolved before instruction commits because the branch commits earlier); a tight integration with the coherence protocol (see LogTM-SE for an alternative to this very last issue, but still...); a mechanism to support atomic commits in presence of coherence invalidations... From the point of view of processor verification, this is a complete nightmare because these new "extensions" basically impact the complete processor pipeline and coherence protocol, and verifying that every single instruction and data structure behaves as expected in isolation does not guarantee that they will operate correctly in presence of multiple transactions (and non-transactional conflicting code) in multiple cores. There are some formal studies such as this or this, and the IBM people discuss the verification of their Blue Gene TM system in this paper (paywalled).
As some others commented before, the nature of the "bug" has not been disclosed. However, since it seems to be easy to reproduce systematically, I would expect it to be related to incorrect speculative data handling in a single transaction (or something similar), rather than races between multiple transactions.
Regarding the alternatives, Intel cannot simply remove these instruction opcodes because previous code would fail. I assume that the patch will make all hardware transactions fail on startup, with a specific error (EAX bit 1 indicates if the transaction can succeed on a retry; setting this flag to 0 should trigger a software transaction). In such case, execution continues at the fallback routine indicated in the XBEGIN instruction, which should begin a software transaction. Effectively, this will be similar to a software TM (STM) with additional overheads (starting the hardware transaction and aborting it; detecting conflicts with nonexistent hardware transactions) that would make it slower than a pure STM implementation.
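The retry-or-fall-back decision described above can be sketched as follows. This is a Python model of the control flow only, not real RTM code: try_hardware and software_fallback are hypothetical callables, and COMMITTED is a made-up marker standing in for "the hardware transaction committed". The bit positions mirror the _XABORT_* macros from immintrin.h. If the microcode patch behaves as guessed above (every transaction aborts with the retry bit clear), the loop exits after a single attempt and the software path runs.

```python
# Abort-status bits, mirroring the _XABORT_* macros in immintrin.h.
XABORT_EXPLICIT = 1 << 0  # aborted by an explicit XABORT
XABORT_RETRY = 1 << 1     # transaction may succeed if retried
XABORT_CONFLICT = 1 << 2  # another logical processor touched the same data
XABORT_CAPACITY = 1 << 3  # transactional buffering overflowed

COMMITTED = None  # hypothetical marker: the hardware attempt committed


def run_transaction(try_hardware, software_fallback, max_attempts=3):
    """Try the hardware path a few times, then fall back to software.

    try_hardware() returns COMMITTED on success or an abort status word.
    The hardware path is retried only while the status has the retry bit
    set; any other abort goes straight to the software transaction.
    """
    for _ in range(max_attempts):
        status = try_hardware()
        if status is COMMITTED:
            return "hardware"
        if not status & XABORT_RETRY:
            break  # retrying cannot help, e.g. status 0 when TSX is disabled
    software_fallback()
    return "software"
```

With TSX disabled this way, every call pays for one failed hardware attempt before doing the software work, which is the extra overhead relative to a pure STM mentioned above.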
Phonology vs. morphology: compare "data" by tepples · 2014-08-13 00:52 · Score: 2

That's different. I'll explain for the benefit of ESLers reading Slashdot:
The use of "a" or "an" in modern English is always conditioned by the phonology. The rule is that "an" becomes "a" when followed by a phoneme with a sonority below "vowel". Hence "a hedgehog" in standard or "an hedgehog" (pronounced "an edge Ogg") in voiced-aitch dialects such as Cockney. I've seen only one consistent exception to this rule: "an hero" referring to one who commits suicide, which retains "an" even in voiceless-aitch dialects.
By contrast, the reanalysis of a plural first as a mass noun and eventually as a singular referring to the collection is closer to morphology. The behavior of "errata" has loosely paralleled that of "data", which has already become a mass noun taking a singular (such as "the data is..."), with "datum" having become archaic in favor of "data point" or "piece of data". The step after a mass noun is a collective, which can lead to a double plural; "erratas" refers to what would be called "collections of errata" under the older convention.