Slashdot Mirror


Intel Skylake Bug Causes PCs To Freeze During Complex Workloads (arstechnica.com)

chalsall writes: Intel has confirmed an in-the-wild bug that can freeze its Skylake processors. The company is pushing out a BIOS fix. Ars reports: "No reason has been given as to why the bug occurs, but it's confirmed to affect both Linux and Windows-based systems. Prime95, which has historically been used to benchmark and stress-test computers, uses Fast Fourier Transforms to multiply extremely large numbers. A particular exponent size, 14,942,209, has been found to cause the system crashes. While the bug was discovered using Prime95, it could affect other industries that rely on complex computational workloads, such as scientific and financial institutions. GIMPS noted that its Prime95 software "works perfectly normal" on all other Intel processors of past generations."
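
The summary does not say which computation was running when the freeze occurred. Assuming it was the Lucas-Lehmer test that GIMPS has long used to check Mersenne candidates 2^p - 1, the workload is a long chain of modular squarings, which is where the FFT-based multiplication comes in. A minimal sketch for small exponents, using Python's built-in big integers rather than an FFT multiply (the function name is made up for illustration):

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime, for a prime exponent p."""
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p > 2
    m = (1 << p) - 1  # the Mersenne number being tested
    s = 4
    # p - 2 rounds of "square, subtract 2, reduce mod m".
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for exponent in (3, 5, 7, 11, 13):
        print(exponent, lucas_lehmer(exponent))
```

For an exponent near 14,942,209 each squaring involves numbers millions of digits long, so a real implementation replaces `s * s` with an FFT-based multiply; this sketch is far too slow for that range.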

122 comments

  1. Deja Voo of the Pentium 5 FDIV bug by xmas2003 · · Score: 4, Insightful

    Old-timers will remember the Pentium 5 FDIV bug where the chip sometimes yielded incorrect results for complex mathematical calculations.

    --
    Hulk SMASH Celiac Disease
    1. Re:Deja Voo of the Pentium 5 FDIV bug by Junta · · Score: 5, Informative

      Well 'Deja Vu' and you can leave '5' off.

      For an analogous screw-up, you only need to look at Haswell/Broadwell and the TSX feature, which they retroactively disabled due to a defect.

      The FDIV was noteworthy because the state of things was such that they didn't have much recourse other than replacing the processors. We haven't seen a defect such that processors had to be physically recalled at such scale since, though there have been a number of issues that would have been similarly disastrous if not for the fact they could push a microcode change to disable something or work around it...

      --
      XML is like violence. If it doesn't solve the problem, use more.
    2. Re:Deja Voo of the Pentium 5 FDIV bug by Anonymous Coward · · Score: 0

      The F00F bug would be closer. FDIV bug didn't directly hang the processor.

    3. Re:Deja Voo of the Pentium 5 FDIV bug by ColdWetDog · · Score: 2, Interesting

      Nah, we blame this one on the NSA, to wit:

      It only happens when running complex calculations like Mersenne primes. Who runs such calculations? It isn't the good citizens looking at their Facebook whatever it is that they look at. It's people doing crypto, ie, Terrorists.

      So how do we stop Terrorists? Don't let them do complex crypto calculations.

      QED.

      --
      Faster! Faster! Faster would be better!
    4. Re:Deja Voo of the Pentium 5 FDIV bug by 110010001000 · · Score: 4, Informative

      All processors have bugs. Some are fixed and some are not. You can obtain errata sheets from the manufacturers. At least this one is easily fixable.

    5. Re: Deja Voo of the Pentium 5 FDIV bug by WarJolt · · Score: 1

      It's not a bug. It's a "specification update". Get it right. Clearly you were using the wrong specification.

    6. Re:Deja Voo of the Pentium 5 FDIV bug by Junta · · Score: 1

      Probably the TSX problems are closer in some respects, in that the fix comes in microcode. With F00F, the OSes had to work around the issue one way or another.

      --
      XML is like violence. If it doesn't solve the problem, use more.
    7. Re:Deja Voo of the Pentium 5 FDIV bug by serviscope_minor · · Score: 3, Funny

      Old-timers will remember the Pentium 5 FDIV bug

      5? That was the 80 4.999999583694 86 processor was it not?

      --
      SJW n. One who posts facts.
    8. Re:Deja Voo of the Pentium 5 FDIV bug by ickleberry · · Score: 1

      The world needs more F00F bugs, for the name alone

    9. Re:Deja Voo of the Pentium 5 FDIV bug by Anonymous Coward · · Score: 0

      And...it's probably a good thing...because a hung processor that doesn't continue to spit out bad calculations is by far preferable to one that just continues on its own merry way as if nothing is going wrong.

    10. Re:Deja Voo of the Pentium 5 FDIV bug by PRMan · · Score: 2

      Actually, it was allegedly found by overclockers leaving their systems running Prime95 for extended periods (who else runs Prime95)?

      --
      Peter predicted that you would "deliberately forget" creation 2000 years ago...
    11. Re:Deja Voo of the Pentium 5 FDIV bug by BarbaraHudson · · Score: 2

      The real problem with the FDIV bug is in how Intel handled it - they refused to replace an admittedly defective part unless you could show that you specifically were affected. Betting for a repeat here.

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    12. Re:Deja Voo of the Pentium 5 FDIV bug by whit3 · · Score: 2

      The real problem with the FDIV bug is in how Intel handled it - they refused to replace an admittedly defective part unless ...

      Well, that was the first response. Eventually, though, they bit the bullet

      "Monday, December 19 [1994] we changed our policy completely. We decided to replace anybody's part who wanted it replaced... replacing people's chips by the hundreds of thousands... We created a service network to handle the physical replacement for people who didn't want to do it themselves."

      -- from Only the Paranoid Survive, Andrew S. Grove, 1996

      It was estimated this cost Intel $475 million.

    13. Re: Deja Voo of the Pentium 5 FDIV bug by Anonymous Coward · · Score: 0

      That's my general feeling over this, what if a buffer is filling up sequentially and then causes the panic/halt.

      It seems it wasn't detected under normal use, this could be evidence of a glitch in a backdoor impl..

    14. Re:Deja Voo of the Pentium 5 FDIV bug by Zero__Kelvin · · Score: 1

      Well that depends on the situation and application code really. If the software in question is an RTOS in a life critical application, and the accounting error doesn't affect the life sustaining part of the program, then it might very well be deadly for it to halt and might actually allow a person to continue living if the code is allowed to continue after deriving the erroneous result.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    15. Re:Deja Voo of the Pentium 5 FDIV bug by Anonymous Coward · · Score: 0

      >(who else runs Prime95)?

      Terrorists.

    16. Re:Deja Voo of the Pentium 5 FDIV bug by BarbaraHudson · · Score: 2

      Sure, they eventually caved - but this was only after chip yields rose. The prospect of a class action forcing them to pay out the full price of every chip sold during the low-yield period at the cost originally paid would have been a LOT more (it would have also been based on chips sold, not just people who actually filed a claim).

      --
      "Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
    17. Re:Deja Voo of the Pentium 5 FDIV bug by Anonymous Coward · · Score: 1

      They were only looking for weapons of math destruction.

    18. Re:Deja Voo of the Pentium 5 FDIV bug by Anonymous Coward · · Score: 1

      Well 'Deja Vu' and you can leave '5' off.

      For an analogous screw-up, you only need to look at Haswell/Broadwell and the TSX feature, which they retroactively disabled due to a defect.

      The FDIV was noteworthy because the state of things was such that they didn't have much recourse other than replacing the processors. We haven't seen a defect such that processors had to be physically recalled at such scale since, though there have been a number of issues that would have been similarly disastrous if not for the fact they could push a microcode change to disable something or work around it...

      That's because after FDIV, they put in a shit ton of work developing survivability features so that problems could be worked around. This is a good thing.

    19. Re:Deja Voo of the Pentium 5 FDIV bug by ATMAvatar · · Score: 1

      Don't divide: Intel Inside.

      --
      "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."
    20. Re:Deja Voo of the Pentium 5 FDIV bug by Anonymous Coward · · Score: 0

      Actually, that was pretty great, due to the policy there was a glut of cheap P90s on the market that were just fine for gaming, yet businesses did not want...

    21. Re:Deja Voo of the Pentium 5 FDIV bug by lsatenstein · · Score: 1

      Old-timers will remember the Pentium 5 FDIV bug where the chip sometimes yielded incorrect results for complex mathematical calculations.

      Does the following make sense?
      The engineers brought back the above code because the people who knew about it, and why it should not be used, had retired. This retirement situation allowed for its re-introduction. No, Intel will not be accepting returns for Skylake. It will be a microcode patch. The microcode patch is a backdoor input to the CPU to allow fixing instructions and breaking security.

      --
      Leslie Satenstein Montreal Quebec Canada
    22. Re:Deja Voo of the Pentium 5 FDIV bug by Anonymous Coward · · Score: 0

      That's because after FDIV, they put in a shit ton of work developing survivability features so that problems could be worked around. This is a good thing.

      What slow-down and additional power consumption do these "features" cause (compared to pure chips built right in the first place)?

    23. Re:Deja Voo of the Pentium 5 FDIV bug by 0xdeaddead · · Score: 1

      did you see the price tag of those HP or Compaq desktops? (Big) Businesses (of the 90s) don't do clones.

      And even if it meant we had some PhD billing us at $500 an hour, you can bet since he was a consultant he had the 386sx16, and maybe after a few months of begging we got him a 387sx. Pentium??? LOL

  2. Lack of competition by sinij · · Score: 0, Troll

    Too bad AMD is out of the PC CPU race and Intel will go unpunished for such a major flaw.

    1. Re:Lack of competition by Moof123 · · Score: 5, Insightful

      If you saw the actual errata list for processors on launch day, regardless of manufacturer, your jaw would drop. A lot of nasties get cleaned up on subsequent revisions (mask changes), but in the meantime patches show up for the BIOS, libraries, and compilers so that the user never sees the warts. With Billions of transistors there will be design errors that even intel will not catch during verification or characterization. The fact that a BIOS fix will take care of it is a sign that it is not that egregious.

      If you want to avoid this kind of stuff you should wait a few months after any major shakeup to buy.

    2. Re:Lack of competition by Moof123 · · Score: 5, Informative

      Go see page 21 for example:
      http://www.intel.com/content/d...

    3. Re:Lack of competition by Billly+Gates · · Score: 1

      Like software one should wait until after the product has had a revision 1st.

      Oddly we think of intel cpus and chipsets as rock solid and operating systems as garbage based on Vista, ME, and 8.1. Perhaps doing the same and buying older hardware would be wise too.

      I am disappointed in my Gigabyte board, for example, and the same goes for Asus, from when Z97 Haswell was new. Both are top brands but were extremely unstable and buggy. The Asus Sabertooth is unusable, and the Gigabyte got somewhat stable after 4 updates.

    4. Re:Lack of competition by Anonymous Coward · · Score: 0

      unpunished

      Do you have any idea what the validation process at any company that size is like? They may as well be playing Yakity Sax in the background the entire time, and meanwhile Marketing is breathing down their necks because they want to start shipping to OEMs as soon as possible. Furthermore modern CPUs/SoCs are so complex that it's virtually impossible to test every single possible operating condition, not without delaying official launch until it's virtually outdated. Very often OEMs are the ones reporting previously unheard-of bugs.

    5. Re:Lack of competition by sinij · · Score: 2

      Surprising, I expected in-silicone code to be more robustly tested prior to getting released. Turns out, code is code.

    6. Re:Lack of competition by Anonymous Coward · · Score: 2, Interesting

      Everything is getting faster. Development cycles are getting shorter, schedules are getting tighter, margins are being trimmed down and testing is taking some of that hit. Software is already brutally paced to the point that customers are now performing QA. We're having to train our customers how to use Bugzilla and we somehow accept this as "Ok". Eventually the pacing will become so brutal that version 2 won't even use the same codebase as version 1. Posting bugs will become useless. Software development velocity is such that no-one wants to write long-lived code anymore.

      Once hardware reaches this breakneck prototyping velocity it's going to be the same thing. Defects will become more common. Revisions will become more common. Just hope they don't tell us to change out the mobo each time or we'll never get anything working. Even if the time between revisions stays the same the complexity is going up and I'd expect they're pulling all-nighters just to keep pace. Risk goes up accordingly.

    7. Re:Lack of competition by Anonymous Coward · · Score: 0

      You think testing is a magic wand that eliminates all bugs?

    8. Re:Lack of competition by epine · · Score: 1

      The fact that a BIOS fix will take care of it is a sign that it is not that egregious.

      For a given value of performance expectation, as purchased.

      One might be a little bit cheesed to discover that the entire hardware floating point subsystem has been replaced with an on-chip emulator, which additionally wires down half of your L2 cache to host the microcode execution vectors and/or byte codes.

      In the spirit of good will and transparency, I hope to see Intel recirculate the original sample chips to all the hardware review websites (whose benchmarks are still found all over the internet) so that these websites can all update their benchmarks (and conclusions, if necessary) to the new Skylake post-BIOS performance reality.

      Admittedly, it's not a large hope.

    9. Re:Lack of competition by grimmjeeper · · Score: 1

      You think the processor companies have the time or budget to do exhaustive testing?

    10. Re:Lack of competition by Anonymous Coward · · Score: 0

      You're right, they don't, considering exhaustive testing takes billions of years.
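
      As a rough back-of-the-envelope check on that claim, consider only one instruction that takes two 64-bit operands, and assume (purely for illustration) that every possible input pair could be tried at 4 billion tests per second. The rate and the function name below are assumptions, not figures from the thread:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_enumerate(input_bits: int, tests_per_second: float) -> float:
    """Years needed to try every possible input of the given bit width once."""
    return (2 ** input_bits) / tests_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Two 64-bit operands give 128 input bits; 4e9 tests/s is an assumed rate.
    print(f"{years_to_enumerate(128, 4e9):.2e} years")
```

      Under those assumptions the answer is on the order of 10^21 years for a single instruction, before considering internal state or instruction sequences, so "billions of years" is if anything an understatement.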

    11. Re:Lack of competition by Anonymous Coward · · Score: 0

      budget, sure, especially considering they are selling cores for over $300 a pop, and the latest-greatest at over $1500 a pop. (not to mention that intel has a habit of selling defective high-end chips as mid-range chips if they still meet the lower tier's specs, regardless of whether that is cost-effective or not)

      time is relative with the right equipment, and a cpu manufacturer should have plenty of computing ability to get any task accomplished.

    12. Re: Lack of competition by guruevi · · Score: 1

      I find neither Gigabyte nor Asus to be "top" motherboard manufacturers. At best they are premium value boards (cheap boards with some premium features enabled). I have found them consistently to be buggy and sometimes even outright useless. The last time I bought them, I actually returned an Asus board because it 'supported' ECC RAM but didn't actually implement it (simply disabled it).

      I buy SuperMicro boards, not always on the edge but consistently configurable and very good support if any bugs do arise. I've had decent luck with Via boards way back in the day and MSI/Tyan as well.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    13. Re:Lack of competition by grimmjeeper · · Score: 2

      The only difference between the low end and high end chips is the number of flaws in the core coming off the die. It's impossible to get a consistent yield on a wafer. Minor electrical variances, impurities in the materials, flaws in the machines that do the manufacturing, etc. The chip maker has to test each and every chip that is produced to sort them into a wide variety of performance bins. The ones that have the fewest flaws and can run the fastest get put in the most expensive bins. The ones with flaws in the cache and inoperative cores get dumped in the cheap bin. And everything in between.

      So really, they only have to test one design to root the bugs out. And the test applies to all of the grades of chips coming off the line.

      Even so, it's impossible to fully test the chip before it goes to market. So they have to decide to test it to a "good enough that we can patch it in BIOS or software patches" level.

    14. Re:Lack of competition by antdude · · Score: 2

      That is why I never buy the (new/lat)est stuff. I'll get the old and more stable stuff.

      --
      Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
    15. Re:Lack of competition by Anonymous Coward · · Score: 0

      What the heck is "in-silicone" code? Code inside of breast implants?

      Or perhaps you meant silicon, which is a radically-different substance altogether and commonly used for integrated circuit production.

    16. Re:Lack of competition by Anonymous Coward · · Score: 0

      a couple of those really stick out like sore thumbs...

      Warm Reset May Fail or Lead to Incorrect Power Regulation -- does that mean we can't reboot?

      Accessing Physical Memory Space 0-640K through the Graphics Aperture May Cause Unpredictable System Behavior -- really intel? after all these years, we can get systems with seemingly endless amounts of memory, but your graphics card can't touch the "base 640k"??? give me a break.

      Unpredictable Operation at Turbo Frequencies Above 4.0 GHz -- the 4790k's stock clock _is_ 4ghz. does that mean we're not allowed to turbo?

    17. Re:Lack of competition by Anonymous Coward · · Score: 0

      Just hope they don't tell us to change out the mobo each time

      I thought they did. Wasn't there a bug in the chipset that causes errors in SATA 2 ports? The only fix was to exchange a motherboard with the B3 revision of the chipset.

    18. Re:Lack of competition by ChrisMaple · · Score: 1

      Simulation testing is very difficult. It is many orders of magnitude slower than the actual device. At some point, you have to ask "should we do 2 more months of simulation on this or just spend a million $ or so to fabricate some samples with the newest tiny geometry?" So you fabricate, find 50 errors missed in simulation, fix those and start simulations again. Fabricate again (whoops! there goes another million) and find that there are flaws caused by the fixes, flaws hidden by the previous flaws, newly discovered flaws, and yet there will still be flaws that won't be easily found, or found soon.

      With each new fabrication the pressure builds to market a chip that at least has known bugs that can be worked around. Customers want faster chips and more features, and they're somewhat willing to work around bugs to have what they want. Rather than wait for another multi-month cycle that angers fans and customers and costs money, the manufacturer ships and crosses its fingers. We're human beings, and we're doing the best we can, which is very good.

      --
      Contribute to civilization: ari.aynrand.org/donate
    19. Re: Lack of competition by toddestan · · Score: 1

      I've found Gigabyte to be okay, but I've never understood why people like Asus so much. Their stuff is way too flaky and unreliable to command the premium prices you'll pay for it. It's too bad that Intel stopped making motherboards (at least ones in standard form factors). They generally weren't terribly friendly to overclockers and could be a bit conservative on what settings they exposed but they tended to be pretty stable and well supported.

    20. Re:Lack of competition by Anonymous Coward · · Score: 0

      Customers want faster chips and more features, and they're somewhat willing to work around bugs to have what they want.

      Skylake isn't much faster. Practically no reason to upgrade. We'll see some whines about ever-decreasing sales. But the customer isn't at fault. No faster, new bugs to be ironed out. With more time/money dumped into development, something worth upgrading to may have been achieved, with not too many new issues.

  3. F00Fies strike again? by Anonymous Coward · · Score: 0

    It doesn't matter how complex the task needs to be to trigger the bug. Once it's there, it's there.

  4. Nothing to worry about by RDW · · Score: 1, Offtopic

    It probably just means the NSA is already using your processor's compute capacity as part of their vast decryption botnet. The fix should improve resource management so you won't notice it in future.

  5. Things too complex by Anonymous Coward · · Score: 0

    Ok, they missed some test vector. Quite bad, but at this complexity level not so surprising.
    Will they resolve this by microcode update at the cost of some instruction(s) being slower? Let's bet.

  6. Not a showstopper by Anonymous Coward · · Score: 0

    A minor bug causes processor to not-process.

    Intel is suggesting a tentative fix as follows:
    - Locate the power button.
    - Switch the computer off.
    - Go outside and play badminton.

  7. Secret Decoder Ring by Anonymous Coward · · Score: 0

    I no longer have access to an Intel code name Secret decoder ring. Can someone me what the public marketing name is for Skylake?

    1. Re:Secret Decoder Ring by Anonymous Coward · · Score: 0

      tell me, that is....

    2. Re:Secret Decoder Ring by Anonymous Coward · · Score: 0

      6th-generation Core

    3. Re:Secret Decoder Ring by Anonymous Coward · · Score: 1

      Thanks, that got me one step closer, and then I found these on the Intel website:

      http://www.intel.com/content/www/us/en/processors/core/core-i3-processor.html?wapkw=skylake

      http://www.intel.com/content/www/us/en/test/manju-test/core-i5-processor.html?wapkw=skylake

      So, it looks like the processor names (what I can find in the system specs) are i3-6x00 and i5-6x00, etc.

      I don't have anything that new, so I am OK.

  8. When hardware must just work by Puff_Of_Hot_Air · · Score: 5, Informative

    This is a really interesting talk from 32c3 detailing the challenges involved in designing and verifying something as complex as a CPU, where it can only be simulated at 1 Hz and costs 5 million to produce silicon for testing. https://www.youtube.com/watch?v=eDmv0sDB1Ak. The level of difficulty in getting this right just blows my mind. If it weren't for economies of scale, CPUs would be completely out of reach. Also interesting in the talk is the vast number of CPU defects that are found and cataloged that most people appear to be unaware of. Most are of little importance (and hence don't get fixed), and some are fixed via code (as in this case), but there is no guarantee that these are being patched by OEMs.

    1. Re:When hardware must just work by mikael · · Score: 1

      I know the 720p version of this movie would send one Intel multi-core CPU into shutdown. That was with 3D TV and an NVidia 3D Vision setup. The same graphics boards and display had no problem with another motherboard/CPU combination. Still wondering whether it was the CPU or the cooling. No problems with anything GPU related.

      http://www.3dtv.at/Movies/Skyd...

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
    2. Re:When hardware must just work by Moof123 · · Score: 5, Interesting

      I work on ASIC design, though I am on the analog side of things. There are more people doing verification than design by roughly 2:1. I am told that in the smaller nodes and more complex designs the ratio is even higher. Basically you can slap down some RTL code (Verilog or VHDL) quickly, but torturing it through all exceptions is very hard. Then you have to synthesize and build it, which can introduce all sorts of timing and parasitic kinds of problems that have to be double checked. Finally, test vectors have to be created to double check the functionality of every transistor in the design to assure that what was built matches the masks.

      It is truly phenomenal that anything with billions of gates ever works at all, let alone with the high yield and relatively low error count we have come to expect.

    3. Re:When hardware must just work by tlhIngan · · Score: 5, Interesting

      I've done this.

      First, billions of transistors is actually easy - most of the transistors in a modern CPU are actually spent on caches and other memory. Logic itself doesn't have as high a transistor density as you might think. In fact, in practically all ASIC designs, there's so much extra silicon space that they put extra gates there that do nothing but are tied to a logic value. These spare transistors serve to provide "rework" room for the design. If you look at most steppings, you start with A0, then you have A1, A2, ... B0, B1, ... etc. Well, going from A0 to A1 is basically just a metal mask change - they don't change the transistor masks (each mask costs around $100K, and 10 layer metal designs often have 30+ masks, so a $3M cost before the first silicon is patterned). Instead, they rewire the transistors using this spare sea of transistors to fix the issues - hopefully only needing to change 5, maybe 10 masks tops ($1M). When you go from Ax to B0, that implies a complete new mask set - either there are too many fixes, or the design is being revised.
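
      To make the mask arithmetic above concrete, here is a small sketch using only the rough figures quoted in this comment ($100K per mask, about 30 masks for a full set, up to 10 for a metal-only respin). The function and constant names are made up for illustration:

```python
MASK_COST_USD = 100_000  # rough per-mask figure quoted in the comment


def mask_set_cost(num_masks: int, cost_per_mask: int = MASK_COST_USD) -> int:
    """Total cost in USD of producing the given number of masks."""
    return num_masks * cost_per_mask


if __name__ == "__main__":
    full_set = mask_set_cost(30)      # new stepping letter, e.g. Ax -> B0
    metal_respin = mask_set_cost(10)  # metal-only fix, e.g. A0 -> A1
    print(f"full mask set: ${full_set:,}")
    print(f"metal respin:  ${metal_respin:,}")
    print(f"saved by respin: ${full_set - metal_respin:,}")
```

      With these numbers a metal-only fix costs about a third of a full mask set, which is why the spare gates are worth the silicon area.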

      As for simulation, it's multi-stage. First each block is individually tested, and simulated, then it's all brought together and software simulated to check for easy to spot faults and have full inner visibility to see why things are the way they are. The complexity of modern CPUs and SoCs means this is only around 1Hz, usually less, so it's reserved for initial testing and sanity checking test vectors.

      The next step is to put it on an accelerator - systems like Cadence's Palladium, which can get your clock speeds up to the hundreds of Hz range. The simulation isn't as visible and the timings can be off, but you can functionally check most of the blocks and, with careful probe design, bring error cases back to the software model to understand what's going on.

      The next stage is FPGA simulation - you're testing the logic itself on FPGAs (we're talking about the ones that easily cost $30K each, and no, you need at least 4 or 8 of them or more - that's a quarter million dollars in FPGAs!). But the system moves to the kHz range, or even 1MHz, which despite its slowness is actually fast enough to boot an OS like Windows or Linux or run test software, so software development for drivers and such can begin. Visibility is limited to whatever probes you could install and whatever debugging tools your FPGA toolset has.
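
      A quick sketch of why those clock rates matter. The stage speeds are the rough ones given above (about 1 Hz, hundreds of Hz, up to 1 MHz); the one-billion-cycle workload is an assumed, purely illustrative number, not a measured OS boot:

```python
SECONDS_PER_DAY = 86_400


def wall_clock_days(cycles: float, clock_hz: float) -> float:
    """Days of wall-clock time needed to run `cycles` at `clock_hz`."""
    return cycles / clock_hz / SECONDS_PER_DAY


# Approximate speeds from the comment; 500 Hz stands in for "hundreds of Hz".
STAGES = {
    "software simulation": 1.0,
    "accelerator": 500.0,
    "FPGA": 1_000_000.0,
}

if __name__ == "__main__":
    workload_cycles = 1e9  # hypothetical workload size
    for name, hz in STAGES.items():
        days = wall_clock_days(workload_cycles, hz)
        print(f"{name:>20}: {days:,.3f} days")
```

      Under these assumptions the same workload takes decades in pure software simulation, weeks on an accelerator, and minutes on FPGAs, which matches the point that only the FPGA stage is fast enough to boot an OS.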

      Then it's all laid out and routed and all that, and software simulations are run to verify timings - ensuring there are no setup and hold violations in the final floorplan.

      And it's not as bad as you think - each block is quite independent and as long as the interface contract is held (setup and hold, timings and other things for the block), the tools will tell you how close you are to violating the specs for each block. So you can test each block in isolation and as long as the interface contract is held, be assured it will work.

      Of course, it won't catch integration errors like ground bounce or other such things. It's more akin to building a space shuttle or airplane - with the right design, you can get something that works.

    4. Re:When hardware must just work by Anonymous Coward · · Score: 0

      This is a really interesting talk from 32c3 detailing the challenges involved in designing and verifying something as complex as a CPU, where it can only be simulated at 1 Hz and costs 5 million to produce silicon for testing. https://www.youtube.com/watch?v=eDmv0sDB1Ak. The level of difficulty in getting this right just blows my mind. If it weren't for economies of scale, CPUs would be completely out of reach. Also interesting in the talk is the vast number of CPU defects that are found and cataloged that most people appear to be unaware of. Most are of little importance (and hence don't get fixed), and some are fixed via code (as in this case), but there is no guarantee that these are being patched by OEMs.

      His complaints about the slow speed of simulation are actually an argument for putting more large circuits on test chips. A test chip might take 6 months to get back from the fab, but you can test a lot of logic very quickly on a test chip. Plan to do it early and often and you can dramatically speed up validation. This argument is very current in big silicon design teams right now. Simulation of large systems has become slower than building test chips.

    5. Re:When hardware must just work by Anonymous Coward · · Score: 1

      learn to write mom(s) in the possessive form, douchebag's.
       

      I do not have a dog in this fight, but I would just like to point out that "mom basement" is as valid a term as "man cave".

    6. Re:When hardware must just work by Anonymous Coward · · Score: 1

      As a CPU designer, those were my thoughts as well. tlhIngan is/was probably involved in ASIC design but doesn't understand the significance of verifying architectural correctness and the sophistication of DFT practices.

    7. Re:When hardware must just work by Anonymous Coward · · Score: 0

      I do not have a dog in this fight, but I would just like to point out that "mom basement" is as valid a term as "man cave".

      Sure, but the term used here wasn't referring to a basement for moms. But I'm very curious about such a place. At my age, I pretty much only date moms.

    8. Re:When hardware must just work by Anonymous Coward · · Score: 0

      I think his argument was essentially that since they're memory you can loop through the memory performing operations to verify it, just like memtest. It misses the fact that if the wiring is wrong the memory could be fine but have side effects on other parts of the chip, for example if writing to a specific byte range corrupts the registers.

    9. Re:When hardware must just work by Anonymous Coward · · Score: 0

      Do you really mean billions upon billions? respectfully Carl Sagan.....

    10. Re:When hardware must just work by Anonymous Coward · · Score: 0

      Billions upon billions ! signed Carl Sagan.

      Now to post this a few billion more times on Slashdot.
      First I'll have to buy one of those Skylake processors to automate the posts and hope it doesn't freeze at the first post under the high workload
      or Slashdot crashes taking my comments due to the workload :)

    11. Re:When hardware must just work by tibit · · Score: 1

      Do you have a Czar of Bandgaps, and do you dread temperature-dependent startup problems yet? :)

      --
      A successful API design takes a mixture of software design and pedagogy.
    12. Re:When hardware must just work by tibit · · Score: 1

      Those would be rather gross synthesis or layout errors. These kinds of "miswirings" are almost impossible with properly modularized HDL. If you get the "memory affects registers" kind of a bug, there's something very wrong somewhere in the tools, but it's super unlikely that it'd be a design error.

      --
      A successful API design takes a mixture of software design and pedagogy.
  9. Does the same thing with Autocad by Anonymous Coward · · Score: 1

    Just got an MSI with 32GB of RAM and the Skylake processor because I need to manipulate large AutoCAD files. For no reason my laptop would lock up and nothing would be in the dump logs. I could not figure it out...until now.

    1. Re:Does the same thing with Autocad by Anonymous Coward · · Score: 1

      You were running Windows?

    2. Re:Does the same thing with Autocad by RogueyWon · · Score: 1

      I think you might want to look elsewhere for your problems. I've got an MSI Z170A motherboard, an i7 6700K and 32GB RAM, which I use to manipulate large AutoCAD files... and I have had absolutely no issues at all.

    3. Re:Does the same thing with Autocad by Zero__Kelvin · · Score: 1

      You still haven't figured it out. You are assuming this is your problem. Unless you are an AutoCAD developer and have built the source with debugging enabled and then actually used it to single step to the offending instruction and watched the problem occur, you are still operating on a (not completely unreasonable) assumption.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
  10. Why TFA states BIOS? by short · · Score: 1

    Isn't it easier to distribute new firmware with microcode_ctl/intel-microcode packages? MS-Windows also seems to have some such package updates.

    1. Re:Why TFA states BIOS? by 110010001000 · · Score: 1

      Intel needs to push the microcode update through the BIOS. You can't do it via an OS update. So hopefully your motherboard manufacturer picks up on this.

    2. Re:Why TFA states BIOS? by barbariccow · · Score: 2

      Linux applies microcode updates at runtime...

    3. Re:Why TFA states BIOS? by short · · Score: 1

      The motherboard manufacturer can do whatever they want, but unless I reflash my BIOS it has no effect. And I do not regularly reflash my BIOS, do you? Besides that, I still find the automatic nightly package update easier.

    4. Re:Why TFA states BIOS? by Anonymous Coward · · Score: 0

      Yes you can.

    5. Re:Why TFA states BIOS? by Zero__Kelvin · · Score: 1

      Only if the kernel was built with CONFIG_MICROCODE_INTEL=y set in the .config file.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    6. Re:Why TFA states BIOS? by Billly+Gates · · Score: 1

      Intel needs to push the microcode update through the BIOS. You can't do it via an OS update. So hopefully your motherboard manufacturer picks up on this.

      How often does Joe Sixpack update his BIOS? I mean really? It makes sense to patch the CPU at startup, as most of these users have updates enabled by default because their computer came that way when they turned it on.

    7. Re:Why TFA states BIOS? by 110010001000 · · Score: 1

      Not in this case you can't. For this particular update it needs to go through the BIOS due to the components involved.

    8. Re:Why TFA states BIOS? by Anonymous Coward · · Score: 0

      Well, Skylake is a mess. If you have a new enough UEFI (the "BIOS" firmware by its proper name; BIOS was retired by the time 2nd-gen Core was out) and at least microcode 0x40 or thereabouts, chances are the bootloader and O.S. will get to the microcode update step safely enough for it to work.

      This does require that even the O.S. installer has a microcode update block, though, because at least Fedora, Ubuntu and Debian's installers will outright *crash* if the Skylake microcode is too old. Windows' installer would crash too, but you likely got that pre-imaged from the system vendor, so you wouldn't know it :-p

      However, anything other than a UEFI primary microcode update could arrive too late: it is entirely possible for errata to create unstable chip state or memory corruption that a microcode update can't fix after the fact. So you need to update the microcode well before memory init, during the very initial cold-boot stage before platform init(!). At that time, the chip is running in cache-as-memory mode and there is almost no state that could get corrupted (most of it will get overwritten during the initial microcode update and the platform init soon after that).

      The nasty thing is that there are no Skylake microcode updates available for O.S. install at this time: you have to get it from a UEFI update because Intel is not distributing it in any other way. I have no idea why, though.

      As for this being the first nasty Skylake erratum, that is utter crap. It has been crashing left and right since launch date, and getting fixed by new firmware updates. The microcode is already at revision 0x69/0x6A, and we've seen (through UEFI updates from several manufacturers) at least *20* public releases before that; I think the microcode was at revision 0x36 or a bit less than that at launch.
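
      For anyone who wants to see which revision is actually loaded: on x86 Linux, /proc/cpuinfo reports a "microcode" field per logical CPU. Below is a minimal sketch that parses it. The function names are invented for this example, and it only reads the value; it does not say which revision is "new enough" for any given erratum.

```python
import re


def parse_microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    # Lines look like "microcode<TAB>: 0x6a", one per logical CPU.
    pattern = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$", re.MULTILINE)
    return {int(value, 16) for value in pattern.findall(cpuinfo_text)}


def read_loaded_revisions(path="/proc/cpuinfo"):
    """Read the revisions from the running system (x86 Linux assumed)."""
    with open(path) as handle:
        return parse_microcode_revisions(handle.read())


if __name__ == "__main__":
    for revision in sorted(read_loaded_revisions()):
        print(hex(revision))
```

      If the set holds more than one value, the logical CPUs are not all running the same revision, which is worth investigating on its own.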

    9. Re:Why TFA states BIOS? by Anonymous Coward · · Score: 0

      Firmware updates are _not_ optional for Skylake systems, at least for the time being :-( Maybe in a year or so, all the nasty crap will be fixed and you will be able to get away with keeping the firmware as-shipped.

    10. Re:Why TFA states BIOS? by dfsmith · · Score: 1

      Yup. I got a Skylake 100 Series board in August. Only in November did the BIOS make the system stable at last. That was about the 6th version I'd flashed. Ugh. I'd had many filesystem errors/RAID1 mis-matches due to lock-ups!

    11. Re:Why TFA states BIOS? by tibit · · Score: 1

      No, they don't. Yes you can. Maybe - meh.

      --
      A successful API design takes a mixture of software design and pedagogy.
  11. That's one hell of a heat sink by lucm · · Score: 1

    The CPU makes the PC freeze? If they could just crank this bug down a bit it could revolutionize the server cooling industry.

    --
    lucm, indeed.
  12. At 3ghz 1 in a billion is 3 times a second by Anonymous Coward · · Score: 5, Interesting

    Just saw this video

    https://www.youtube.com/watch?v=eDmv0sDB1Ak

    Gives some insight into the insanely complex nature of processor design and how absurdly reliable they need to be. Modern computers pretty much expect the CPU to be flawless and that's a daunting task considering their complexity and the staggering amount of computations they perform even in ordinary day-to-day use.

    An error that occurs once in a billion operations will happen 3 times a second at 3 GHz.

    So yeah. Some bugs are gonna happen. Thankfully most can be fixed with microcode updates.
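
    The arithmetic behind that figure, as a quick sketch. It assumes one operation per clock cycle, which is a simplification (real cores retire several instructions per cycle), and the function name is just for illustration.

```python
def errors_per_second(clock_hz, operations_per_error):
    """Expected errors per second, assuming one operation per clock cycle."""
    return clock_hz / operations_per_error


if __name__ == "__main__":
    # 3 GHz clock, one error per billion operations.
    print(errors_per_second(3_000_000_000, 1_000_000_000))
```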

    1. Re:At 3ghz 1 in a billion is 3 times a second by ArylAkamov · · Score: 1

      Now I feel bad for OCing to 5 GHz

  13. My biggest "contribution to the good of humanity" by Anonymous Coward · · Score: 0

    Back in the days of the FDIV bug, Intel did not do a good job of disclosing bugs in their CPUs. The pressure kept mounting on them to both fix the bug (and to provide a mechanism for system and software vendors to know the bugs are there in the first place), but they stubbornly refused to do the right thing. So, a columnist for PCWeek (I think it was Dave Berlind) wrote a front page article about the issue. I told him I had canceled the PC orders I placed and would not buy more of them until the situation was resolved. A short while later, Intel changed their tune and also started being more open with the bugs in their processors (PS: we didn't mention that my canceled PC order was for 3 PCs!!! - not exactly dishonest as it was true and was probably representative of most tech people back then, but kind of funny.)

  14. Kimchi, Satan-san? re: FOOF! by Anonymous Coward · · Score: 1

    indeed. Streng was bold.

    http://blogs.sciencemag.org/pipeline/archives/2010/02/23/things_i_wont_work_with_dioxygen_difluoride

    see also

    https://what-if.xkcd.com/40/

  15. And history repeats itself... by Anonymous Coward · · Score: 0

    "I am Pentium of Intel, SIN will be approximated."
    "I am Skylake of Int..."
    "B.. B... B.. Bulldozer!"

  16. Few, OS X is safe! by Imazalil · · Score: 0

    Well, count my lucky stars that OS X isn't affected! Mac master race wins again! I'm guessing there's no Prime95 mac users, so therefore I must be safe, right? right?

    On a slightly more serious note, how does one bios-update the CPU on a Mac? Does Apple roll it into their updates? Just curious.

    1. Re:Few, OS X is safe! by larkost · · Score: 1

      Apple calls these sorts of things "firmware updates" (yes, that is a generic name). Things like this are included, as are things like updates for Ethernet chipsets, FireWire routers (there are 3 on the Mac Pro), and even, rarely, firmware for the GPU. Additionally there are sometimes "SMC" updates for the part of the computer that manages power and sleep behavior.

    2. Re:Few, OS X is safe! by Anonymous Coward · · Score: 0

      Yup, Macs were never meant to be used for complex computational workloads.

    3. Re:Few, OS X is safe! by Anonymous Coward · · Score: 0

      Few people can spell "phew" correctly.

    4. Re:Few, OS X is safe! by Imazalil · · Score: 1

      Guilty as charged, but I'm going to go with "Everyone, I found the OS X manifestation of this bug!!"

  17. Well duh! by Anonymous Coward · · Score: 0

    It's sentient.
    It knows not to fsck with E40001

  18. Sounds just like Apple's HFS+ Journalling bug by Anonymous Coward · · Score: 0

    Though of course, being Apple, they won't admit they have one.

    You can have concurrent workloads writing to an HFS+ Journal volume all day long and have no problems - even though it sucks about 20% kernel time. As soon as the CPU gets starved for cycles by user processes (think something like transcoding) a race condition rears its ugly head, OSX panics, and suddenly users see the infamous "Disk not ejected properly" error message. OSX then dutifully remounts the volume and the journal rollback erases all uncommitted file and folder changes. Thanks, Apple!

  19. now all my Intel 585.879436603 jokes go faster! by swschrad · · Score: 1

    and run simultaneously on 7.9335 threads, too!

    --
    if this is supposed to be a new economy, how come they still want my old fashioned money?
  20. Re:My biggest "contribution to the good of humanit by Zero__Kelvin · · Score: 1

    This is awesome! I so rarely get a chance to use the phrase correlation != causation on Slashdot! (Also, I have some awesome swamp^H^H^H^H^Hland for sale, cheap!)

    --
    Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
  21. More correct than you realize by skinlayers · · Score: 1

    Correct, but for the wrong reason:
    There are currently no Apple products that utilize a Skylake CPU.

  22. Correction by skinlayers · · Score: 1

    I was incorrect. The 2015 iMac has a Skylake CPU.

  23. Prime95 is now an industry? by Anonymous Coward · · Score: 0

    How exactly does one use "Fast Fourier Transforms to multiply extremely large numbers" and when exactly did Prime95 become an industry?

    1. Re:Prime95 is now an industry? by slew · · Score: 2

      How exactly does one use "Fast Fourier Transforms to multiply extremely large numbers" and when exactly did Prime95 become an industry?

      The most common way to multiply numbers larger than the register size of the machine (e.g., 4000-bit numbers) is to write each number as a series of digits relative to some base R, the same way most people multiply numbers of more than one digit.

      (c0 + c1*R + c2*R^2 + c3*R^3 + ...) * (d0 + d1*R + d2*R^2 + d3*R^3 + ...) = (p0 + p1*R + p2*R^2 + p3*R^3 + ...)

      R is 10 for humans; for a computer, R is some power of 2 (because computers like that).

      A basic observation of the math is that the product of digits computed this way is very similar to a linear convolution of those digits (coefficients in this representation) and you can speed up large convolutions using an FFT. If you pick R small enough, you can do the multiplication and all the partial products together without any rounding problems using the SSE/AVX SIMD floating point math on your x86-64 computer.**

      Prime95 is a freeware app used by GIMPS that uses this FFT technique to multiply large numbers together very quickly, and it is a big stress on the CPU because the code is highly optimized.

      Nobody claimed Prime95 is an "industry", but other industries that rely on Skylake processors to do complex operations might be affected by the same bug Prime95 has triggered.

      **Interestingly, straightforward integer multiplication is slower than floating point for certain precisions in nearly all x86-64 implementations because of a premium on SSE/AVX speed: Intel has invested more in 32-bit FP math (24-bit mantissa multiplier for FP) than in 32-bit int math (32-bit x 32-bit -> 64-bit int multipliers are much bigger).
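
      To make the digit-convolution idea concrete, here is a minimal pure-Python sketch. It is not Prime95's actual, highly optimized code; the names fft, to_digits and fft_multiply and the radix of 256 are illustrative assumptions. It splits each number into base-R digits, convolves the digit lists with a textbook recursive FFT, rounds the results back to integers, and lets Python's big integers handle the carries.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    With invert=True the result is unscaled; the caller divides by the length.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def to_digits(number, radix):
    """Little-endian digits of a non-negative integer in the given radix."""
    digits = []
    while number:
        number, digit = divmod(number, radix)
        digits.append(digit)
    return digits or [0]


def fft_multiply(a, b, radix=256):
    """Multiply non-negative integers via FFT convolution of their digits."""
    da, db = to_digits(a, radix), to_digits(b, radix)
    size = 1
    while size < len(da) + len(db):  # room for the full linear convolution
        size *= 2
    fa = fft(da + [0] * (size - len(da)))
    fb = fft(db + [0] * (size - len(db)))
    # Pointwise product in the frequency domain equals convolution of digits.
    product = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # Scale by 1/size and round: a small radix keeps the error below 0.5.
    coeffs = [round(c.real / size) for c in product]
    # Evaluate the coefficient polynomial at R; big-int math does the carries.
    result = 0
    for coeff in reversed(coeffs):
        result = result * radix + coeff
    return result


if __name__ == "__main__":
    a, b = 2**521 - 1, 2**607 - 1
    print(fft_multiply(a, b) == a * b)
```

      The small radix is the "pick R small enough" part: with 8-bit digits and inputs of this size, every convolution coefficient stays far below the 53-bit precision of a double, so the rounding step recovers exact integers. Much larger inputs would need a closer look at the error bound.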

    2. Re:Prime95 is now an industry? by tlhIngan · · Score: 1, Informative

      That's the technical explanation, but the mathematical one is actually fairly simple - you convert the multiplication to an addition. There are several ways to do this - logarithms are one common way (A*B = inverse log(log(A) + log(B)) ), but so is convolution, or realizing that addition and multiplication in say, the time domain becomes multiplication and addition in the frequency domain, respectively.

      So if you have two numbers, you do the FFT of them to convert the domains, then you add them up, and then do the inverse FFT. The FFT is not the only way - the DCT is another way (the FFT is a Fourier transform optimized for computers, using sines, while the DCT uses cosines). You might use the DCT if you have, say, DCT hardware available like on a GPU (video encoders and decoders generally use the DCT over the FFT as the DCT's first parameter gets you the DC level).

    3. Re:Prime95 is now an industry? by slew · · Score: 4, Informative

      FWIW, your "mathematical" explanation is totally bogus. You appear to have literally no idea what you are saying.

      The reason the FFT works for modular multiplication of *integers* with thousands of bits is that you can pick a radix and a convolution size where you do multi-digit convolution where you don't lose any precision in those thousands of bits. Using a "logarithm" algorithm would require nearly 10x the precision to do modular multiplication on integers and using hw floating point (even long doubles) would be totally useless because it isn't accurate to more precision.

      Also, addition and multiplication in the time domain do NOT magically become multiplication and addition in the frequency domain. Convolution in the time domain becomes multiplication in the frequency domain (that's how the FFT algorithm works: FFT, multiply, iFFT becomes cheaper than digit convolution when the size of the problem becomes large).

      Finally, although it might be technically possible to use a DCT used in a typical video decoder to do some trivial digit convolution, the precision of a typical video decoder's DCT is only 14-16 bits and limited to 8 points, which isn't enough precision to do squat for the modular multiplication needed to search for very large Mersenne primes (which is what the Prime95 program does). Of course you can't even get to the 1D DCT used in GPU hardware accelerators (they are generally hardwired to do 2D DCT only and modern compression algorithms don't even use the DCT anymore).

      Sorry to rain on your parade, but leaving stream of consciousness BS like that around unchallenged risks it getting modded up and makes it harder for people to distinguish the real shit from the BS...

    4. Re:Prime95 is now an industry? by Anonymous Coward · · Score: 0

      So this method isn't affected by the "limit cycle" on the LSB?

    5. Re:Prime95 is now an industry? by tibit · · Score: 1

      It's incorrect to say that FFT is "using sines". FFT uses complex exponentials as basis functions, while DCT uses real cosine functions. The major practical difference between the two is the discontinuities at the boundaries present in FFT, but absent in DCT. That's what makes DCT easier to apply in compression jobs.

      --
      A successful API design takes a mixture of software design and pedagogy.
  24. Can I just say.. by Anonymous Coward · · Score: 0

    as someone who may have worked on that chip, I am enormously grateful that it is not my shitstorm to clean up.

  25. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  26. grammar? by Anonymous Coward · · Score: 0

    I think GIMPS means "works perfectly normally." ESL students and kids who didn't pass High School English shouldn't really be talking to the press, eh?

  27. No, it doesn't work like that. by Anonymous Coward · · Score: 1

    Most processor bugs have nothing to do with the frequency of execution; they're caused by a unique set of circumstances. So when someone says it will happen once out of every billion operations, they're making the assumption that you will set up that unique case one out of every billion times. This depends heavily on what you're doing with the processor. For example, this bug is a math-related operation, and chances are that if you put it in one of Google's or Netflix's web servers it would never hit the bug for the duration of its use even though it's getting hammered... because they're not doing math operations of this nature. However, a math major may hit it 2-3 times a week doing their homework (I was in college during the FDIV bug, my g/f at the time was an engineering student who had a statics simulation that triggered it... I thought it was cool... she did not :p)

  28. are we at risk? by carnivore302 · · Score: 1

    While the bug was discovered using Prime95, it could affect other industries that rely on complex computational workloads, such as scientific and financial institutions.

    How about porn?

    --
    Please login to access my lawn
    1. Re:are we at risk? by Anonymous Coward · · Score: 0

      How about porn?

      Porn should be safe komrade, it is only imperialist terrorists who need to generate complex prime numbers for their illegal encryption attempts at communication without government supervision. Nothing to see here. Sieg heil Obama!

    2. Re:are we at risk? by Anonymous Coward · · Score: 0

      It will blur areas that you don't want to be blurred, and not blur areas that you want to be blurred. For an example of the latter, see [insert goatse link here].

  29. Re:My biggest "contribution to the good of humanit by Anonymous Coward · · Score: 0

    Oh, I know, but I think it is more a case of one more straw to break the camel's back. Back then, PCWeek was a BIG DEAL, and I would be very surprised if it didn't add a significant amount of pressure to the people making the decisions at Intel. Regardless of whether it did or not though, the main thing I was happy about was not the actual recall they did, but rather the fact that they implemented a program to disclose bugs for the CPUs they created going forward (because face it, anything as complicated as a "modern" CPU is going to have some flaws.)

  30. broken crypto by Anonymous Coward · · Score: 0

    Could this be an attempt to break crypto currency?

  31. Re:My biggest "contribution to the good of humanit by wonkey_monkey · · Score: 1

    I told him I had canceled the PC orders I placed and would not buy more of them until the situation was resolved. A short while later, Intel changed their tune and also started being more open with the bugs in their processors

    Before I was born, Britain had never had a female prime minister, America had never had a black president, and the Shah still ruled Iran.

    My birth clearly changed all of this...

    --
    systemd is Roko's Basilisk.
  32. Re: only looking for by Cafe+Alpha · · Score: 1

    weapons of math destruction. 3

  33. I'm typing this on an AMD machine by Cafe+Alpha · · Score: 1

    you insensitive clod!