Intel Skylake Bug Causes PCs To Freeze During Complex Workloads (arstechnica.com)
chalsall writes: Intel has confirmed an in-the-wild bug that can freeze its Skylake processors. The company is pushing out a BIOS fix. Ars reports: "No reason has been given as to why the bug occurs, but it's confirmed to affect both Linux and Windows-based systems. Prime95, which has historically been used to benchmark and stress-test computers, uses Fast Fourier Transforms to multiply extremely large numbers. A particular exponent size, 14,942,209, has been found to cause the system crashes. While the bug was discovered using Prime95, it could affect other industries that rely on complex computational workloads, such as scientific and financial institutions. GIMPS noted that its Prime95 software "works perfectly normal" on all other Intel processors of past generations."
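For context, GIMPS has traditionally tested Mersenne candidates with the Lucas-Lehmer test, which is one very large squaring after another. Below is a minimal sketch of that test using Python's built-in big integers. It is not Prime95's code: Prime95 does the squaring step with hand-tuned FFT multiplication, and the function name here is purely illustrative.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime (p must be prime)."""
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs an odd p
    m = (1 << p) - 1  # the Mersenne number being tested
    s = 4
    for _ in range(p - 2):
        # This squaring is the expensive step that FFT multiplication speeds up.
        s = (s * s - 2) % m
    return s == 0


# Small exponents only; the exponent from the story would take far too long here.
for p in (3, 5, 7, 11, 13):
    print(p, lucas_lehmer(p))
```

For an exponent like 14,942,209 each squaring involves a number millions of digits long, which is why FFT-based multiplication matters.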
Nah, we blame this one on the NSA, to wit:
It only happens when running complex calculations like the search for Mersenne primes. Who runs such calculations? It isn't the good citizens looking at whatever it is they look at on Facebook. It's people doing crypto, i.e., Terrorists.
So how do we stop Terrorists? Don't let them do complex crypto calculations.
QED.
Faster! Faster! Faster would be better!
Just saw this video
https://www.youtube.com/watch?v=eDmv0sDB1Ak
Gives some insight into the insanely complex nature of processor design and how absurdly reliable they need to be. Modern computers pretty much expect the CPU to be flawless and that's a daunting task considering their complexity and the staggering amount of computations they perform even in ordinary day-to-day use.
An error that occurs once in a billion operations will happen 3 times a second at 3 GHz, assuming one operation per cycle.
So yeah. Some bugs are gonna happen. Thankfully most can be fixed with microcode updates.
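The "3 times a second" figure is just clock rate divided by operations per error. A quick sketch, assuming one operation per clock cycle (a simplification, since real cores can retire several per cycle); the function and parameter names are made up for illustration:

```python
def errors_per_second(clock_hz: int, ops_per_error: int, ops_per_cycle: int = 1) -> float:
    """Expected errors per second if one error occurs every `ops_per_error` operations."""
    return clock_hz * ops_per_cycle / ops_per_error


# 3 GHz clock, one error per billion operations, one operation per cycle (assumed).
rate = errors_per_second(clock_hz=3_000_000_000, ops_per_error=1_000_000_000)
print(rate)
```

Raising `ops_per_cycle` to model a wider core only makes the number worse.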
Everything is getting faster. Development cycles are getting shorter, schedules are getting tighter, margins are being trimmed down and testing is taking some of that hit. Software is already so brutally paced that customers are now performing QA. We're having to train our customers how to use Bugzilla and we somehow accept this as "OK". Eventually the pacing will become so brutal that version 2 won't even use the same codebase as version 1. Posting bugs will become useless. Software development velocity is such that no one wants to write long-lived code anymore.
Once hardware reaches this breakneck prototyping velocity it's going to be the same thing. Defects will become more common. Revisions will become more common. Just hope they don't tell us to change out the mobo each time or we'll never get anything working. Even if the time between revisions stays the same the complexity is going up and I'd expect they're pulling all-nighters just to keep pace. Risk goes up accordingly.
I work on ASIC design, though I am on the analog side of things. There are more people doing verification than design by roughly 2:1. I am told that in the smaller nodes and more complex designs the ratio is even higher. Basically you can slap down some RTL code (Verilog or VHDL) quickly, but torturing it through all exceptions is very hard. Then you have to synthesize and build it, which can introduce all sorts of timing and parasitic problems that have to be double checked. Finally test vectors have to be created to double check the functionality of every transistor in the design to assure that what was built matches the masks.
It is truly phenomenal that anything with billions of gates ever works at all, let alone with the high yield and relatively low error count we have come to expect.
I've done this.
First, billions of transistors is actually easy - most of the transistors in a modern CPU are actually spent on caches and other memory. Logic itself doesn't have as high a transistor density as you might think. In fact, in practically all ASIC designs, there's so much extra silicon space that they put extra gates there that do nothing but are tied to a logic value. These spare transistors serve to provide "rework" room for the design. If you look at most steppings, you start with A0, then you have A1, A2, ... B0, B1, ... etc. Well, going from A0 to A1 is basically just a metal mask change - they don't change the transistor masks (each mask costs around $100K, and 10-layer metal designs often have 30+ masks, so that's a $3M cost before the first silicon is patterned). Instead, they rewire the transistors using this spare sea of transistors to fix the issues - hopefully only needing to change 5, maybe 10 masks tops ($1M). When you go from Ax to B0, that implies a complete new mask set - either there are too many fixes, or the design is being revised.
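Spelling out the mask cost arithmetic, taking the rough $100K-per-mask figure at face value (the constant and function names are just for illustration):

```python
MASK_COST_USD = 100_000  # rough per-mask cost quoted in the comment above


def mask_set_cost(num_masks: int, cost_per_mask: int = MASK_COST_USD) -> int:
    """Cost in USD of making `num_masks` new masks."""
    return num_masks * cost_per_mask


full_set = mask_set_cost(30)      # complete new mask set (e.g. Ax -> B0)
metal_fix_low = mask_set_cost(5)  # metal-only respin, best case
metal_fix_high = mask_set_cost(10)  # metal-only respin, "tops"

print(f"full set:     ${full_set:,}")
print(f"metal respin: ${metal_fix_low:,} to ${metal_fix_high:,}")
```

So a metal-only fix costs somewhere between a sixth and a third of a full respin, which is why the spare gates are worth the silicon.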
As for simulation, it's multi-stage. First each block is individually tested and simulated, then it's all brought together and software simulated to check for easy-to-spot faults, with full inner visibility to see why things are the way they are. The complexity of modern CPUs and SoCs means this runs at only around 1 Hz, usually less, so it's reserved for initial testing and sanity checking test vectors.
The next step is to put it on an accelerator - systems like Cadence's Palladium, which can get your clock speeds up to the hundreds of Hz range. The simulation isn't as visible and the timings can be off, but you can functionally check most of the blocks and, with careful probe design, bring error cases back to the software model to understand what's going on.
The next stage is FPGA simulation - you're testing the logic itself on FPGAs (we're talking about the ones that easily cost $30K each, and no, one isn't enough: you need at least 4 or 8 of them or more - that's a quarter million dollars in FPGAs!). But the system moves to the kHz range, even up to 1 MHz. Despite its slowness, that's actually fast enough to boot an OS like Windows or Linux or run test software, so software development for drivers and such can begin. Visibility is limited to whatever probes you could install and whatever debugging tools your FPGA toolset has.
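To get a feel for what those clock rates mean in practice, here is a rough sketch of the wall-clock time for a fixed workload at each stage. The stage speeds are taken loosely from the comment above (500 Hz is my stand-in for "hundreds of Hz"), and the one-billion-cycle workload is purely an assumed number, not a measured OS boot.

```python
# Approximate effective clock rate per verification stage.
STAGE_CLOCK_HZ = {
    "software simulation": 1,       # "around 1 Hz, usually less"
    "hardware accelerator": 500,    # assumed value for "hundreds of Hz"
    "FPGA prototype": 1_000_000,    # upper end of "kHz range to even 1 MHz"
}

WORKLOAD_CYCLES = 1_000_000_000  # assumed workload size, for illustration only


def wall_clock_seconds(cycles: int, clock_hz: int) -> float:
    """Seconds needed to run `cycles` clock cycles at `clock_hz`."""
    return cycles / clock_hz


for stage, hz in STAGE_CLOCK_HZ.items():
    seconds = wall_clock_seconds(WORKLOAD_CYCLES, hz)
    print(f"{stage:22s} {seconds:>16,.0f} s  ({seconds / 86_400:,.2f} days)")
```

With these assumptions the same workload goes from decades in software simulation to minutes on FPGAs, which is why software bring-up waits for the FPGA stage.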
Then it's all laid out and routed and all that, and software simulations are run to verify timings - ensuring there are no setup and hold violations in the final floorplan.
And it's not as bad as you think - each block is quite independent and as long as the interface contract is held (setup and hold, timings and other things for the block), the tools will tell you how close you are to violating the specs for each block. So you can test each block in isolation and as long as the interface contract is held, be assured it will work.
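The "how close you are to violating the specs" number is usually reported as slack. A minimal sketch of the textbook setup and hold slack calculation for a single register-to-register path, ignoring clock skew and jitter; all names and delay values are hypothetical, and real timing tools do far more than this:

```python
from dataclasses import dataclass


@dataclass
class PathTiming:
    """Min/max delays along one register-to-register path, in picoseconds."""
    clk_to_q_min_ps: int
    clk_to_q_max_ps: int
    logic_min_ps: int
    logic_max_ps: int


def setup_slack_ps(path: PathTiming, clock_period_ps: int, setup_time_ps: int) -> int:
    """Positive if the slowest data arrives early enough before the next clock edge."""
    return clock_period_ps - (path.clk_to_q_max_ps + path.logic_max_ps + setup_time_ps)


def hold_slack_ps(path: PathTiming, hold_time_ps: int) -> int:
    """Positive if the fastest data stays stable long enough after the clock edge."""
    return (path.clk_to_q_min_ps + path.logic_min_ps) - hold_time_ps


# Made-up numbers for a path clocked at roughly 3 GHz (333 ps period).
path = PathTiming(clk_to_q_min_ps=40, clk_to_q_max_ps=60, logic_min_ps=30, logic_max_ps=220)
print("setup slack:", setup_slack_ps(path, clock_period_ps=333, setup_time_ps=25), "ps")
print("hold slack: ", hold_slack_ps(path, hold_time_ps=20), "ps")
```

Negative slack on either check means the path breaks the contract at that clock period.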
Of course, it won't catch integration errors like ground bounce or other such things. It's more akin to building a space shuttle or airplane - with the right design, you can get something that works.