Slashdot Mirror


Google Says CPU Patches Cause 'Negligible Impact On Performance' With New 'Retpoline' Technique (theverge.com)

In a post on Google's Online Security Blog, two engineers described a novel chip-level patch that has been deployed across the company's entire infrastructure, resulting in only minor declines in performance in most cases. "The company has also posted details of the new technique, called Retpoline, in the hopes that other companies will be able to follow the same technique," reports The Verge. "If the claims hold, it would mean Intel and others have avoided the catastrophic slowdowns that many had predicted." From the report: "There has been speculation that the deployment of KPTI causes significant performance slowdowns," the post reads, referring to the company's "Kernel Page Table Isolation" technique. "Performance can vary, as the impact of the KPTI mitigations depends on the rate of system calls made by an application. On most of our workloads, including our cloud infrastructure, we see negligible impact on performance." "Of course, Google recommends thorough testing in your environment before deployment," the post continues. "We cannot guarantee any particular performance or operational impact."
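
The dependence on system-call rate quoted above can be put into a back-of-the-envelope model. The sketch below is an illustration only, not Google's methodology: the function name, the sample workloads, and the 200 ns added cost per call are assumptions chosen to show the arithmetic.

```python
def kpti_overhead_fraction(syscalls_per_second, extra_ns_per_syscall):
    """Estimate the fraction of one core's time spent on added KPTI cost.

    Both inputs are assumptions supplied by the reader; measure them on
    your own hardware rather than trusting the sample values below.
    """
    extra_seconds = extra_ns_per_syscall * 1e-9
    return syscalls_per_second * extra_seconds


if __name__ == "__main__":
    # Hypothetical workloads: a compute-bound job and a syscall-heavy server.
    for name, rate in [("compute-bound", 1_000), ("syscall-heavy", 500_000)]:
        fraction = kpti_overhead_fraction(rate, extra_ns_per_syscall=200)
        print(f"{name}: {fraction:.2%} of one core")
```

Under these assumed inputs the compute-bound job loses a negligible share of a core while the syscall-heavy one loses roughly a tenth, which is consistent with the advice to test in your own environment.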

Notably, the new technique only applies to one of the three variants involved in the new attacks. However, it's the variant that is arguably the most difficult to address. The other two vulnerabilities -- "bounds check bypass" and "rogue data cache load" -- would be addressed at the program and operating system level, respectively, and are unlikely to result in the same system-wide slowdowns.

58 of 120 comments

  1. You can't "patch" hardware by Anonymous Coward · · Score: 1, Interesting

    This is a hardware level problem. This will be continued to be exploited pretty much indefinitely. In my estimation this is the single biggest security problem ever created. My advice? Mortgage your house, cash out the retirement fund, and dump it all into AMD. Because Intel is going to be destroyed by lawsuit after lawsuit.

    1. Re: You can't "patch" hardware by Anonymous Coward · · Score: 2, Informative

      You can fix the microcode. You can also include software workarounds for hardware flaws. An example was the Pentium F00F bug, which was addressed by the operating system.

    2. Re:You can't "patch" hardware by supremebob · · Score: 5, Informative

      Geez... You make it sound like this is the first ever time someone has had to write a software patch to bypass a hardware flaw. Driver developers have had to come up with clever workarounds to hardware defects since the dawn of computing.

      These Intel firmware fixes are just going to become part of yet another security update that will be required to keep systems secure.

    3. Re:You can't "patch" hardware by 110010001000 · · Score: 3, Insightful

      Again: there are no Intel firmware fixes for Meltdown. It cannot be fixed without replacing the processor. There are only mitigation workarounds.

    4. Re:You can't "patch" hardware by Anonymous Coward · · Score: 1

      This is a hardware level problem. This will be continued to be exploited pretty much indefinitely.

      Have you looked at the actual retpoline patches rather than simply inserting foot? It is an interesting approach to block speculative fetching by replacing indirect jumps and calls with a return-based sequence.

    5. Re:You can't "patch" hardware by AvitarX · · Score: 1

      Based on the summary, this is a fix that dramatically reduces the impact of Meltdown (I'm too lazy to read up, as it doesn't directly impact me). If they found a way to keep Meltdown at the lower bound, they're doing alright.

      The lower bound is about 5% (the initial patch on a PCID-supporting processor cost 7% in an artificial Postgres benchmark that was more prone to slowdown than real life). If they found a way to get older chips to that point, and shave a little bit off there, it dramatically reduces the problem.

      It pulls them ahead of AMD for single thread at sane price at the very least.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    6. Re:You can't "patch" hardware by supremebob · · Score: 1

      Sure, but it's kind of like the Intel Pentium F00F bug. The underlying hardware issue will always be there, but the OS kernel can prevent that instruction from being run on the system.

    7. Re:You can't "patch" hardware by admin7087 · · Score: 1

      I don't understand this talk about 'dramatically reducing the problem'. Either there is an exploitable flaw or not. If the fix only makes implementing the type of exploit harder, then it's not going to help at all. Some assembler freak and malware author somewhere in the world will still make it work.

      I'm not claiming that there is no fix, only that mere workarounds may be of limited value. What I've read so far hasn't really reassured me. The same can be said about rowhammer, btw. What's so worrying about these types of attacks is that best practices will not help you against them.

  2. time flies by mapkinase · · Score: 4, Funny

    Pentium 4.99989 disaster seems like yesterday.

    --
    I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
  3. Or just Buy AMD & get no slow down with more p by Joe_Dragon · · Score: 5, Informative

    Or just Buy AMD & get no slow down with more pci-e lanes.

  4. More lies by 110010001000 · · Score: 1

    This isn't a "chip-level" patch. The spin control here is admirable.

    1. Re:More lies by nyet · · Score: 2

      I definitely don't see how requiring you to replace GCC and recompile every single binary is "chip-level".

    2. Re:More lies by 110010001000 · · Score: 4, Interesting

      It isn't "chip level". The Intel PR spin is out in full effect. Meltdown is a major flaw that can only be fixed by removing the flawed Intel processor and replacing it with a processor that doesn't contain the flaw. If you don't do that, the best you can do is mitigate the effects. There is no microcode fix either. What Google is doing is recompiling everything, which is fine, but hackers aren't going to do that.

    3. Re:More lies by whoever57 · · Score: 1

      Exactly, you can't provide a general fix to chip-level security problems by changes to "programs". People can compile their own programs and have root access on VMs that they control.

      However, Google controls the hypervisor and presumably, it's at this level that the attack can be blocked or mitigated.

      --
      The real "Libtards" are the Libertarians!
    4. Re:More lies by 110010001000 · · Score: 2

      Exactly. The funny thing is these "cloud companies" always control their own infrastructure, so these types of "fixes" make sense. Everyone else is screwed.

    5. Re:More lies by atrex · · Score: 1

      Technically, you'd have to replace the motherboard too. Really no such thing as just replacing the chip since all motherboards are pretty much designed for a specific chip series at this point, unless there exists a chip in the same series without the flaw. I certainly couldn't name a single motherboard where you could choose between installing an LGA 2066 Core i9 and a Socket sTR4 Ryzen Threadripper.

  5. Google's technique requires patching binaries/code by JoeyRox · · Score: 4, Interesting

    Google's technique is to patch binaries so that branches/calls don't use the branch prediction mechanism of the CPU, which has a small performance hit but much smaller than KPTI. I suppose the presumption is that harmful code which uses the technique would have to compile it into their binary since most OS's prevent the self-modification of code segments/TLB entries once they've been placed into memory by the OS loader. But what about code segments generated entirely at runtime, including from interpreters and libraries like libjit?

  6. amd needs desktop level server chips / ipmi boards by Joe_Dragon · · Score: 1

    AMD needs desktop-level server chips and IPMI boards, like Intel's Xeon E3.

    Ryzen PRO chips fully support ECC, so we just need a few boards with IPMI.

    Threadripper is a nice workstation system.

    Threadripper boards with IPMI would be nice, as it has higher clocks with fewer cores than EPYC chips.

    A full EPYC board is overkill for smaller site hosts.

  7. Re:Google's technique requires patching binaries/c by Fly+Swatter · · Score: 1

    How is patching software a 'chip-level patch?' Is the summary that wrong?

  8. Re:Google's technique requires patching binaries/c by 110010001000 · · Score: 1

    It works for Google because they run everything on their own infrastructure and have full control over it. They don't run it on someone else's "cloud". Rather ironic.

  9. Re:Or just Buy AMD & get no slow down with mor by RhettLivingston · · Score: 1

    This incident highlights the importance of maintaining vendor diversity in data centers. Modern processors are complex enough that it is not unlikely that any given design has problems waiting to be discovered. It would seem wise for large-scale clients to hedge their bets by having a mix of devices carrying their workload. Imagine the damage if someone discovered a means of bricking Intel processors and added the payload to one of the better viruses.

  10. Re:Idiotic Moderation by 110010001000 · · Score: 5, Insightful

    Because it doesn't make sense: Intel has a KNOWN UNFIXABLE FLAW in Meltdown. It cannot be fixed. You are saying "don't switch to AMD because they might have a major flaw too at some point". Meltdown is a much larger problem than Spectre is.

  11. Re: amd needs desktop level server chips / ipmi bo by 110010001000 · · Score: 5, Informative

    More Intel spin. Spectre and Meltdown are different flaws. Meltdown is severe and unfixable and only affects Intel.

  12. Re: Idiotic Moderation by 110010001000 · · Score: 2

    No one said AMD was a magic bullet, but Intel at this point is bullet ridden. I only use Intel processors myself, but this is a huge flaw.

  13. Retpoline is for Spectre by Anonymous Coward · · Score: 1

    Meltdown patch (KPTI) will still hurt applications with lots of syscalls, or lots of userspace->kernel context switches.
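
    One rough way to see the per-call cost on a given machine is to time a cheap call that enters the kernel. The sketch below is a hedged illustration: the function name and iteration count are arbitrary, and os.getppid() is assumed to enter the kernel on every call, as it does on Linux. The figure includes Python interpreter overhead, so it is only meaningful when comparing the same machine before and after a kernel update.

```python
import os
import time


def mean_syscall_ns(iterations=200_000):
    """Return the mean wall-clock cost of one os.getppid() call in ns.

    The result mixes interpreter overhead with kernel entry/exit cost,
    so treat it as a relative number, not an absolute syscall latency.
    """
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getppid()  # cheap call, assumed to enter the kernel each time
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


if __name__ == "__main__":
    print(f"mean cost per call: {mean_syscall_ns():.0f} ns")
```

    Running it on the same host with and without the KPTI patch gives a crude estimate of the added per-call cost for that machine.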

  14. Re:Idiotic Moderation by Anonymous Coward · · Score: 1, Interesting

    A flaw has just been discovered in my car where it has a 20% chance of spontaneously bursting into flames when you turn on the ignition. However, I've decided to keep buying the same model of car because other cars likely have equally severe issues that just haven't been discovered yet.

    Be smart - keep buying shit.

  15. Summary not very helpful, here's my attempt. by PhrostyMcByte · · Score: 5, Informative

    Google has created "retpoline", a technique which allows an indirect branch (e.g. a vtable call) to occur in a way that effectively disables speculative execution for that branch by trapping branch target prediction in a safe, effect-free loop. This addresses Variant 2 (one of the two Spectre variants).

    Retpoline does not depend on or assist a CPU or an OS patch: it is done purely at the software level, per-app, by a compiler. There is no simple OS-wide patch.

    Google says a retpoline call has performance "within cycles" of a regular old mispredicted branch. The zero-cost predictions we're used to are a thing of the past, because it effectively forces misprediction. I'd be curious to see a benchmark of an indirection-heavy platform like .NET.

    This does not help address or optimize Variant 3, which is what the big kernel patches for Page Table Isolation are needed for. So, your I/O-dependent apps like databases are still going to take a big performance hit. Nor does it address Variant 1.
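
    The trade-off described above (no attacker-steered speculation, at roughly the price of a mispredicted branch on every indirect call) can be shown with a toy model. This is not the real mechanism, which is an instruction sequence emitted by the compiler; the dict standing in for the branch target buffer, the names, and the 20-cycle penalty are all assumptions for illustration.

```python
MISPREDICT_PENALTY = 20  # assumed cycle cost, for illustration only


def indirect_branch(btb, branch_addr, real_target, use_retpoline):
    """Return (speculative_target, cycles_lost) for one indirect branch.

    btb is a dict standing in for the branch target buffer, which an
    attacker may have trained with a target of their choosing.
    """
    if use_retpoline:
        # Speculation is parked in a harmless loop; the BTB is never used,
        # so every call costs about as much as a mispredicted branch.
        return "capture_loop", MISPREDICT_PENALTY
    predicted = btb.get(branch_addr, real_target)
    cycles = 0 if predicted == real_target else MISPREDICT_PENALTY
    return predicted, cycles


if __name__ == "__main__":
    poisoned_btb = {0x400: "attacker_gadget"}
    print(indirect_branch(poisoned_btb, 0x400, "real_method", False))
    print(indirect_branch(poisoned_btb, 0x400, "real_method", True))
```

    In the model, the unprotected branch speculates into whatever the poisoned table says, while the retpoline version always speculates into the capture loop and always pays the penalty.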

    1. Re: Summary not very helpful, here's my attempt. by pop+ebp · · Score: 1

      EXACTLY. The summary is horrible. It made it sound like Google invented a novel technique that makes the KPTI/Variant 3 (Meltdown) mitigation slowdown "negligible". But actually the blog post simply says:

      • They invented a technique called Retpoline that mitigates Variant 2, with negligible performance impact; and
      • When testing KPTI/Variant 3 (Meltdown) mitigation on their own workflows, they found the performance impact negligible.
  16. Google is connected to Intel at the hip by bongey · · Score: 4, Insightful

    Google is dependent on Intel CPUs at the moment and has a vested interest in not saying well our cloud just got 5-30% slower.

    1. Re:Google is connected to Intel at the hip by swillden · · Score: 1

      Google is dependent on Intel CPUs at the moment and has a vested interest in not saying well our cloud just got 5-30% slower.

      Exactly the same as their competitors, including in-house data centers as well as other cloud providers.

      --
      Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
  17. I think they're not looking at the big picture by Anonymous Coward · · Score: 1

    These three exploits are instances, not three different principles. The principle is the same, and there is no reason to suspect that there won't be more instances that follow that principle. CPUs speculatively execute code and load cache lines based on that execution. Intel CPUs can furthermore access privileged memory when unprivileged code is executed speculatively. That's the principle. The way the speculatively executed code is guarded and the speculative window is widened differs between the three exploits, but if you only protect against these three attacks, you leave the fundamental principle available to different exploitation. KPTI prevents privileged memory accesses from userland code no matter how the CPU is coaxed into making these speculative accesses. It does have a performance impact, but at least it addresses all as yet unknown ways of exploiting speculative execution.

  18. Re:Google's technique requires patching binaries/c by PhrostyMcByte · · Score: 5, Insightful

    Google's technique ... has a small performance hit but much smaller than KPTI.

    Keep in mind Google's technique (retpoline) is not an alternative to KPTI. Retpoline addresses Variant 2. KPTI addresses Variant 3. Both are required.
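
    The pairing of variants and mitigations discussed in this thread can be written down as a small lookup table. This is a sketch for orientation only: the helper name is hypothetical, "branch target injection" is the commonly used name for Variant 2 (the story summary does not name it), and "per-program fixes" paraphrases the summary's "addressed at the program level".

```python
# Variant numbers follow the story summary and the comments above.
MITIGATIONS = {
    1: ("bounds check bypass", "per-program fixes"),
    2: ("branch target injection", "retpoline"),
    3: ("rogue data cache load", "KPTI"),
}


def required_mitigations(variants):
    """Return the sorted mitigation names needed to cover the given variants."""
    return sorted({MITIGATIONS[v][1] for v in variants})


if __name__ == "__main__":
    # Covering Variants 2 and 3 needs both techniques, not one or the other.
    print(required_mitigations([2, 3]))
```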

  19. Seriously misleading by jspenguin1 · · Score: 2

    Not only do they misspell the name of the mitigation technique, the "retpoline" technique only protects against the indirect branch variant of Spectre. The fix for Meltdown is still KPTI, with all the same overhead that involves. The "negligible impact on performance" is on top of the KPTI changes.

  20. Re: Idiotic Moderation by Anonymous Coward · · Score: 1, Informative

    Forget about the future. What about NOW? If you run Intel you are vulnerable to Meltdown. If you run AMD, you aren't. Meltdown is a major bug. And yes, AMD microarchitecture is superior. It isn't affected by Meltdown.

  21. Re: Idiotic Moderation by Anonymous Coward · · Score: 4, Interesting

    I take it you didn't read AMD's press release explaining exactly what you say you want to hear.

    It's true that all processors have errata and can have bugs/flaws/security weaknesses... but, the Meltdown flaw which does not affect AMD is a specific kind which can't affect AMD because of architecture differences. Specifically, AMD checks to make sure user land code doesn't try to access kernel data without the correct permissions before executing predictive branches on it. Intel doesn't -- it goes ahead and runs the illegal code before flagging an exception to dump the branch after the fact. So, for a short time, there's data in cache on an Intel chip that should NOT be there because it should never have been accessed by the system to begin with.... and a specially crafted program can read it before it's flushed. This is because Intel (and ARM and others) chose a certain optimization for their speculative engine while AMD chose a different, more secure architecture.

    https://www.pcgamesn.com/intel...

    AMD's fix is -- no fix needed b/c we weren't stupid enough to let even speculative code run without checking its permissions first.

    Per AMD for the initial Linux kernel patch:

    AMD processors are not subject to the types of attacks that the kernel page table isolation feature protects against. The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault.
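
    The difference described above, checking permissions before a speculative access rather than after it, can be sketched as a toy model. Nothing here touches real hardware: the class, the secret value, and the 256-line probe array are assumptions that only mimic the shape of the attack.

```python
SECRET_BYTE = 0x2A  # hypothetical value sitting in kernel memory


class ToyCPU:
    """Toy core whose only state is the set of cached probe-array lines."""

    def __init__(self, checks_permission_first):
        self.checks_permission_first = checks_permission_first
        self.cached_lines = set()

    def user_load_from_kernel(self):
        """Model a user-mode load of a kernel address, which always faults."""
        if not self.checks_permission_first:
            # The dependent load runs speculatively before the fault is
            # raised and pulls in the probe line indexed by the secret.
            self.cached_lines.add(SECRET_BYTE)
        return None  # architecturally, no data is ever returned


def recover_byte(cpu):
    """Attacker side: trigger the faulting load, then look for a hot line."""
    cpu.user_load_from_kernel()
    hot = [line for line in range(256) if line in cpu.cached_lines]
    return hot[0] if hot else None


if __name__ == "__main__":
    print(recover_byte(ToyCPU(checks_permission_first=False)))
    print(recover_byte(ToyCPU(checks_permission_first=True)))
```

    In the model the load never returns data in either case; the leak comes only from the cache footprint left behind when the check happens too late.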

    AMD is definitely vulnerable to lesser exploits -- some of which are patched, others mitigated... and some are obfuscated because they are processor-generation specific. But, by design, they are not vulnerable to Meltdown or any variant like it.

    Now remember... the fix for Meltdown is to flush the cache -- all levels -- when switching from user mode to kernel mode or vice versa.... every single time. That's a heck of a hit for some use cases. I believe Intel has found some ways to mitigate it with their 8th gen core series and will likely tinker with a better patch in the future.

    It is absolutely a great idea to purchase an AMD processor if it suits the needs of one's business for those use cases where it will perform better than an Intel chip that is crippled by this horrendous bug -- all things being equal. Obviously, businesses have contracts with 3rd party suppliers and don't necessarily get to pick and choose every aspect of hardware, nor is AMD a savior necessarily if their total cost of ownership is higher because of servicing more varieties of equipment, dealing with more motherboard types and vendors, electricity / Air conditioning costs, etc.

    One doesn't have to be a shill for AMD to notice it's obvious that Intel has a serious hardware flaw that AMD lacks and while any CPU can have errata, most can be patched with negligible effects. Intel having to flush caches between modes is a serious flaw if one runs programs that switch modes constantly. For average users and even gamers, there's not a huge impact. I'm running the patch right now for Windows and I can tell it affects Virtual Machines and a bit of file serving, but not enough for me to be too upset about it. If I had a high-end cluster for databases, a 20% hit to that would definitely make me want to check out AMD as an alternative... b/c even IF AMD has a bug that needs patching, it's unlikely to ever affect performance like this one does by requiring cache flushes to avoid having processes of user and kernel modes running at the same time for fear of one stealing data from the other.

  22. Re: Or just Buy AMD & get no slow down with mo by Anonymous Coward · · Score: 2, Insightful

    This probably offers a false sense of security. It's very possible that there are bugs lurking in AMD hardware that are just as severe. Just because AMD processors aren't susceptible to Meltdown doesn't mean there aren't other vulnerabilities unique to AMD processors.

    And sticking with Intel even after this patch probably offers a false sense of security. It's very possible that there are more bugs lurking in Intel hardware that are just as severe. Just because Intel processors have been patched for Meltdown doesn't mean there aren't other vulnerabilities unique to Intel processors.

  23. Re: Idiotic Moderation by Shikaku · · Score: 1

    They're both full of bullet holes, but AMD at least has fewer holes, in short.

  24. Re:Idiotic Moderation by jittles · · Score: 2

    Because it doesn't make sense: Intel has a KNOWN UNFIXABLE FLAW in Meltdown. It cannot be fixed. You are saying "don't switch to AMD because they might have a major flaw too at some point". Meltdown is a much larger problem than Spectre is.

    Except that I read the write-up by the team and it did NOT say that AMD was immune to Meltdown. It actually said that they were able to get AMD processors to execute the pipelines but were unable to read it before the cache was invalidated. They speculated that a more optimized attack may be able to read the cache but they did not know for sure if it was possible. Thus they were not able to use their existing attack against AMD but that does not mean that it is not possible. AMD claimed that those pipelines would never execute and Google's team claims otherwise.

    Intel claimed that they would have a patch available for 90% of the processors affected by next week. Whether that means they have found some byte code that mitigates the attack vector, or they meant OS level patches to flush the cache on system calls was not clear in the blurb that I read. Either way, Google is still claiming that their patch has negligible effect on their server farms. They have quite a few systems deployed doing all kinds of things. Odds are good that the patch will be negligible for many or most users.

  25. Re:Or just Buy AMD & get no slow down with mor by AHuxley · · Score: 1

    Think of the problem as a Venn diagram and the two CPU "vulnerabilities" as lists of CPUs within the diagram.
    Some CPU generations will have both issues. Some one issue. Very few will not have any problem.

    --
    Domestic spying is now "Benign Information Gathering"
  26. Re: Idiotic Moderation by jezwel · · Score: 4, Insightful

    Is there a compelling reason to believe that AMD processors are less likely to be vulnerable in the future than Intel processors?

    Right now only Intel is massively exposed on one security issue where other manufacturers are not. So yes - this makes it appear that AMD design philosophy values security over performance. Whether that is proved out remains to be seen.

    If one manufacturer is cutting corners with the engineering and the other isn't, then there's a logical reason.

    Intel seems to be the one cutting corners - for decades. You do remember the FDIV and F00F bugs in early Pentiums? I don't recall other manufacturers having such severe problems (sure, mainly PR with FDIV) that a recall was required.

    Otherwise, there isn't a logical basis for using that as a reason to change your behaviour in the future.

    Intel cannot provide CPUs to retail without this flaw for another 18 months or so. That should most certainly influence short-term future behaviour IF the fix causes significant performance issues with your workload.

    It's also entirely possible that, faced with backlash and distrust, the manufacturer might take additional steps to ensure that no such similar issues occur in the future. If there was demonstrable evidence of this, it might be a good reason not to switch.

    Sounds strange to not switch to a vendor that doesn't suffer from this vulnerability, in the hope that Intel will fix its processes to ensure this doesn't happen again. Right now though, there's no good reason to specify Intel for your CPUs.

    The important question is whether there is any reason to believe Intel processors will be more vulnerable in the future.

    Why is that important? All manufacturers will have problems. You make plans with known data today. Intel messed up big time, and until the problem is fixed they should absolutely have this issue in the 'known problems' pile when consideration of CPU choice is done.

  27. Re:Patched OS tables? by AvitarX · · Score: 1

    I don't think many ARM CPUs use out of order.

    Posting primarily to be corrected if I'm wrong.

    --
    Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
  28. Re: Idiotic Moderation by Anonymous Coward · · Score: 1, Interesting

    The shill gets modded up while posts get modded down for pointing out why the shill is giving bad advice.

    Given the choice of buying one of two x86-64 processors you would choose the Intel one that has a known critical security flaw that can only be mitigated with a performance crippling software patch rather than the AMD one that does not have this flaw. I think it's quite obvious who the shill is on this one and he/she is in some pretty serious damage control at the moment.

  29. Re:Idiotic Moderation by AvitarX · · Score: 1

    Based on what I read.

    AMD said they're immune to Meltdown because they keep the protection of kernel memory stricter.
    Intel said 90% of the last five years' processors, not 90% of vulnerable ones.

    This isn't to shit on Intel; the 5ish percent slowdown on CPUs that support PCID isn't so bad. It's just a clarification of how I've understood the news.

    --
    Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
  30. Just installed the Win 10 patch on my i5 7500 by rsilvergun · · Score: 1

    little or no hit to passmark performance. I haven't got any games installed to test at the moment but passmark's GPU/CPU usually give me a good idea where I'd wind up. My VMs are running fine too.

    --
    Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
    1. Re:Just installed the Win 10 patch on my i5 7500 by bad-badtz-maru · · Score: 1

      SSD I/O seems to get hit the hardest. Check there and see where you're at. On an ancient dual Xeon system I took a 30% hit.

  31. Re: Idiotic Moderation by MakerDusk · · Score: 1

    Relax. With all the bad press, is it really surprising that Intel has resorted to the sponsorship of defamatory posts? They have no other recourse for spinning this in a favorable light. Another bad decision on their part, but seemingly par for the course.

  32. How about compressed/encrypted code? by mnemotronic · · Score: 1

    How does the RETPOLINE mitigation applied to binaries deal with dynamically (JIT) decompressed or decrypted code? The ability for speculative pre-fetching to gather data that's normally off-limits to a process seems like a huge can of worms for code that can't be pre-processed by the mitigation.

    /unrelated/ I'm not up-to-speed on webasm, but I can see how a vuln might be crafted from an instruction stream since the assembly generator is (presumably) following a recipe.

    --
    The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
  33. Re: amd needs desktop level server chips / ipmi bo by Anonymous Coward · · Score: 2, Informative

    Sorry, but ARM says it does apply to some of the ARM models. Variant 3: rogue data cache load (CVE-2017-5754) is Meltdown.
    https://developer.arm.com/support/security-update

    For AMD's sake, I hope their assessment about Ryzen's different architecture is 100% correct. If someone should come up with a POC working on these, AMD would be completely screwed.

    "Lesser" is subjective. It appears that Meltdown can be mitigated if not negated by the KAISER patches to operating systems but Spectre needs to have software (and not only kernels) recompiled or partially rewritten.

    CAPTCHA: surgeons

  34. Compiler support? by Chrisq · · Score: 1

    It would be good to have speculative execution protection as a compiler option rather than as a patch to binaries. This could tune the protection to what is necessary for each specific processor.

  35. Re:Idiotic Moderation by Anonymous Coward · · Score: 5, Informative

    Correction, they speculated that they were able to get AMD chips to do that. Their toy attack (within process) succeeded showing AMD chips will do speculative ordering. No actual security risk there, because processes can read their own memory.

    BUT, they didn't know for a fact why they didn't succeed in attacking the kernel.

    We've now had statements from AMD (after the paper was released) - namely, that permission bits are checked BEFORE issuing instructions so kernel memory isn't readable, even speculatively.

    So... yeah, remember the paper is only what they think could be happening.

  36. Re:Idiotic Moderation by Anonymous Coward · · Score: 3, Interesting

    AMD pushed a patch [1] to disable the workaround for Meltdown on AMD CPUs. That means they are 100% sure that their CPUs are immune.

    [1] https://lkml.org/lkml/2017/12/27/2

  37. Re:Idiotic Moderation by Ash-Fox · · Score: 1

    As long as they aren't worse security wise, I'd encourage people to consider buying them, if for no other reason to try to encourage some competition.

    There are more eyes on Intel processors than AMD at the moment. I don't know what is worse on the security spectrum.

    --
    Change is certain; progress is not obligatory.
  38. Re:Or just Buy AMD & get no slow down with mor by swb · · Score: 1

    I think this would make sense if you had the vendors at rough sales parity and the virtualization vendors had healthy experience on both platforms so all the gotchas of moving live workloads between CPU vendors were understood and mitigated.

    It might actually not work well or require heterogeneous vendor-specific clusters to avoid CPU feature masking that dumbed both vendor platforms to some lowest common denominator.

  39. Re:Idiotic Moderation by admin7087 · · Score: 1

    Security assessments need to be based on evidence, not speculation. In that respect GP's advice was perfectly sound. However, this may be a case where waiting a bit might help to get a better picture.

  40. Re: Idiotic Moderation by atrex · · Score: 1

    I can understand the potential severity of the reported issue, but, iirc doesn't it also rely on the attacker having penetrated the system far enough to have permission to execute malicious code? I don't think it's something that they could manage with javascript in a rogue ad on facebook.

  41. Re: Idiotic Moderation by epine · · Score: 1

    Intel seems to be the one cutting corners - for decades. You do remember the FDIV and F00F bugs in early Pentiums?

    I recall the FDIV bug quite well, and it had nothing to do with cutting corners. The design of the circuit was correct. In the transfer to manufacturing, some relatively insignificant bits in a hardware lookup table were truncated erroneously. The rarity of the failures allowed the mishap to escape detection in the validation phase.

    Intel's test probably should have been stronger in this area, but that's an awfully easy thing to say in hindsight concerning the validation of extraordinarily complex designs.

    Nostradamus: "There's a horrible bug in this design, and if you double your test coverage from stem to stern, you'll probably find it."

    Intel: "Gee, thanks, Nostradamus. Invest another $10 million and wind up a year late. I think we'll pass on the engineering, and expand our PR team by one full-time professional bullshitter."

    Nostradamus: "So be it. For what it's worth, I also wrote this nice quatrain on the horrors of speculation."

    Intel: "We'll pass."

    Nostradamus: "No, you won't."

    Intel has been many things over the years (with a weird, clockwork heel-turn), but skimping on validation is pretty much the last thing on my list of Intel malfeasance.

    i860
    RDRAM
    Caminogate
    Itanium
    general crisis-management ethos

    Oral History of John H. Crawford 2014 Computer History Museum Fellow — 2014

    I recall that as a great read. From my own notes:

    Big numeric coprocessor redesign as part of the Pentium. This led to the world-famous Pentium FDIV bug. He claims that transcendentals were easy to test on existing software, but most software took extraordinary efforts to avoid division, so that coverage was extremely thin at this testing layer by comparison.

    I think that discussion also covers the i860, a litany of terror.

    Intel i860

    The Intel i860 (also known as 80860) was a RISC microprocessor design introduced by Intel in 1989.

    It was one of Intel's first attempts at an entirely new, high-end instruction set architecture since the failed Intel i432 from the 1980s. It was released with considerable fanfare, slightly obscuring the earlier Intel i960, which was successful in some niches of embedded systems, and which many considered to be a better design. The i860 never achieved commercial success and the project was terminated in the mid-1990s....

    On paper, performance was impressive for a single-chip solution; however, real-world performance was anything but.

    One problem, perhaps unrecognized at the time, was that runtime code paths are difficult to predict, meaning that it becomes exceedingly difficult to order instructions properly at compile time. For instance, an instruction to add two numbers will take considerably longer if the data are not in the cache, yet there is no way for the programmer to know if they are or not. If an incorrect guess is made, the entire pipeline will stall, waiting for the data.

    The entire i860 design was based on the compiler efficiently handling this task, which proved almost impossible in practice. While theoretically capable of peaking at about 60-80 MFLOPS for both single precision and double precision for the XP versions, hand-coded assemblers managed to get only up to about 40 MFLOPS, and most compilers had difficulty getting even 10 MFLOPS.

    The later Itanium architecture, also a VLIW design, suffered again from the problem of compilers incapable of delivering optimized (enough) code.

    Another serious problem was the lack of any solution to handle context switching quickly. The i860 had several pipelines (for the ALU and FPU parts) and an interrupt could spill them and require them all to be

  42. Re:Or just Buy AMD & get no slow down with mor by RhettLivingston · · Score: 1

    Google, Microsoft, and Amazon dwarf Intel. They should not be waiting around for sales parity. They should be creating vendors if the vendors they need aren't there.

    In the past, powerful industries would foster competition amongst their suppliers even if it involved significant loss. It is a necessary business expense that leads to many benefits including competition, diversity in supply (we are vulnerable to terrorists taking out foundries and countries cutting chip supplies today), and diversity in design that helps with problems like the one we just encountered.

    I'm not sure why tech operations don't concern themselves as much with this, though perhaps they are starting to. It may be a maturity thing. There seem to be more cases of manufacturers using multiple suppliers cropping up lately. Apple intentionally uses both Intel and Qualcomm in phones. Samsung is using both a Qualcomm processor and one of their own design in the S9 generation.

    In the data center arena we may be at a threshold. There is renewed competition from AMD and long-shot entries like Qualcomm's 48-core ARM chip. There are also efforts such as Google's TPU to make huge efficiency gains with custom silicon. Those efforts could spread to the big operators asking whether they could create a better processor or offload other computation loads onto custom silicon. They can afford to spend a lot of dough to save power or protect their business.

    Hopefully, this problem will end up being the catalyst for the big data center operators to do whatever it takes to foster competitors. Such a critical market should have at least three viable suppliers with very different designs and diverse manufacturing centers.

  43. Re:Or just Buy AMD & get no slow down with mor by swb · · Score: 1

    My guess is that the broadest explanation is that Google, Microsoft and Amazon largely want x86 compatibility because of the efficiencies associated with the network effect of a widely adopted processor, both in terms of software availability and in terms of platform stability.

    As AMD (and failed competitors) have shown, a competing platform to Intel's CPUs isn't easy to pull off. Google et al. could pay a subsidy to AMD to produce a competing product, but there's no guarantee they would get one and they would probably rather spend that money investigating a competitive product they alone would benefit from (like a custom ARM design for their own data center use).