But there's a related question of whether the worker's productivity is actually greater than the cost of feeding and housing them. What happens when it isn't? In the first industrial revolution, the output of weavers using hand-operated looms dropped below the cost of providing them with food because the value of the thing that they were creating was significantly inflated by its scarcity, and that went away with mechanical looms. What should society do with people in this situation? Some you can retrain, but not all.
My Moto G didn't get first-party software updates for long enough to notice this, but when I switched to LineageOS it got faster. New versions of Android have a faster JIT and AOT compiler for Java than the one that my phone originally shipped with. Much of this performance is eaten by more complex apps, but the apps that haven't suffered from creeping featuritis are noticeably faster.
RISC-V hasn't forked, because RISC-V is a specification, not an implementation. For a healthy CPU ecosystem, you want a large number of competing implementations. When Intel, Cyrix, AMD, and IDT were all producing Pentium chips, prices went down and performance went up a lot. The ARM ecosystem benefits hugely from having ARM, Qualcomm, Apple, Cavium and others all designing different CPUs. Partly the competition helps push down prices, but it also means that there's less of a need for anyone to try to produce one-size-fits-all implementations. Cavium's 48-core server ARM chips aren't competing with Qualcomm's 4-core mobile ones, but both will run the same OS, use the same compilers, and so on. As a result, both benefit from sharing software costs. Similarly, some RISC-V vendors are looking at high-end superscalar designs, some at low-end microcontrollers, and so on. The open source versions targeting different markets can, however, share components. Rocket and BOOM share execution pipelines, for example, but Rocket has a single one with a fairly simple register file, whereas BOOM has a scheduler and register rename engine attached and instantiates multiple copies of the Rocket pipelines.
And I'd be happy too if you were selling me beef at $0.50/pound. Even if you told me you weren't washing out the containers every single time and those sorts of time savings were why the beef was so cheap. Until, of course, it turns out that people weren't washing their hands and were failing to wear gloves all the time.
This isn't some kind of secret thing that Intel does. They announced it with great fanfare in the '90s and had a bunch of tech news articles about how clever it is. None of the people who are now saying 'bad Intel sold me a thing that I didn't know was insecure but they should have done' came forward to say 'no, please don't do this (or at least let us turn it off) because it's probably insecure'. They all said 'Wow, Intel has made system calls cheaper, that's great!'. Including Linus, who I seem to recall complained in the late '90s about how much slower system calls were on AMD than Intel.
PS - STEALTHMEM: System-Level Protection Against Cache-Based Side Channel Attacks in the Cloud [usenix.org] is from 2012 and it sounds like it has much better performance costs compared to KAISER, although not being a security researcher I have no idea if this would be sufficient to protect against speculative execution as well.
STEALTHMEM is a protection against a different kind of timing attack and wouldn't protect against Meltdown or Spectre. A huge family of similar mitigations have been proposed, which partition the cache for different security domains. The problem with all of them is that they lead to inefficient cache usage: if one process is memory bandwidth limited and the other is CPU-bound, then this will slow down the bandwidth-limited process a lot, whereas sharing the cache would let it run faster without slowing down the CPU-bound one.
In this case, the attacks use a value that is already in the cache to speculatively do something data-dependent that is measurable, letting you determine what that value was even though you shouldn't have access to it. There are a lot of possible side channels that can be used with this general category of attack. The simplest is to do a data-dependent branch to either a location that's in the cache or one that isn't. Alternatively, doing a data-dependent branch to a divide or an add instruction will give you timing data. There are a bunch of Spectre-like attacks that some of the mitigations won't cover.
Intel cheated for performance
I don't know if you've looked at any processor design since the 386, but they all cheat for performance. 95% of the die area of a modern processor is things that cheat for performance. Most of that is caches, which cheat by pretending that main memory is faster than it really is.
The C integration is intended to allow you to expose things like device I/O registers to JavaScript, so that you can then give your customers a generic bit of firmware into which they can load their own JavaScript. You need some C for initial setup, but a bunch of their demos spend most of their time in JavaScript. And, honestly, with cheap M-class ARM cores running at 30-100MHz, JavaScript is probably perfectly adequate for performance (I'm a bit sceptical of a language that doesn't have 64-bit integers, but that's a different issue). These processors are over an order of magnitude faster than the Alto on which bytecode-interpreted Smalltalk ran a full GUI and suite of applications and they're doing far simpler tasks. The only real issue is when you need to do something with realtime guarantees (for example, the LED strip outside my office that we use for Christmas lights needs to have one of the pins set low and high at specific timings to toggle each LED in the strip), but even then there's only a little bit of code that actually has those timings. For the example of that LED strip, I have a single function that writes an array of colour values to the strip (it felt wasteful to allocate almost half a KB to storing the state of the strip, but I was lazy on a device with 32KB of RAM). If I wanted to use JerryScript instead of the existing C++ code that handles all of the transforms on the lights, then I'd just expose a JavaScript object that wrapped the LED strip and had an update() method to write it out to the strip. Nothing else is timing critical (well, except that the script has to finish an update roughly every 100ms to manage the colour changes, but 100ms is a lot of cycles).
A generic solution by definition supports the lowest common denominator across the many use cases
It's not quite that clear-cut. A well-optimised generic solution is often better than a quickly written specific solution. Sometimes, the domain knowledge is useful. For example, if you're sorting a load of integers and you know that they're always going to be in the range 1-10, then a custom bucket sort will always be better than pretty much any generic sorting algorithm (a well-optimised radix sort may happen to give better performance, but it's unlikely).
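To make the bucket sort point concrete, here is a minimal sketch in Python. The function name and the `lo`/`hi` parameters are made up for illustration. The only thing it relies on is the domain knowledge described above: every value is an integer in a small known range, so you can count occurrences instead of comparing elements.

```python
def bucket_sort_small_range(values, lo=1, hi=10):
    """Sort integers known to lie in [lo, hi] by counting them.

    Hypothetical helper for illustration: one bucket (a counter) per
    possible value, so the work is O(n + k) with k = hi - lo + 1 and
    no element is ever compared with another.
    """
    counts = [0] * (hi - lo + 1)
    for v in values:
        if not lo <= v <= hi:
            # The speed-up is only valid while the range assumption holds.
            raise ValueError(f"{v} is outside the range {lo}-{hi}")
        counts[v - lo] += 1

    out = []
    for offset, n in enumerate(counts):
        # Emit each value as many times as it was seen, in ascending order.
        out.extend([lo + offset] * n)
    return out
```

The range check is the price of the specialisation: a generic sort keeps working if the input changes, whereas this one has to refuse anything outside the range it was built for.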
With memory management, this is rarely the case. Any ad-hoc use of malloc and free (or new and delete) is likely to be no better than using an optimised garbage collector (especially in an environment with concurrency) and must be 100% correct in every case to avoid security-critical bugs. Good C/C++ code that actually needs manual memory management typically uses custom allocators. For example, dovecot provides a separate stack allocator that is not tied to the call stack, so you can cheaply allocate memory that will be returned up the stack. You may only pop things from the stack when you have finished processing return values. This can be a lot faster than a generic garbage collector because it is taking advantage of the fact that some things have statically known lifetimes and are known not to escape. In theory, a compiler could do this automatically (and a few research implementations do), but unless you have a language with a type system that makes this kind of ownership explicit then it's very hard. Similarly, LLVM uses a bunch of bump-the-pointer allocators that release all of their memory at once (and don't bother running destructors) or simply don't release memory when used in a short-lived tool such as clang (the memory allocated by them is almost all live right up until you finish emitting code, and then you exit the process so there's no point in bothering to clean up before then). This is very fast to allocate, and for the typical use case of a compiler is free to deallocate.
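A toy model of the bump-the-pointer idea, for anyone who hasn't seen one. This is Python, so it only shows the bookkeeping, not the performance, and it is not Dovecot's or LLVM's actual code: the `Arena` class and its method names are invented for this sketch. Allocation is just advancing an offset into one preallocated buffer, and `mark()`/`release()` give the stack-like behaviour where everything allocated since a mark is freed in one step.

```python
class Arena:
    """Toy bump-the-pointer allocator over one preallocated buffer."""

    def __init__(self, capacity):
        self._buf = bytearray(capacity)
        self._top = 0  # offset of the first free byte

    def alloc(self, size, align=8):
        # Round the current offset up to the requested alignment.
        start = (self._top + align - 1) // align * align
        end = start + size
        if end > len(self._buf):
            raise MemoryError("arena exhausted")
        self._top = end  # the entire cost of an allocation is this bump
        return memoryview(self._buf)[start:end]

    def mark(self):
        # Remember the current top so the caller can pop back to it later.
        return self._top

    def release(self, mark):
        # Frees everything allocated since mark() at once; no per-object
        # bookkeeping and no destructors are run.
        self._top = mark

    def used(self):
        return self._top
```

Something like `m = arena.mark()`, a few `arena.alloc(n)` calls while building a result, then `arena.release(m)` once the caller has finished with it is the whole usage pattern. The short-lived-tool case is the same thing without ever calling `release()`.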
Bullshit you say, and yet it's only Intel and a few, comparatively insignificant ARM chips which are affected by meltdown, which btw, was what Linus was referring to.
Yes, because Intel patented the technique and didn't license it to anyone else.
I can only presume AMD is an imaginary entity in your little world, because they apparently managed to solve all these impossible problems without handing out the keys to the kingdom to everyone who asked for them.
Nope, AMD pays a higher penalty on system calls, though they mitigate this to some extent by having shorter and narrower pipelines.
If vendors are responsible for classes of vulnerabilities that are discovered years after they ship their product then we're going to have a very fun few years, before all computer vendors go out of business.
Much like I don't consider a guy with a table saw and staple gun who cuts and assembles pieces of wood a "cabinet maker" if he doesn't know how to make dovetails or french polish, I don't consider someone who assembles pieces of code a "programmer" if he can't handle manual memory management, deterministic performance and pointer arithmetic.
You're conflating 'knows how to' and 'thinks it's a good idea to.' Just because someone can make dovetails and french polish doesn't mean that they won't buy Ikea furniture when there's something that suits their requirements, because it's a waste of their skills to manually build something when a mass-produced off-the-shelf alternative is adequate. Similarly, just because someone can handle manual memory management and pointer arithmetic doesn't mean that it's the best use of their skills to pick a language where they need to: in most cases, it's better for them to pick a language that handles these and for them to focus on data structure and algorithm design.
Scheme is technically a later language, but it's a dialect of Lisp, which was one of the inspirations for Smalltalk (and, in particular, where Smalltalk copied this feature from), so I'm not sure it really counts in this regard.
System calls were always slow because they used to be called via a software interrupt call.
And software interrupts were slow because they were not considered branches by early branch predictors and so triggered a complete pipeline flush equivalent to a branch mispredict (followed immediately by another branch, which SYSCALL removed). Intel addressed this by treating software interrupts as normal branches for the branch predictor, with an extra hint that they changed privilege level. This gave a small improvement to the Pentium, but was a huge boost on the Pentium 4, where the pipelines were long and deep enough that they had up to 140 instructions in flight at a time and having to flush all of those for a system call was painful.
Speculative execution doesn't mean we have to have this problem, AMD managed to do it fine. No one can say this is by design, if it is by design then it should have been documented since 1995 that the MMU protection can be bypassed.
Speculative execution across ring changes is the root cause of this. AMD doesn't do this because Intel patented it, told AMD, and didn't include it in their cross-licensing agreement. You can bet that AMD was just waiting for the patent to expire before doing it, because without it you have to wait until all branches up to the system call have been retired before you can perform the transition. The MMU protection isn't bypassed, because the instructions that would be bypassing the MMU protection are cancelled. There is a side channel that allows you to use the changes in cache behaviour to determine what the values in memory would have been.
Was this attack known, and deemed an acceptable risk because of the incredibly low rate at which data could supposedly be extracted?
Not this specific attack, but it was known that any source of nondeterminism in a processor was a source of side channels. These were largely ignored because these attacks get lots of papers at top security conferences but are really hard and slow to carry out in practice. Most of the existing attacks used the cache, but there are others involving things like the fact that computation on denormals is much slower than on normal floating point values (a fun one of these lets you scrape browser contents via WebGL and I don't believe has been mitigated yet in spite of being published a couple of years ago).
Speculative execution was known to be a potential source of these side channels for a while. Now that it's shown to be feasible, expect a lot more similar attacks.
And when Intel did this, everyone was happy that the cost of system calls went down. Now everyone is saying that they secretly knew that it was a security issue and only an idiot would have implemented it.
To quote Linus "A *competent* CPU engineer would fix this by making sure speculation doesn't happen across protection domains."
That's bullshit. When Intel introduced speculation across protection domains everyone including Linus was happy because it reduced system call costs. Without this, as soon as you get to a syscall / sysenter instruction, you stall the pipeline until all pending instructions have been committed. On a modern Intel CPU with close to 200 instructions in flight at a time, that's a measurable performance overhead.
We've known for a long time that side channels of this kind were possible, but not that they were performant. The new attacks are not interesting because they're side channels that allow data to be disclosed, they're interesting because they're side channels that allow disclosure far faster than previously believed. CPU designers believed that this kind of attack could only be exploited to get a bit every few seconds, at which rate it's not really worth trying as an attack and is pretty easy for software to spot (hmm, why is this thread at 100% and triggering insane numbers of cache misses? Looks malicious...). Now we know that you can use these attacks to get data at about 0.5MB/s, so you can scan the whole of memory in a few minutes.
Sure it does. If you want to keep something quiet until you are ready to announce it, then you DO NOT tell any of the people who have a track record of spilling the beans.
When has FreeBSD ever disclosed a security vulnerability under embargo? FreeBSD has a security officer and a secteam group that are the only ones that have access to any embargoed security information and have separate infrastructure from the rest of the project for preparing fixes. Only people who have signed the relevant NDAs are allowed access to anything shared with this group and they are normally given information about embargoed security issues as a result.
Regardless of where you personally stand on the idea of embargos and standing up for principles, Theo refused to go along with an embargo previously and it was quite likely that he wouldn't do so this time either
You do realise that FreeBSD and OpenBSD are entirely different projects, run by different people, with different infrastructure and different codebases and that Theo De Raadt has no connection to the FreeBSD project?
Some people bought Netbooks because they were dirt cheap. Mostly these people are now buying larger laptops because the screen costs have gone down so much that it's not worth trying to make everything smaller to save a few dollars, so the cheapest laptops are no longer the smallest.
Some people bought Netbooks because they were small. Most of these people are now using tablets with attached keyboards. My father has a MS Surface that he's happy with: it runs Windows, Office, and all of the business software that he cares about, and is very small. He doesn't need anything particularly fast. That's probably a good upgrade path for anyone who was running Windows on a Netbook and for whom cost is not a primary motivation.
Some people wanted both small and cheap. These people are probably best served with a cheap Android tablet and a folding Bluetooth keyboard. If you want Windows, that's a problem.
That doesn't explain why FreeBSD wasn't notified until 5-6 months after Intel and ARM knew about the issue and until after Apple had shipped a patch. It also didn't help that there was no real coordination in releases. Apple shipped a binary update and there were patches in the Linux tree containing mitigations before the official end of the embargo period.
On the other hand, my understanding is that Go's methods are (intentionally) sufficiently dumb that they can be called with an indirect jump that can be reasonably predicted by current breed of CPUs
That depends on whether you use an interface or not. If you don't, then it's equivalent to calling a final Java method (you can tell statically what the destination will be). If you do, then (as with Java) it's an indirect jump via a vtable. The problem with method calling is not so much the cost of the jump, it's the cost of call frame setup and of missed optimisation opportunities. For a small method, you may end up doing 2-3 instructions of real work, but 10-15 instructions of setup. Even if the jump is free, you're still paying a big penalty for invoking it at all. You're also missing out on later optimisation opportunities and the ability to do things like interleaving two dependent memory accesses from the method with in-register operations from later or earlier on to help keep pipelines full. These days, the main reason that compilers do inlining is to expose more optimisation opportunities. The same is true of loop unrolling, where modern branch predictors have mostly eliminated the costs of repeated loop iterations (though rename register pressure can still be a problem).
I think a lot of the problem is a feedback cycle that I've complained about before: people write C code, because C is fast. People design processors optimised for C code, because performance-critical code is written in C. This has led to a push for high-levels of instruction-level parallelism (and therefore speculative execution), because that's the easiest way of getting good performance out of a language that's designed to be close to the metal, when the metal in question is a PDP-11. If you designed a processor for a language that had cheap thread creation and enforced immutable-xor-shared, such as Erlang, then you would have a lot of cores, much simpler cache coherency (anything where two cores are accessing the same mutable data is either a bug or a thread migration event, and so can be slow), no speculative execution, no need for high ILP. You might dedicate more transistors to making context switches fast (even allowing cores to have an arbitrarily large pool of threads that they can pull in from memory when some of the ones occupying hardware contexts are blocked). If you don't care about ILP, then suddenly the big advantage of register machine instruction sets over stack machines goes away (ILP from stack machines is hard) and you're left with stack machines giving better code density.
The thing is, we know how to build these machines. There are commercial projects with most of the characteristics that I've outlined and research projects with the others. My hope with the Spectre debacle is that we'll see some new chips that are faster when running code written in higher-level languages than they are running C programs.
just as no one is seriously going to try to use Javascript in an embedded microprocessor
I draw your attention to JerryScript, developed by Samsung as a lightweight JavaScript interpreter specifically designed for running in embedded microprocessors.
The slowest Raspberry Pi has 512MB of RAM and a 700MHz 32-bit processor. The original Smalltalk-80 implementation ran on a 2MHz 16-bit processor with 512KB of RAM and contained a full graphical user interface and applications written entirely in Smalltalk, a pure object-oriented language that didn't even have concessions to implementation ease like primitive types or intraprocedural flow control[1]. It did use some clever microcode tricks to make things like screen updates faster, but even without these Smalltalk was quite performant on a 20MHz processor.
The idea that a RPi is too slow for a high-level language to be fast enough is astonishing.
[1] In Smalltalk, integers are immutable instances of the SmallInt class, which is typically implemented as a tagged pointer. If integer arithmetic overflows, the result is an immutable instance of the BigInt class, which is stored as a pointer to an arbitrary-precision integer object. It's depressing how later dynamic languages, particularly scripting languages, haven't managed to have as useful integers. Smalltalk also had a variety of floating point types. It did not have things like if statements in the language. True and False were singleton subclasses of the Boolean class, which implemented methods like ifTrue: and ifFalse:. These took a block (closure) as an argument and either executed it or didn't execute it, depending whether they were True or False.
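For anyone who hasn't seen Smalltalk-style conditionals, here's a rough imitation in Python. All of the names are made up, and in real Smalltalk these are methods on the True and False classes selected by ordinary message dispatch. The point it shows is that the choice between branches is made by which object receives the message, not by an if statement in the language.

```python
class SmalltalkTrue:
    """Stand-in for Smalltalk's True: runs the 'true' block, ignores the other."""

    def if_true(self, block):
        return block()

    def if_false(self, block):
        return None

    def if_true_if_false(self, true_block, false_block):
        return true_block()


class SmalltalkFalse:
    """Stand-in for Smalltalk's False: mirror image of SmalltalkTrue."""

    def if_true(self, block):
        return None

    def if_false(self, block):
        return block()

    def if_true_if_false(self, true_block, false_block):
        return false_block()


TRUE = SmalltalkTrue()
FALSE = SmalltalkFalse()


def less_than(a, b):
    # The primitive comparison is where the host language does the real
    # work; everything after it is plain method dispatch on the result.
    return (FALSE, TRUE)[a < b]
```

So `less_than(1, 2).if_true_if_false(lambda: "smaller", lambda: "not smaller")` picks the first closure purely because the receiver is the `TRUE` object, with the lambdas playing the role of Smalltalk blocks.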
Except, in this example, in pretty much any other language the string representation would be a pair of a pointer to a buffer and a length and the length would be guaranteed by the language runtime to be trusted, so you'd never have to do the equivalent of strlen.
And even with all of the checks that you describe, you're only looking at the most trivial part of the problem: single-threaded execution. In the C memory model, if there's an update to either size or buffer from one thread, without explicit synchronisation establishing a happens-before relationship with another thread, then the other thread may see mismatched versions of these and see a length that corresponded to a longer buffer. Unless you make your objects immutable (which C doesn't enforce, but at least a custom static analyser can check, in the absence of any use-after-free errors), this kind of bug can be incredibly subtle, but exploitable.
The basic job of a programmer is to automate things: to make a computer do things, rather than a human. That is the entire point of all programming. What kind of programmer, when faced with a problem for which there is an existing generic solution with adequate performance prefers to write an ad-hoc solution? A poor one. Yet that's exactly what you're advocating: your notion of a 'halfway good programmer' is one that doesn't make use of the results of programming.
There are places where manual memory management, deterministic performance, and pointer arithmetic are absolutely essential requirements. In these situations, C or C++ are about the only options (Rust might be, but most of these projects require long-term maintenance and Rust is far too young to consider for that kind of project). For everything else, your choice is either reinvent the wheel in an ad-hoc way, or use one that's been well tested and optimised.
But there's a related question of whether the worker's productivity is actually greater than the cost of feeding and housing them. What happens when it isn't? In the first industrial revolution, the output of weavers using hand-operated looms dropped below the cost of providing them with food because the value of the thing that they were creating was significantly inflated by its scarcity, and that went away with mechanical looms. What should society do with people in this situation? Some you can retrain, but not all.
My Moto G didn't get first-party software update for long enough to notice this, but when I switched to LineageOS it got faster. New versions of Android have a faster JIT and AOT compiler for Java than the one that my phone originally shipped with. Much of this performance is eaten by more complex apps, but the apps that haven't suffered from creeping featureitus are noticeably faster.
RISC-V hasn't forked, because RISC-V is a specification, not an implementation. For a healthy CPU ecosystem, you want a large number of competing implementations. When Intel, Cyrix, AMD, and IDT were all producing Pentium chips, prices went down and performance went up a lot. The ARM ecosystem benefits hugely from having ARM, Qualcomm, Apple, Cavium and others all designing different CPUs. Partly the competition helps push down prices, but it also means that there's less of a need for anyone to try to produce one-size-fits-all implementations. Cavium's 48-core server ARM chips aren't competing with Qualcomm's 4-core mobile ones, but both will run the same OS, use the same compilers, and so on. As a result, both benefit from sharing software costs. Similarly, some RISC-V vendors are looking at high-end superscalar designs, some at low-end microcontrollers, and so on. The open source versions targeting different markets; however, can share components. Rocket and BOOM share execution pipelines, for example, but Rocket has a single one with a fairly simple register file, whereas BOOM has a scheduler and register rename engine attached and instantiates multiple copies of the Rocket pipelines.
And I'd be happy too if you were selling me beef at $0.50/pound. Even if you told me you weren't washing out the containers every single time and those sorts of time savings were why the beef were so cheap. Until, of course, it turns out that people weren't washing their hands and failing to wear gloves all the time.
This isn't some kind of secret thing that Intel does. They announced it with great fanfare in the '90s and had a bunch of tech news articles about how clever it is. None of the people who are now saying 'bad Intel sold me a thing that I didn't know was insecure but they should have done' came forward to say 'no, please don't do this (or at least let us turn it off) because it's probably insecure'. They all said 'Wow, Intel has made system calls cheaper, that's great!'. Including Linus, who I seem to recall complained in the late '90s about how much slower system calls were on AMD than Intel.
PS - STEALTHMEM: System-Level Protection Against Cache-Based Side Channel Attacks in the Cloud [usenix.org] from 2012 and it sounds like much better performance costs compared to KAISER, although not being a security researcher I have no idea if this would be sufficiently to protest against speculative execution as well.
STEALTHMEM is a protection against a different kind of timing attack and wouldn't protect against Meltdown or Spectre. A huge family of similar mitigations have been proposed, which partition the cache for different security domains. The problem with all of them is that they lead to inefficient cache usage: if one process is memory bandwidth limited and the other is CPU-bound, then this will slow down the bandwidth-limited process a lot, whereas sharing the cache would let it run faster without slowing down the CPU-bound one.
In this case, the attacks use a value that is already in the cache to speculatively do something data-dependent that is measurable, letting you determine what that value was even though you shouldn't have access to it. There are a lot of possible side channels that can be used with this general category of attack. The simplest is to do a data-dependent branch to either a location that's in the cache or one that isn't. Alternatively, doing a data-dependent branch to a divide or an add instruction will give you timing data. There are a bunch of Spectre-like attacks that some of the mitigations won't cover.
Intel cheated for performance
I don't know if you've looked at any processor design since the 386, but they all cheat for performance. 95% of the die area of a modern processor is things that cheat for performance. Most of that is caches, that cheap by pretending that main memory is faster than it really is.
The C integration is intended to allow you to expose things like device I/O registers to JavaScript, so that you can then give your customers a generic bit of firmware into which they can load their own JavaScript. You need some C for initial setup, but a bunch of their demos spend most of their time in JavaScript. And, honestly, with cheap M-class ARM cores running at 30-100MHz, JavaScript is probably perfectly adequate for performance (I'm a bit sceptical of a language that doesn't have 64-bit integers, but that's a different issue). These processors are over an order of magnitude faster than the Alto on which bytecode-interpreted Smalltalk ran a full GUI and suite of applications and they're doing far simpler tasks. The only real issue is when you need to do something with realtime guarantees (for example, the LED strip outside my office that we use for Christmas lights is needs to have one of the pins set low and high at specific timings to toggle each LED in the strip), but even then there's only a little bit of code that actually has those timings. For the example of that LED strip, I have a single function that writes an array of colour values to the strip (it felt wasteful to allocate almost half a KB to storing the state of the strip, but I was lazy on a device with 32KB of RAM). If I wanted to use JerryScript instead of the existing C++ code that handles all of the transforms on the lights, then I'd just expose a JavaScript object that wrapped the LED strip and had an update() method to write it out to the strip. Nothing else is timing critical (well, except that the script has to finish update roughly every 100ms to manage the colour changes, but 100ms is a lot of cycles).
A generic solution by definition supports the lowest common denominator across the many use cases
It's not quite that clear-cut. A well-optimised generic solution is often better than a quickly written specific solution. Sometimes, the domain knowledge is useful. For example, if you're sorting a load of integers and you know that they're always going to be in the range 1-10, then a custom bucket sort will always be better than pretty much any generic sorting algorithm (a well-optimised radix sort may happen to give better performance, but it's unlikely).
With memory management, this is rarely the case. Any ad-hoc use of malloc and free (or new and delete) is likely to be no better than using an optimised garbage collector (especially in an environment with concurrency) and must be 100% correct in every case to avoid security-critical bugs. Good C/C++ code that actually needs manual memory management typically uses custom allocators. For example, dovecot provides a separate stack allocator that is not tied to the call stack, so you can cheaply allocate memory that will be returned up the stack. You may only pop things from the stack when you have finished processing return values. This can be a lot faster than a generic garbage collector because it is taking advantage of the fact that some things have statically known lifetimes and are known not to escape. In theory, a compiler could do this automatically (and a few research implementations do), but unless you have a language with a type system that makes this kind of ownership explicit then it's very hard. Similarly, LLVM uses a bunch of bump-the-pointer allocators that release all of their memory at once (and don't bother running destructors) or simply don't release memory when used in a short-lived tool such as clang (the memory allocated by them is almost all live right up until you finish emitting code, and then you exit the process so there's no point in bothering to clean up before then). This is very fast to allocate, and for the typical use case of a compiler is free to deallocate.
Bullshit you say, and yet it's only Intel and a few, comparatively insignificant ARM chips which are affected by meltdown, which btw, was what Linus was referring to.
Yes, because Intel patented the technique and didn't license it to anyone else.
I can only presume AMD is an imaginary entity in your little world, because they apparently managed to solve all these impossible problems without handing out the keys to the kingdom to everyone who asked for them.
Nope, AMD pays a higher penalty on system calls, though they mitigate this to some extent by having shorter and narrower pipelines.
If vendors are responsible for classes of vulnerabilities that are discovered years after they ship their product then we're going to have a very fun few years, before all computer vendors go out of business.
Much like I don't consider a guy with a table saw and staple gun who cuts and assembles pieces of wood a "cabinet maker" if he doesn't know how to make dovetails or french polish, I don't consider someone who assembles pieces of code a "programmer" if he can't handle manual memory management, deterministic performance and pointer arithmetic.
You're conflating 'knows how to' and 'thinks it's a good idea to.' Just because someone can make dovetails and french polish doesn't mean that they won't buy Ikea furniture when there's something that suits their requirements, because it's a waste of their skills to manually build something when a mass-produced off-the-shelf alternative is adequate. Similarly, just because someone can handle manual memory management and pointer arithmetic doesn't mean that it's the best use of their skills to pick a language where they need to: in most cases, it's better for them to pick a language that handles these and for them to focus on data structure and algorithm design.
Scheme is technically a later language, but it's a dialect of Lisp, which was one of the inspirations for Smalltalk (and, in particular, where Smalltalk copied this feature from), so I'm not sure it really counts in this regard.
System calls were always slow because they used to be called via a software interrupt call.
And software interrupts were slow because they were not considered branches by early branch predictors and so triggered a complete pipeline flush equivalent to a branch mispredict (followed immediately by another branch, which SYSCALL removed). Intel addressed this by treating software interrupts as normal branches for the branch predictor, with an extra hint that they changed privilege level. This gave a small improvement to the Pentium, but was a huge boost on the Pentium 4, where the pipelines were long and deep enough that they had up to 140 instructions in flight at a time and having to flush all of those for a system call was painful.
Speculative execution doesn't mean we have to have this problem; AMD managed to do it fine. No one can say this is by design; if it is by design then it should have been documented since 1995 that the MMU protection can be bypassed.
Speculative execution across ring changes is the root cause of this. AMD doesn't do this because Intel patented it, told AMD, and didn't include it in their cross-licensing agreement. You can bet that AMD was just waiting for the patent to expire before doing it, because without it you have to wait until all branches up to the system call have been retired before you can perform the transition. The MMU protection isn't bypassed, because the instructions that would be bypassing the MMU protection are cancelled. There is a side channel that allows you to use the changes in cache behaviour to determine what the values in memory would have been.
Was this attack known, and deemed an acceptable risk because of the incredibly low rate at which data could supposedly be extracted?
Not this specific attack, but it was known that any source of nondeterminism in a processor was a source of side channels. These were largely ignored because these attacks get lots of papers at top security conferences but are really hard and slow to carry out in practice. Most of the existing attacks used the cache, but there are others involving things like the fact that computation on denormals is much slower than on normal floating point values (a fun one of these lets you scrape browser contents via WebGL and I don't believe has been mitigated yet in spite of being published a couple of years ago).
Speculative execution was known to be a potential source of these side channels for a while. Now that it's shown to be feasible, expect a lot more similar attacks.
And when Intel did this, everyone was happy that the cost of system calls went down. Now everyone is saying that they secretly knew that it was a security issue and only an idiot would have implemented it.
To quote Linus "A *competent* CPU engineer would fix this by making sure speculation doesn't happen across protection domains."
That's bullshit. When Intel introduced speculation across protection domains everyone including Linux was happy because it reduced system call costs. Without this, as soon as you get to a syscall / sysenter instruction, you stall the pipeline until all pending instructions have been committed. On a modern Intel CPU with close to 200 instructions in flight at a time, that's a measurable performance overhead.
We've known for a long time that side channels of this kind were possible, but not that they were performant. The new attacks are not interesting because they're side channels that allow data to be disclosed, they're interesting because they're side channels that allow disclosure far faster than previously believed. CPU designers believed that this kind of attack could only be exploited to get a bit every few seconds, at which rate it's not really worth trying as an attack and is pretty easy for software to spot (hmm, why is this thread at 100% and triggering insane numbers of cache misses? Looks malicious...). Now we know that you can use these attacks to get data at about 0.5MB/s, so you can scan the whole of memory in a few minutes.
Sure it does. If you want to keep something quiet until you are ready to announce it, then you DO NOT tell any of the people who have a track record of spilling the beans.
When has FreeBSD ever disclosed a security vulnerability under embargo? FreeBSD has a security officer and a secteam group that are the only ones that have access to any embargoed security information and have separate infrastructure from the rest of the project for preparing fixes. Only people who have signed the relevant NDAs are allowed access to anything shared with this group and they are normally given information about embargoed security issues as a result.
Regardless of where you personally stand on the idea of embargos and standing up for principles, Theo refused to go along with an embargo previously and it was quite likely that he wouldn't do so this time either
You do realise that FreeBSD and OpenBSD are entirely different projects, run by different people, with different infrastructure and different codebases, and that Theo de Raadt has no connection to the FreeBSD project?
Some people bought Netbooks because they were dirt cheap. Mostly these people are now buying larger laptops because the screen costs have gone down so much that it's not worth trying to make everything smaller to save a few dollars, so the cheapest laptops are no longer the smallest.
Some people bought Netbooks because they were small. Most of these people are now using tablets with attached keyboards. My father has a MS Surface that he's happy with: it runs Windows, Office, and all of the business software that he cares about, and is very small. He doesn't need anything particularly fast. That's probably a good upgrade path for anyone who was running Windows on a Netbook and for whom cost is not a primary motivation.
Some people wanted both small and cheap. These people are probably best served with a cheap Android tablet and a folding Bluetooth keyboard. If you want Windows, that's a problem.
That doesn't explain why FreeBSD wasn't notified until 5-6 months after Intel and ARM knew about the issue and until after Apple had shipped a patch. It also didn't help that there was no real coordination in releases. Apple shipped a binary update and there were patches in the Linux tree containing mitigation before the official end of the embargo period.
On the other hand, my understanding is that Go's methods are (intentionally) sufficiently dumb that they can be called with an indirect jump that can be reasonably predicted by current breed of CPUs
That depends on whether you use an interface or not. If you don't, then it's equivalent to calling a final Java method (you can tell statically what the destination will be). If you do, then (as with Java) it's an indirect jump via a vtable. The problem with method calling is not so much the cost of the jump, it's the cost of call frame setup and of missed optimisation opportunities. For a small method, you may end up doing 2-3 instructions of real work, but 10-15 instructions of setup. Even if the jump is free, you're still paying a big penalty for invoking it at all. You're also missing out on later optimisation opportunities and the ability to do things like interleaving two dependent memory accesses from the method with in-register operations from later or earlier on to help keep pipelines full. These days, the main reason that compilers do inlining is to expose more optimisation opportunities. The same is true of loop unrolling, where modern branch predictors have mostly eliminated the costs of repeated loop iterations (though rename register pressure can still be a problem).
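The two kinds of call look roughly like this when written out by hand in C. This is only an illustration of the shape, not what the Go or Java toolchains actually emit; the shape types and function names are invented.

```c
#include <assert.h>
#include <stddef.h>

typedef struct shape shape;

/* Hand-rolled vtable: one function pointer per "method". */
typedef struct {
    int (*area)(const shape *self);
} shape_vtable;

struct shape {
    const shape_vtable *vtable;
    int width;
    int height;
};

/* A small method: a couple of instructions of real work. */
int rect_area(const shape *self)
{
    return self->width * self->height;
}

const shape_vtable rect_vtable = { rect_area };

/* Destination known statically: the compiler is free to inline
 * rect_area and optimise across the call boundary. */
int area_direct(const shape *s)
{
    return rect_area(s);
}

/* Destination loaded from the vtable at run time: an indirect jump
 * that normally blocks inlining, however well it is predicted. */
int area_dynamic(const shape *s)
{
    assert(s->vtable != NULL);
    return s->vtable->area(s);
}
```

Both functions return the same value; the difference is only in what the optimiser is allowed to see.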
I think a lot of the problem is a feedback cycle that I've complained about before: people write C code, because C is fast. People design processors optimised for C code, because performance-critical code is written in C. This has led to a push for high levels of instruction-level parallelism (and therefore speculative execution), because that's the easiest way of getting good performance out of a language that's designed to be close to the metal, when the metal in question is a PDP-11. If you designed a processor for a language that had cheap thread creation and enforced immutable-xor-shared, such as Erlang, then you would have a lot of cores, much simpler cache coherency (anything where two cores are accessing the same mutable data is either a bug or a thread migration event, and so can be slow), no speculative execution, no need for high ILP. You might dedicate more transistors to making context switches fast (even allowing cores to have an arbitrarily large pool of threads that they can pull in from memory when some of the ones occupying hardware contexts are blocked). If you don't care about ILP, then suddenly the big advantage of register machine instruction sets over stack machines goes away (ILP from stack machines is hard) and you're left with stack machines giving better code density.
The thing is, we know how to build these machines. There are commercial projects with most of the characteristics that I've outlined and research projects with the others. My hope with the Spectre debacle is that we'll see some new chips that are faster when running code written in higher-level languages than they are running C programs.
just as no one is seriously going to try to use Javascript in an embedded microprocessor
I draw your attention to JerryScript, developed by Samsung as a lightweight JavaScript interpreter specifically designed for running in embedded microprocessors.
The idea that a RPi is too slow for a high-level language to be fast enough is astonishing.
[1] In Smalltalk, integers are immutable instances of the SmallInt class, which is typically implemented as a tagged pointer. If integer arithmetic overflows, the result is an immutable instance of the BigInt class, which is stored as a pointer to an arbitrary-precision integer object. It's depressing how later dynamic languages, particularly scripting languages, haven't managed to have as useful integers. Smalltalk also had a variety of floating point types. It did not have things like if statements in the language. True and False were singleton subclasses of the Boolean class, which implemented methods like ifTrue: and ifFalse:. These took a block (closure) as an argument and either executed it or didn't execute it, depending on whether they were True or False.
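A sketch of the tagged-pointer trick, for anyone who hasn't seen it. This is not any particular Smalltalk VM's layout: the low-bit tag, the value_t type and the function names are assumptions for illustration, and the promotion to an arbitrary-precision object is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One machine word: low bit 1 means small integer, low bit 0 would be
 * a pointer to a heap object (for example a big integer). */
typedef uintptr_t value_t;

/* Small integers have one bit fewer than intptr_t. */
#define SMALL_MAX (INTPTR_MAX / 2)
#define SMALL_MIN (INTPTR_MIN / 2)

bool is_small_int(value_t v)
{
    return (v & 1) != 0;
}

value_t box_small(intptr_t i)
{
    assert(i >= SMALL_MIN && i <= SMALL_MAX);
    return ((uintptr_t)i << 1) | 1;
}

/* Relies on two's complement conversion and arithmetic right shift of
 * negative values; both are implementation-defined in C but hold on
 * all mainstream compilers. */
intptr_t unbox_small(value_t v)
{
    return (intptr_t)v >> 1;
}

/* Adds two small integers. Returns false on overflow, where a real
 * runtime would allocate a big integer object instead. */
bool small_add(value_t a, value_t b, value_t *out)
{
    intptr_t x = unbox_small(a);
    intptr_t y = unbox_small(b);
    /* x and y each fit in one bit fewer than intptr_t, so x + y
     * cannot overflow intptr_t itself. */
    intptr_t sum = x + y;
    if (sum > SMALL_MAX || sum < SMALL_MIN) {
        return false;
    }
    *out = box_small(sum);
    return true;
}
```

The common case costs a shift, an add and a range check, and the programmer never sees a silently wrapped result.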
Except, in this example, in pretty much any other language the string representation would be a pair of a pointer to a buffer and a length and the length would be guaranteed by the language runtime to be trusted, so you'd never have to do the equivalent of strlen.
And even with all of the checks that you describe, you're only looking at the most trivial part of the problem: single-threaded execution. In the C memory model, if there's an update to either size or buffer from one thread, without explicit synchronisation establishing a happens-before relationship with another thread, then the other thread may see mismatched versions of these and see a length that corresponded to a longer buffer. Unless you make your objects immutable (which C doesn't enforce, but at least a custom static analyser can check, in the absence of any use-after-free errors), this kind of bug can be incredibly subtle, but exploitable.
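For reference, the pointer-and-length representation looks something like this if you emulate it in C. The counted_str type and helper names are invented for the example, and nothing here enforces that the length is trustworthy the way a language runtime would. It also does nothing about the threading problem above: the struct is passed by value, but the buffer it points to is still shared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: the length travels with the pointer,
 * so no operation ever has to scan for a terminator. */
typedef struct {
    const char *data; /* not necessarily NUL-terminated */
    size_t length;
} counted_str;

/* Bounds-checked read: returns false instead of reading past the end. */
bool counted_char_at(counted_str s, size_t index, char *out)
{
    assert(out != NULL);
    if (index >= s.length) {
        return false;
    }
    *out = s.data[index];
    return true;
}

/* Comparison uses the stored lengths, so there is no strlen equivalent. */
bool counted_equal(counted_str a, counted_str b)
{
    if (a.length != b.length) {
        return false;
    }
    return a.length == 0 || memcmp(a.data, b.data, a.length) == 0;
}
```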
The basic job of a programmer is to automate things: to make a computer do things, rather than a human. That is the entire point of all programming. What kind of programmer, when faced with a problem for which there is an existing generic solution with adequate performance prefers to write an ad-hoc solution? A poor one. Yet that's exactly what you're advocating: your notion of a 'halfway good programmer' is one that doesn't make use of the results of programming.
There are places where manual memory management, deterministic performance, and pointer arithmetic are absolutely essential requirements. In these situations, C or C++ are about the only options (Rust might be, but most of these projects require long-term maintenance and Rust is far too young to consider for that kind of project). For everything else, your choice is either to reinvent the wheel in an ad-hoc way, or to use one that's been well tested and optimised.