Intel Flagship Core i7-6950X Broadwell-E To Offer 10-Cores, 20-Threads, 25MB L3 (hothardware.com)
MojoKid writes: Intel has made a habit of launching enthusiast versions of previous generation processors after it releases a new architecture. As was the case with Intel's Haswell architecture, high-end Broadwell-E variants are expected and it looks like Intel is readying a doozy. Recently revealed details show four new processors under the new HEDT (High-End Desktop) banner for Broadwell, which is one more SKU than Haswell-E brought to the table. The most intriguing of the new chips is the Core i7-6950X, a monster 10-core CPU with Hyper Threading support. That gives the Core i7-6950X 20 threads to play with, along with a whopping 25MB of L3 cache. The caveat is the CPU's clockspeed — it will run at just 3.0GHz (base), so for applications that aren't properly tuned to take full advantage of large core counts and threads, it could potentially trail behind the Core i7-6700K, a quad-core Skylake processor clocked at 3.4GHz (base) to 4GHz (Turbo).
Mainstream programming languages are still sequential by default and the likes of OpenCL are too hard to learn for simple tasks. UI code is still single threaded in most systems, and that drags most computation into that thread as well through programmer laziness. It's time for languages which are parallel by default and where the ability to parallelize a loop is verifiable at compile time. Yes I know FORTRAN is much closer to that than C/Java, but that's due to being primitive to a degree that will not fly in 2015.
How is your SSD speed and thermal envelope doing with make -j20?
I have been seeing this a lot lately for some reason. The i7 6700K runs at a 4.0 GHz base clock and turbos up to 4.2 GHz. So it will be quite a performance beatdown by Skylake if clockspeed instead of threads is important.
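To put rough numbers on the clock-versus-cores trade-off, here is a minimal sketch using an Amdahl-style model. The model is an assumption for illustration, not anything published by Intel: it uses base clocks only (3.0 GHz for the 10-core part, 4.0 GHz for the quad-core) and ignores Turbo, Hyper-Threading, and any per-clock differences between Broadwell and Skylake. The function names are hypothetical.

```python
def relative_throughput(cores: int, clock_ghz: float, parallel_fraction: float) -> float:
    """Amdahl-style estimate of speed, relative to one core at 1 GHz.

    parallel_fraction is the share of the work (0.0 to 1.0) that scales
    across cores; the rest runs on a single core.
    """
    serial_fraction = 1.0 - parallel_fraction
    return clock_ghz / (serial_fraction + parallel_fraction / cores)


def break_even_fraction(cores_a: int, clock_a: float, cores_b: int, clock_b: float) -> float:
    """Parallel fraction at which chips A and B tie under the model above."""
    numerator = clock_a - clock_b
    denominator = clock_a * (1 - 1 / cores_b) - clock_b * (1 - 1 / cores_a)
    return numerator / denominator


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9, 1.0):
        ten_core = relative_throughput(10, 3.0, p)
        quad_core = relative_throughput(4, 4.0, p)
        print(f"parallel={p:.0%}  10-core={ten_core:.2f}  4-core={quad_core:.2f}")
    print(f"break-even at about {break_even_fraction(10, 3.0, 4, 4.0):.0%} parallel")
```

Under these assumptions the quad-core stays ahead until roughly three quarters of the work runs in parallel; past that point the 10-core part pulls away.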
You can pick up hex core boxes dirt cheap on eBay now. Look for the X5650 or W3680 Xeon boxes.
Only the State obtains its revenue by coercion. - Murray Rothbard
I was under the (admittedly vague) impression that was true only if the thread was using floating point.
CPUs that offer more cores and/or threads than they do FPUs is one of the reasons I write a lot of my multi-threaded stuff (image and baseband RF processing) utilizing appropriately scaled integer math.
I have 8 cores with 8 FPUs on my desk, but many of my users are stuck with some of the wheezier I5 variants.
I've fallen off your lawn, and I can't get up.
Both of which have such slow single threaded performance that most people are better off with an i3.
I have a W3680 box that runs 8 VDI boxes through Citrix XenDesktop, and the performance is pretty good for that purpose, but for gaming it's 25-30% slower than my dual core i3 4160 in my cheapy shit gaming box.
C'mon Intel, everyone knows that 400 thread count is the minimum needed for a good night's sleep.
Good question - it might end up clamped by other things, but at least up to -j 8, it scales decently well on a current CPU. It's still a big win over -j 4, say.
Compilation has a lot of non-local data access that pays a heavy price for cache misses, so it's not so much memory bandwidth sensitive as it seems. During those cache miss latency periods, other cores can still be doing something.
It might be that 20 is too much to hope for, I don't know. I'm pretty sure it'll scale beyond 8 though.
Assuming Intel doesn't go Xeon-scale in pricing for this CPU (who am I kidding, of course they will) I wonder how AMD plans to respond to this.
For now, they've got the consoles holding them afloat. And while I am an AMD fan, I see they are rapidly losing out on the desktop space when it comes to performance (despite both companies having rather meager performance gains for the past several years.)
They'd better figure out what the fuck they're doing, and come up with some competing responses, quickly. Hell, I've got ideas for them, all involving that HBM tech.
1. Use a modified version of that HBM tech to stack their CPU cores and load it up with tons of cache memory (for their non-APU line.) And don't forget to drop a process node, for fuck's sake.
2. Use modified HBM tech to create stacked CPU/GPU/RAM/CACHE on the same die (for their APU line.)
3. Use modified HBM to create stacked single-die CrossFire GPUs that don't consume gobs of power (GPU line.)
4. Use modified HBM tech to create a true monolithic SOC package that integrates EVERYTHING, thus eliminating the need for motherboards - at that point in time, it just becomes a breakout board with a socket. They could probably do away with the interposer as well if they were clever enough in the design.
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
As a side benefit, "appropriately scaled integer math" can be better in other ways too, depending on the nature of your problem. It doesn't pile all its precision up right near zero, but gives you an even distribution of the precision across the range. That might (or might not) be a much better fit for your problem, depending on what you're doing.
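A minimal sketch of what "appropriately scaled integer math" can look like, and of the precision point above. The Q16.16 format (16 fractional bits) and all function names are assumptions picked for illustration, not what the parent poster actually uses. `math.ulp` needs Python 3.9 or later.

```python
import math

FRACTION_BITS = 16            # hypothetical Q16.16 format, chosen for illustration
SCALE = 1 << FRACTION_BITS    # 65536 integer steps per unit


def to_fixed(x: float) -> int:
    """Convert a real value to a scaled integer (round to nearest)."""
    return round(x * SCALE)


def to_float(a: int) -> float:
    """Convert a scaled integer back to a real value."""
    return a / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two scaled integers, then rescale the product."""
    return (a * b) >> FRACTION_BITS


def fixed_step() -> float:
    """Gap between adjacent fixed-point values: the same across the whole range."""
    return 1 / SCALE


def float_step(x: float) -> float:
    """Gap between x and the next larger double: grows with magnitude."""
    return math.ulp(x)


if __name__ == "__main__":
    print(to_float(fixed_mul(to_fixed(1.5), to_fixed(2.0))))
    print(fixed_step(), float_step(0.001), float_step(1000.0))
```

The fixed-point step is 1/65536 everywhere, while the floating-point step is tiny near zero and far coarser at large magnitudes, which is the "piles its precision up near zero" effect.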
Apparently the last time you checked was in 2004. Hyper threading gets you a lot more than a 10% gain on modern CPUs.
Imagine a Beowulf Cluster of these!
Chas - The one, the only.
THANK GOD!!!
Thank you for sharing. I was lost without your insightful comment!
"Hi. We're from Intel, and we'd like to take a look at your multithreading, such as it is."
You think your CPU isn't good enough? I'm still using a Core 2 Duo here!
It's not that straight-forward. Depends on how much time the threads spend in the cache, and how much time they spend waiting on the FPU.
I've fallen off your lawn, and I can't get up.
Yep. Although I have to say, I really enjoy working with the domain 0.0-1.0; there are so many neat tricks that can be pulled. You can do them in integer too, but there is hoop-jumping involved.
I've fallen off your lawn, and I can't get up.
20-thread is more normally seen as "20 TPI", or threads per inch. It is a standard of thread pitch for bolts. https://en.wikipedia.org/wiki/Screw_thread#Lead.2C_pitch.2C_and_starts
This is all well and good but I have to wonder: is this thing still optimized for single-threaded performance??
How much does it really matter anymore? For 99% of the population, any top end computer built in the last 5+ years is so damn fast, it will be fine for the next 10 years. Unless we see the 100x (or whatever) increase with quantum computing, these small incremental improvements are fairly pointless.
If you could reason with religious people, there would be no religious people
Is it really 10 cores / 20 threads for real? Or will it reduce the core frequency even further, by like 50%, if you try to use all of them at once for more than a second, to stay within the TDP envelope and not melt?
Actually, parallel builds barely touch the storage subsystem. Everything is basically cached in ram and writes to files wind up being aggregated into relatively small bursts. So the drives are generally almost entirely idle the whole time.
It's almost a pure-cpu exercise and also does a pretty good job testing concurrency within the kernel due to the fork/exec/run/exit load (particularly for Makefile-based builds which use /bin/sh a lot). I've seen fork/exec rates in excess of 5000 forks/sec during poudriere runs, for example.
-Matt
You likely have not checked for a while. I saw figures of 120% performance ("each core at 60% performance" as you put it) back under the Pentium 4 HT, 140% under Nehalem/Sandy Bridge, and 150% under Haswell.
Don't worry. Windows will bloat to fill the additional cores.
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
Hyperthreading on Intel gives about a +30 to +50% performance improvement. So each core winds up being about 1.3 to 1.5 times the performance with two threads versus 1.0 with one. Quite significant. It depends on the type of load, of course.
The main reason for the improvement is of course due to one thread being able to make good use of execution units while the other thread is stalled on something (like memory or TLB, significant integer shifts, or dependent Integer or FPU multiply and divide operations).
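As a back-of-the-envelope sketch of those figures applied to a 10-core part: the +30% to +50% uplift comes from the comment above, while the function names and the idea of counting "core equivalents" are assumptions made for illustration.

```python
def effective_cores(physical_cores: int, ht_gain: float) -> float:
    """Total throughput in single-thread 'core equivalents' with both
    hardware threads of every core busy. ht_gain is the fractional
    uplift from Hyper-Threading, e.g. 0.30 for +30%."""
    return physical_cores * (1.0 + ht_gain)


def per_thread_speed(ht_gain: float) -> float:
    """Speed of each of the two threads sharing a core, relative to one
    thread that has the core to itself."""
    return (1.0 + ht_gain) / 2


if __name__ == "__main__":
    for gain in (0.30, 0.50):
        print(f"+{gain:.0%}: 10 cores act like {effective_cores(10, gain):.1f}, "
              f"each thread runs at {per_thread_speed(gain):.0%}")
```

So 20 threads behave more like 13 to 15 full cores than 20, with each thread running at roughly 65% to 75% of what it would get alone on the core.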
-Matt
Why is Intel introducing a new Broadwell processor? Why not Skylake?
Broadwell was a "Tick". Skylake is the improvement called "Tock".
If you run at stock speed, you are right.
If you have the right motherboard and the right CPU, you can get some pretty good overclocks on these CPUs (if you are into overclocking). I've got a few X5650s set up at home. I can get 4.2 GHz on one and 4.6 GHz @1.34 volts on the other 24/7. With 6 cores (12 threads w/ HT), water cooling is almost a requirement at these speeds (unless you are really lucky with the CPU). Max temps for both are ~75 C in a 20 C room. I got each CPU on eBay for $60 and I already had the MBs.
These days the most expensive part could be the X58 motherboard unless you already have one. Some of them are selling for more used than their new MSRP. One drawback is most X58 boards don't have USB3 or SATA3. You could always get PCIe cards for those, but the bill of material (BOM) would increase.
If anyone is interested, there's an Anandtech forum thread that has a lot of information on it: http://forums.anandtech.com/sh...
Not only will this thing be great for those of us who love to run PovRay and virtual machines, but hopefully we're well on our way to software designers actually getting comfortable with parallelizing workloads.
It seems like most major software packages finally support 64-bit processing (shaking my head at ArcMap), and even video games are being dragged along by the new wave of consoles. Maybe they'll be more agile in the future.
Do modern, development-focused CS degree programs talk about multiprocessing?
Urm. And you've investigated this and found that your drive is pegged because? Of What? Or you haven't investigated this and you have no idea why your drive is pegged. I'll take a guess... you are running out of memory and the disk activity you see is heavy paging.
Let me rephrase... we do bulk builds with poudriere of 20,000 applications. It takes a bit less than two days. We set the parallelism to roughly 2x the number of cpu threads available. There are usually several hundred processes active in various states at any given moment. The cpu load is pegged. Disk activity is zero for most of the time.
If I do something less strenuous, like a buildworld or buildkernel, almost the same result. Cpu is mostly pegged, disk activity is zero for the roughly 30 minutes the buildworld takes. However, smaller builds such as a buildworld or buildkernel, or a linux kernel build, regardless of the -j concurrency you specify, will certainly have bottlenecks in the build subsystem that have nothing to do with the cpu. A little work on the Makefiles will solve that problem. In our case there are always two or three ridiculously huge source files in the GCC build that the Make has to wait for before it can proceed with the link pass. Similarly with a kernel build there is a make depend step at the beginning which is not parallelized and the final link at the end which cannot be parallelized which actually take most of the time. Compiling the sources in the middle finishes in a flash.
But your problem sounds a bit different... kinda sounds like you are running yourself out of memory. Parallel builds can run machines out of memory if the dev specifies more concurrency than his memory can handle. For example, when building packages there are many C++ source files which #include the kitchen sink and wind up with process run sizes north of 1GB. If someone only has 8GB of ram and tries a -j 8 build under those circumstances, that person will run out of memory and start to page heavily.
So it's a good idea to look at the footprint of the individual processes you are trying to parallelize, too.
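One way to turn that advice into a number is to cap the -j value by memory as well as by CPU threads. This is only a sketch: the 1 GB reserve for the OS and the 1.2 GB per-compile footprint are assumptions for illustration (the comment above only says "north of 1GB"), and `max_parallel_jobs` is a made-up helper, not part of make or poudriere.

```python
import os
from typing import Optional


def max_parallel_jobs(ram_gb: float, per_job_gb: float,
                      hw_threads: Optional[int] = None,
                      reserve_gb: float = 1.0) -> int:
    """Largest job count that fits in RAM, capped by the hardware thread count.

    ram_gb      total physical memory
    per_job_gb  peak resident size of one compile process (measure it!)
    hw_threads  CPU threads to use; defaults to what the OS reports
    reserve_gb  memory held back for the OS and buffer cache (assumed value)
    """
    if hw_threads is None:
        hw_threads = os.cpu_count() or 1
    usable_gb = max(0.0, ram_gb - reserve_gb)
    jobs_by_memory = int(usable_gb // per_job_gb)
    return max(1, min(hw_threads, jobs_by_memory))


if __name__ == "__main__":
    # 8 GB box, 8 threads, heavy C++ translation units
    print(max_parallel_jobs(8, 1.2, hw_threads=8))
    # Same CPU with 32 GB: memory is no longer the limit
    print(max_parallel_jobs(32, 1.2, hw_threads=8))
```

With these assumed figures the 8 GB machine should stop at -j 5 rather than -j 8, while the 32 GB machine can use all 8 threads.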
Memory is cheap these days. Buy more. Even those tiny little BRIX one can get these days can hold 32G of ram. For a decent concurrent build on a decent cpu you want 8GB minimum, 16GB is better, or more.
-Matt
NVME SSDs can do 100k random IOPs. That's an I/O completing every 10us. That's plenty fast to keep a 20-thread compiling pipeline busy with data.
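The arithmetic behind that figure, as a tiny sketch. The 100k IOPS number is the one quoted above; the even split across 20 threads is an assumption for illustration, and the function names are made up.

```python
def microseconds_per_io(iops: float) -> float:
    """Average time between I/O completions at a given IOPS rate."""
    return 1_000_000 / iops


def iops_per_thread(iops: float, threads: int) -> float:
    """I/O operations per second available to each thread if the
    device's capacity is shared evenly (an idealized assumption)."""
    return iops / threads


if __name__ == "__main__":
    print(microseconds_per_io(100_000))
    print(iops_per_thread(100_000, 20))
```

That works out to one completion every 10 microseconds on average, or about 5,000 I/Os per second for each of 20 threads.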
Interesting link. My Lenovo box has an X58 motherboard but I don't think the stock BIOS will allow for much tweaking. This was still a huge speed boost from my old Q6600 box.
Only the State obtains its revenue by coercion. - Murray Rothbard
It depends on the task. For double precision FP calculations using MPI multi-processing (e.g. FORTRAN CFD), the extra overhead of the extra cores talking to each other mostly cancels out the gains.
For many many small short-lifetime processes you'll probably do better.
~.~
I'm a peripheral visionary.
Where do you buy your salt water fish? I have a taste for some sea bass, broiled with olives and capers.
You are welcome on my lawn.
"a whopping 25MB of L3 cache" sounds like "a whopping 20MB capacity" for a brand new hard disk in the early 1990's...
Video compression. I have tested faster speed CPUs with fewer cores against slower speed with more cores - more cores won. I'm VERY interested in this but suspect the 8 core overclocked might be the fiscally responsible way to go. I'd use a XEON but you cannot overclock them...
I do have an ESX server but I've never been able to find a good compression appliance to use. Something I could use with say a web front-end to upload and just settings for ffmpeg would rock. Anyone?
Build it, Drive it, Improve it! Hybridz.org
Yes, but why?
Use MPI BETWEEN nodes and OpenMP WITHIN a node. MPI is always going to be slower on a single node than threads are and for HPC type loads it can easily be 100x slower to use processes instead of threads due to communications overhead.
Computer modeling for biotech drug manufacturing is HARD!
This will be nice to pop into a whitebox VMWare ESXi machine. Definitely cheaper than a 2 x 6 core build.
Interesting. I think Intel should do that kind of explaining.
Thanks for the explanation.
"(so Broadwell-E is 6000 like Skylake processors)"
That, to me, seems like Intel being typically Intel. That creates confusion, instead of communicating clearly.
A long time ago, I wanted to order some Intel motherboards. I needed the part numbers. It required 2 hours to get the numbers.
Several years ago, I mentioned an error in the Intel web site to an Intel customer service employee. He said, "Oh, we are re-doing our web site." A year later, I happened to get the same person on the phone. I mentioned the same error. He said the same thing, "Oh, we are re-doing our web site."
I was under the (admittedly vague) impression that was true only if the thread was using floating point.
No, that is the AMD Bulldozer design. Hyper-threading provides no extra CPU power for the additional threads. How much performance you get out of the extra threads depends on how much the threads are forced to stall due to memory access. If they all do integer/FP instructions and memory access that hits the cache, you get only 50% of normal performance out of each hyper thread.
Let me rephrase... we do bulk builds with poudriere of 20,000 applications. It takes a bit less than two days. We set the parallelism to roughly 2x the number of cpu threads available. There are usually several hundred processes active in various states at any given moment. The cpu load is pegged. Disk activity is zero for most of the time.
Have you ever considered using sar to check the amount of minor page faulting going on? It would be interesting to measure the activity between L3 cache and memory, it's possible that memory is thrashing as the CPU scheduler attempts to divide time between logical cores, that are actually the same physical core.
My applications are messaging systems so they aren't transient processes, like a compile. Over 20,000 applications, that is a lot of context switching, and from what you have described the amount of latency introduced by the CPU scheduler when it context switches would be high enough to consider tuning your build process.
In our case there are always two or three ridiculously huge source files in the GCC build that the Make has to wait for before it can proceed with the link pass.
They're probably a good candidate for CPU affinity; however, I don't think you can isolate a CPU at runtime, so unless you bump up its priority the CPU scheduler can still kick it off L3 and the core it is using. The point I'm making here is that if you dedicate a core to these bigger source files, the efficiencies you gain may be enough to allow the link to proceed sooner, if that is important to you.
It would also follow that when your processes start generating I/O to write the objects prior to your link phase the processes would all be vying for IO attention and as each gets priority I am almost certain you would see very high levels of minor page faults as the scheduler tries to balance them. You might find that being *unfair* about which process gets access to CPU resources actually reduces your overall compile time.
So it's a good idea to look at the footprint of the individual processes you are trying to parallelize, too.
Absolutely! Parallelism is great however you really have to know your processes and behaviour of the CPU and IO schedulers to get the most out of them.
Memory is cheap these days. Buy more. Even those tiny little BRIX one can get these days can hold 32G of ram. For a decent concurrent build on a decent cpu you want 8GB minimum, 16GB is better, or more.
Agreed! Even so it's so slow compared to CPU cache and the expense of throwing a process off a core when you do a context switch is worth investigation as you may still be able to yield good gains. Having more memory is great however having a bigger CPU cache is much better as the CPU scheduler has to context switch less often.
I'm actually more excited about the cache on these things than the cores as that has a greater potential for increasing system efficiency than just increasing the cores alone.
My ism, it's full of beliefs.
Since the inception of HT, is there a reason CPU design hasn't advanced to the point of executing 4 threads per core rather than the 2 it always has been? Is it an L3 cache limitation or the law of diminishing returns in performance?
Life is not for the lazy.
This will kick in in Data Centers where you'll be able to run twice as many virtual machines per physical processor
Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.
A quick Google returns many people asking this question and others showing Make having a 20%-33% improvement with hyperthreading. That's a decent improvement.
Hyperthreading doesn't just benefit on memory stalls. Intel Haswell has 8 execution units and can retire 4 instructions per cycle. Any time there is a free execution unit and 4 instructions are not in flight, hyperthreading can schedule on the other thread. Intel put a lot of effort into increasing out of order execution performance, but not all workloads benefit a whole lot from OoO. This is where HT comes in. When a workload cannot fully utilize OoO, much of the CPU core will be idle, and HT allows these idle parts to be utilized. Of course these situations are pathological and each thread has to split some shared resources, which reduces performance a bit. HT is very situational, but works great when it works.
Workload and system balance, mostly.
If you look back several years (2008? earlier?) you'll see some Sun Sparc designs, and some IBM POWER designs, that supported 4 or 8 threads per core. They worked well for very specific workloads and applications.
The Sun Sparc designs with 8 threads per core were mostly tailored for "simple" highly-scalable web servers, where a thread is blocking on I/O most of its time, and a web server could spawn many many threads to support many simultaneous connections. Worked very well for that purpose.
IBM did stuff like that with their POWER architecture for terminal servers and financial transaction processing, where, again, the thread spends most of its time blocking on I/O.
You don't get that so much for Intel x86/x64 systems, because, on the desktop side, frankly, most users don't use 4 cores well, and the few that do aren't doing I/O-blocking tasks, they are doing CPU-bound tasks, video encoding, stuff that hits the SIMD units hard. HT doesn't benefit nearly as well for CPU-bound tasks, and that market is small enough not to be worth the extra architecture/development time. For x64 servers, there is a bit more of a market there, but Intel would much rather serve that market with their high-end Xeon 4-socket systems. 10 cores per CPU, 4 CPUs, you get 40 cores and 80 threads. Oh, and you pay about $4,000 per CPU that way. That also gets you ridiculous amounts of RAM, and better networking support too. Usually you want both of those on your 80-thread server system, anyway.
So I suppose the answer is, basically, it has, but only where it's worthwhile.
Tim
This is my sig. There are many like it but this one is... Oops. Frank, I've got your sig again! Where's mine?
The down side of SMT is that you increase contention on various resources. Cache is the biggest one, as sharing the cache between 4 or 8 threads increases contention a lot. Things like the cache and TLB depend on associativity and you can't arbitrarily scale associativity and still get single-cycle access, so you're going to hit diminishing returns quite quickly. Most of these things also increase power consumption a lot as you try to scale them.
In particular, on a modern system, register renaming takes a lot of the die area, and the more threads you have, the more rename registers you need to get the full throughput.
I am TheRaven on Soylent News
As someone who does -j32 on our existing systems and sees almost linear speedup, I disagree. If you've got a decent amount of RAM, you won't hit the storage at all - everything will be in the buffer cache. Memory bandwidth is not an issue either - compiling generally has good locality of reference and so the cache works well.
I am TheRaven on Soylent News
Even those tiny little BRIX one can get these days can hold 32G of ram.
Do you have a link for what you are talking about? My desktop has 32 GB at home, and it has a pretty shitty MB, but I am not sure what you mean by a BRIX.
I found this on Google:
http://www.gigabyte.us/product...
But as far as I see, they only support 2 SoDIMM (DDR3L), which many of the specs pages list as 2x8GB max, so I don't know if this is what you are talking about.
APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
Is the CPU cooler able to keep up with the CPU doing that much work, or is the CPU forced to throttle back to prevent overheating?
APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
I would love a video conversion appliance. I want to pull movies off the Tivo and autoconvert them to MP4, as well as any movies in my collection in random video formats do the same. That would be very nice to have something that I could throw on the ESX box to do all that work with some minimal configuration.
APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?
I generally buy them from liveaquaria.com. For my aquariums, not for my dinner. :)
I've fallen off your lawn, and I can't get up.
If we're talking about bulk builds, for any language, there is going to be a huge amount of locality of reference that matches well against caches: shared text (RO), lots of shared files (RO), stack use is localized (RW), process data is relatively localized (RW), and file writeouts are independent. Plus any decent scheduler will recognize the batch-like nature of the compile jobs and use relatively large switch ticks. For a bulk build the scheduler doesn't have to be very smart, it just needs to avoid moving processes around between cpus excessively and be somewhat HW cache aware.
Data and stack will be different, but one nice thing about bulk builds is that there is a huge amount of sharing of the text (code) space. Here's an example of a bulk build relatively early in its cycle (so the C++ compiles aren't eating 1GB each like they do later in the cycle when the larger packages are being built):
http://apollo.backplane.com/DF...
Notice that nothing is blocked on storage accesses. The processes are either in a pure run state or are waiting for a child process to exit.
I've never come close to maxing out the memory BW on an Intel system, at least not with bulk builds. I have maxed out the memory BW on opteron systems but even there one still gets an incremental improvement with more cores.
The real bottleneck for something like the above is not the scheduler or the pegged cpus. The real bottleneck is the operating system which is having to deal with hundreds of fork/exec/run/exit sequences per second and often more than a million VM faults per second (across the whole system)... almost all on shared resources BTW, so it isn't an easy nut for the kernel to crack (think of what it means to the kernel to fork/exec/run/exit something like /bin/sh hundreds of times per second across many cpus all at the same time).
Another big issue for the kernel, for concurrent compiles, is the massive number of shared namecache resources which are getting hit all at once, particularly negative cache hits for files which don't exist (think about compiler include path searches).
These issues tend to trump basic memory BW issues. Memory bandwidth can become an issue, but it will mainly be with jobs which are more memory-centric (access memory more and do less processing / execute fewer instructions per memory access due to the nature of the job). Bulk compiles do not fit into that category.
-Matt
Replying to undo incorrect moderation. Sorry!
md5sum
d41d8cd98f00b204e9800998ecf8427e
"it could potentially trail behind the Core i7-6700K, a quad-core Skylake processor clocked at 3.4GHz (base) to 4GHz (Turbo)."
Not by much.
If you want to see the true speed of any CPU, look at the memory speed. Internal multipliers make some steps run faster but the overall effect isn't high enough to justify the cost deltas on the higher-clockrate CPUs. In general the sweetspot is 2-4 steps below the top step.
If you have a proper multitasking operating system it will take advantage of extra processors even if individual programs don't. For that reason I always bias toward more CPUs than higher clockrate when specifying servers for the datacentre, whilst aiming for maximum possible memory speed. (That used to mean trying to keep to one bank on DDR3, but it's a bit easier with LRDIMMs and DDR4.)
We don't run much virtualisation, as the kind of loads being run invariably max out the raw systems so there's no point, however in a virtualised environment "More CPUs" always beats "faster ones" as long as hyperthreading is disabled (if a virtualised box gets assigned a HT "CPU" then it will crawl).
Higher CPU clocking is mainly good for willy-waving, other than in quite specific tasks where you can keep everything important in L1/L2.
"Memory bandwidth can become an issue"
Bandwidth is almost never an issue. _Latency_ is another matter.
There are a lot of tricks and bits to optimise things regarding locality (mainly around row based and lookahead accessing. CPUs aren't the only devices trying to predict what will be read next) and controller optimisation, but the underlying dynamic ram itself hasn't actually improved much over the last 20 years in terms of time between addressing a random cell and getting an answer back from it. The big improvements have been around the number of requests you can make while waiting for that answer instead of being in request-answer lockstep, and there is only so far that can be taken.
_true_ 1GHz ram would have 1ns latency, not 12-30ns - and the reason that L1/L2/L3 is so important is because of that poor response time.
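To see why that gap matters, here is a small sketch converting the latency figures above into core clock cycles. The 12-30 ns range is from the comment; using the 3.0 GHz base clock from the story summary as the core speed is an assumption for illustration, as are the function names.

```python
def cycle_time_ns(clock_ghz: float) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1.0 / clock_ghz


def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Core cycles that pass while waiting on one memory access."""
    return latency_ns * clock_ghz


if __name__ == "__main__":
    print(cycle_time_ns(1.0))          # a 1 GHz clock ticks once per nanosecond
    for latency in (12, 30):
        print(latency, "ns ->", stall_cycles(latency, 3.0), "cycles at 3.0 GHz")
```

At 3.0 GHz a single uncached access costs somewhere between 36 and 90 cycles, which is the time the caches (or a second hardware thread) have to cover.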
I hear it is used to power the new Gillette razor with 6 blades...
It is a complicated subject. Some tasks do not benefit from HT - those whose memory access fits entirely within cache, and who make use of operations that cannot be spread among execution units in a core (or where the pattern of operations is superscalar with a single thread).
Simultaneous multithreading (the non-trademark name for HT) offers benefits in certain situations. First, where the memory access pattern is unpredictable and/or uncachable - it essentially lets one thread keep the core working while the other thread waits on the memory access (this breaks down when the task is purely memory-bound - it can actually hurt performance due to cache fighting). The second is when the two threads use different execution units - one is doing mainly compares and branches, while the other is crunching on floating-point. This puts more of the core to work at once. The third is when the two threads use the same execution units, but the core has multiple copies (a Haswell core has like three integer ALUs).
I'm not surprised it failed with computational chemistry. That's about as memory-bound a task as you can find. You probably would see 0% performance improvement from doubling your actual, physical cores, unless you upgraded the memory controller alongside it.
I cited the anecdotal averages I've seen. Some tasks saw 200% speedup as long ago as Nehalem. Others are actually hurt by HT even under Haswell. It's a complicated feature and the variance is substantial.
"Memory bandwidth can become an issue"
Bandwidth is almost never an issue. _Latency_ is another matter.
Thanks stoatwblr, that's exactly what I was talking about.
There are a lot of tricks and bits to optimise things regarding locality (mainly around row based and lookahead accessing. CPUs aren't the only devices trying to predict what will be read next) and controller optimisation, but the underlying dynamic ram itself hasn't actually improved much over the last 20 years in terms of time between addressing a random cell and getting an answer back from it.
Which is why I'm always trying to tune the amount of application latency so the CPU cycles can be used for actual work. Obviously everyone's workloads are different but it makes sense to provide a bit of a helping hand to the machine (usually so I can go home)
The big improvements have been around the number of requests you can make while waiting for that answer instead of being in request-answer lockstep and there is only so far that can be taken.
That's an interesting development I hadn't heard of. That would have a major impact on reducing application latency, though I think I'll have to get my head around how it will affect CPU scheduler behaviour.
In a similar vein, I've heard of new methods to move CPU scheduler functions to the cpu and arrange the on-chip memory so that it can load tasks physically closer to the actual core processing them so perhaps the two developments are somehow related. I appreciate you completing the picture of what's coming.
In the context of what I'm doing, I'm trying to get more work out of the machine.
_true_ 1GHz ram would have 1ns latency, not 12-30ns - and the reason that L1/L2/L3 is so important is because of that poor response time.
Precisely. If my calculations are correct, it's about the amount of time an instruction on a 1GHz CPU would take. When I see a lot of context switching and minor page faulting, my application is really exposed to that, and the CPUs are doing a whole lot of nothing useful waiting for L3 to write (mainly) to RAM or to fill.
I've seen memory thrashing on production systems; it's never pretty being in a PIR.
My ism, it's full of beliefs.