Intel Flagship Core i7-6950X Broadwell-E To Offer 10-Cores, 20-Threads, 25MB L3 (hothardware.com)

Software needs to catch up by iamacat · 2015-11-15 04:45 · Score: 1

Mainstream programming languages are still sequential by default and the likes of OpenCL are too hard to learn for simple tasks. UI code is still single threaded in most systems, and that drags most computation into that thread as well through programmer laziness. It's time for languages which are parallel by default and where ability to parallelize a loop is verifiable at compile time. Yes, I know FORTRAN is much closer to that than C/Java, but that's due to being primitive to a degree that will not fly in 2015.

Re:Software needs to catch up by buchner.johannes · 2015-11-15 05:04 · Score: 1

Julia and Rust have some intriguing parallelisation mechanisms.
What I would like to write is code that has a dependency graph (or the compiler figures out the dependencies and parallelises by itself).
In the meantime I simply write code for 1 processor, and run that on many data sets in parallel (using make or doit).
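As a rough illustration of the dependency-graph idea, here is a minimal Java sketch using CompletableFuture from the standard library. The names (DependencyGraphSketch, loadA, loadB, combine) are made up for the example, and the graph is still spelled out by hand here; nothing is inferred by a compiler.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical example: all names here are illustrative only.
class DependencyGraphSketch {

    // Two independent steps; neither depends on the other.
    static int loadA() { return 20; }
    static int loadB() { return 22; }

    // This step depends on both results.
    static int combine(int a, int b) { return a + b; }

    static int run() {
        // The two independent steps may run concurrently on the common pool.
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(DependencyGraphSketch::loadA);
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(DependencyGraphSketch::loadB);
        // The dependent step is scheduled only once both inputs are ready.
        return a.thenCombine(b, DependencyGraphSketch::combine).join();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 42
    }
}
```

The runtime only knows about the edges you declare, so this is closer to "write the graph yourself" than to automatic parallelisation.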

--
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
Re:Software needs to catch up by viperidaenz · 2015-11-15 06:21 · Score: 1

Java makes it pretty easy to write multi-threaded code.
Re:Software needs to catch up by TeknoHog · 2015-11-15 06:30 · Score: 1

Julia and Rust have some intriguing parallelisation mechanisms. [...] I simply write code for 1 processor, and run that on many data sets in parallel
I haven't found Julia's parallelism very efficient, but maybe it's just my lack of coding skills. Then again my work is rather parallel by nature (independent pixels), so I simply run several processes using shell scripts.
For example, it would be nice if something like map() were always parallelized as it kind of assumes independent data points, but there are still other considerations like memory management. Julia's pmap() seems to have too much overhead to be of any help, especially when the separate processes scale so well.
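To make the map() point concrete, here is a minimal sketch in Java (a stand-in only, not Julia) that runs the same per-pixel map sequentially and in parallel. brighten is a hypothetical operation. Whether the parallel version wins depends on how much work each element does compared with the cost of splitting and merging.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical example: the class, method names and pixel operation are illustrative only.
class ParallelMapSketch {

    // Each pixel is processed independently of every other pixel.
    static int brighten(int pixel) {
        return Math.min(255, pixel + 10);
    }

    static int[] mapSequential(int[] pixels) {
        return IntStream.of(pixels).map(ParallelMapSketch::brighten).toArray();
    }

    static int[] mapParallel(int[] pixels) {
        // Same map; parallel() lets the runtime split the work across the common fork/join pool.
        return IntStream.of(pixels).parallel().map(ParallelMapSketch::brighten).toArray();
    }

    public static void main(String[] args) {
        int[] pixels = {0, 100, 250};
        System.out.println(Arrays.toString(mapParallel(pixels)));
    }
}
```

For tiny inputs like the one in main, the sequential version is almost certainly the faster of the two.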

--
Escher was the first MC and Giger invented the HR department.
Re:Software needs to catch up by iamacat · 2015-11-15 06:30 · Score: 1

Multithreading requires every instance of concurrent execution to be micromanaged by the programmer, leading to a lot of code which is not parallelized in practice. Potential concurrency should be the default case for, say, all for loops and serialization an explicit paradigm that a programmer is aware of. Coupled with strong compile time checking that can detect safe and unsafe code.
Re:Software needs to catch up by Dutch+Gun · 2015-11-15 08:50 · Score: 1

It's not just the programming languages. Most *tasks* of any complexity tend to be highly sequential in nature. There are some rare exceptions, but the notion that a language can just automatically parallelize loops and get some massive speedup is not very feasible, I hate to say. It tends to work best in highly contrived or specialized situations. You have to be running some serious computation in a LOT of loops for that to pay off in any way, or the overhead is simply a non-starter. Moreover, those computations can't have any global side-effects or interaction with non-thread-safe data.
Your example, a UI library, is the antithesis of this, which is why they tend to be single-threaded. A UI system would need so many locks for all the interacting data that it would be largely pointless to try to make it multi-threaded. Every single property on every single widget would need to be atomic, and every method would need to be lockable or re-entrant. That's a hell of a design constraint, and the net result might be to make the performance significantly worse, not better.

--
Irony: Agile development has too much inertia to be abandoned now.
Re:Software needs to catch up by angel'o'sphere · 2015-11-15 09:02 · Score: 1

Basically every Java program I have ever seen is multithreaded. (that only excludes hello world programs etc.)
No idea why you explicitly mentioned Java and C# ... the first one is a particularly bad example.

--
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Re:Software needs to catch up by viperidaenz · 2015-11-15 09:56 · Score: 1

Here's an example from Oracle:
double average = roster
    .parallelStream()
    .filter(p -> p.getGender() == Person.Sex.MALE)
    .mapToInt(Person::getAge)
    .average()
    .getAsDouble();
Lambdas and extensions to the Collections framework have made parallel loops simple.
You can't have the compiler parallelize loops if the methods you call can be overridden. Forcing every method called to be final, just so you can optimize some loop is a little daft. It's also pointless (read: slower) parallelizing a loop that only runs a few iterations. It's something the programmer needs to think of and isn't always something the compiler can tell at compile time.
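For anyone who wants to run the snippet, here is a self-contained sketch. The Person class below is a hypothetical stand-in written for this example, not the class from Oracle's tutorial, and the roster data is invented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example: Person and the roster contents are illustrative only.
class RosterSketch {

    static class Person {
        enum Sex { MALE, FEMALE }

        private final Sex gender;
        private final int age;

        Person(Sex gender, int age) {
            this.gender = gender;
            this.age = age;
        }

        Sex getGender() { return gender; }
        int getAge() { return age; }
    }

    static double averageMaleAge(List<Person> roster) {
        return roster
                .parallelStream()
                .filter(p -> p.getGender() == Person.Sex.MALE)
                .mapToInt(Person::getAge)
                .average()
                .getAsDouble(); // throws NoSuchElementException if nobody matches the filter
    }

    static double demo() {
        List<Person> roster = Arrays.asList(
                new Person(Person.Sex.MALE, 30),
                new Person(Person.Sex.MALE, 40),
                new Person(Person.Sex.FEMALE, 50));
        return averageMaleAge(roster);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 35.0
    }
}
```

With three elements this is exactly the "few iterations" case where going parallel costs more than it saves; swapping parallelStream() for stream() gives the same answer.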
Re:Software needs to catch up by iamacat · 2015-11-15 15:05 · Score: 1

Nothing has to be made final, subclasses just need to obey the contract declared by superclass. This can be accomplished, in the worst case, by making everything synchronized.
Make default for loop potentially parallel and have compiler complain if it can not prove that by either code inspection or, as a last resort, explicit annotation on the loop or methods that it calls. Then introduce an sfor keyword for when you really have to make things sequential.
Re:Software needs to catch up by iamacat · 2015-11-15 15:11 · Score: 2

Really? Every browser window or tab should hang if Javascript in one of them is slow? Loading and decoding an image for one of the icons on screen should prevent the UI from processing touch events? Many of those problems have been solved by important applications ad-hoc, but sane behaviour by default would be great.
Re:Software needs to catch up by Dutch+Gun · 2015-11-15 20:31 · Score: 1

A web browser is somewhat of a unique case, as each tab is more or less equivalent to a separate application - or at least, it should be. Chrome certainly proved it can effectively be done that way, I think. I agree - a single page should not be able to slow down the entire browser - that's terrible design.
When I was talking about UI, I was talking more like .NET's WPF or Qt, perhaps, neither of which are thread-safe because of performance concerns. It's critical for the programmer to do the work of handling any potentially long-running tasks to avoid blocking the UI thread.
All I'm saying is that there isn't a magic bullet here. Multi-threaded programming simply requires careful design and implementation.
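A minimal sketch of the "keep long-running work off the UI thread" pattern, in plain Java. A single-threaded executor stands in for a toolkit's UI thread here; that is an assumption for the example, since WPF and Qt each have their own dispatch mechanisms. slowDecode is a hypothetical slow task.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical example: the executors and names are illustrative stand-ins for a real UI toolkit.
class UiThreadSketch {

    // Hypothetical long-running task, e.g. loading and decoding an icon.
    static String slowDecode() {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "icon";
    }

    static String demo() {
        // Stand-in for the toolkit's single UI thread.
        ExecutorService uiThread = Executors.newSingleThreadExecutor(r -> new Thread(r, "ui"));
        ExecutorService workers = Executors.newFixedThreadPool(2);
        try {
            return CompletableFuture
                    // The slow part runs on a worker, so the "ui" thread stays free meanwhile.
                    .supplyAsync(UiThreadSketch::slowDecode, workers)
                    // Only the cheap final step touches UI state, and it does so on the "ui" thread.
                    .thenApplyAsync(img -> img + " shown on " + Thread.currentThread().getName(), uiThread)
                    .join();
        } finally {
            workers.shutdown();
            uiThread.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The widgets themselves stay single-threaded; the programmer still has to decide which work is slow enough to hand off.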

--
Irony: Agile development has too much inertia to be abandoned now.
Re:Software needs to catch up by Megol · 2015-11-16 02:18 · Score: 1

IMHO there should be 3 basic types of loops: FOR, PAR, SEQ. FOR loops can be parallelized if possible by the compiler, PAR loops have parallel semantics and SEQ loops have sequential semantics.
Then most loops can use the FOR variant but when advantageous the programmer can use PAR or SEQ depending on the situation and/or to help the compiler and improve readability.
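Java has no FOR/PAR/SEQ keywords, but the distinction can be roughly sketched with what the standard library does offer. In this hypothetical example, the "PAR"-style loop has iterations that touch only their own array slot, while the "SEQ"-style loop carries a running value from one iteration to the next.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical example: an approximation of PAR and SEQ loops using existing Java constructs.
class LoopKindsSketch {

    // "PAR"-style: iteration i writes only out[i], so the iterations can run in any order.
    static int[] squaresPar(int n) {
        int[] out = new int[n];
        IntStream.range(0, n).parallel().forEach(i -> out[i] = i * i);
        return out;
    }

    // "SEQ"-style: each iteration needs the running total from the previous one.
    static int[] prefixSumsSeq(int[] in) {
        int[] out = new int[in.length];
        int running = 0;
        for (int i = 0; i < in.length; i++) {
            running += in[i];
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squaresPar(5)));
        System.out.println(Arrays.toString(prefixSumsSeq(new int[] {1, 2, 3})));
    }
}
```

Nothing in the compiler checks that the lambda in squaresPar really is independent per iteration; that check is the part the proposed language feature would have to add.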
Re:Software needs to catch up by viperidaenz · 2015-11-16 08:09 · Score: 1

Slapping synchronized on a method doesn't solve all your multi-threaded problems.
The simplest example is probably a Hashtable, all methods are synchronized.
If you want to replace a value in the map, calling get, then put is effectively a read-modify-write; that operation needs to be protected by enclosing it all in a synchronized block.
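A minimal sketch of that read-modify-write problem, with hypothetical names (HashtableRmwSketch, incrementRacy, incrementLocked). Hashtable's synchronized methods lock on the table object itself, so synchronizing on the table around both calls uses the same monitor.

```java
import java.util.Hashtable;

// Hypothetical example: class and method names are illustrative only.
class HashtableRmwSketch {

    // Racy: get and put are each synchronized, but another thread can run between them.
    static void incrementRacy(Hashtable<String, Integer> table, String key) {
        Integer old = table.get(key);
        table.put(key, old == null ? 1 : old + 1);
    }

    // Safe: the table's own monitor is held across the whole read-modify-write.
    static void incrementLocked(Hashtable<String, Integer> table, String key) {
        synchronized (table) {
            Integer old = table.get(key);
            table.put(key, old == null ? 1 : old + 1);
        }
    }

    // Four threads, 10,000 locked increments each.
    static int demo() {
        Hashtable<String, Integer> table = new Hashtable<>();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    incrementLocked(table, "hits");
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return table.get("hits");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 40000
    }
}
```

Swapping incrementLocked for incrementRacy in demo will usually lose some updates, though how many varies from run to run.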
You'll also risk deadlocks.
If Object A's synchronized method calls Object B's synchronized method and vice versa, having two threads calling both those methods risks a deadlock.
If your parallel loop is:
ObjectA.method();
ObjectB.method();
You're screwed. It doesn't need to be that simple, and if the methods aren't final, someone else can come along and easily introduce a race condition.
The only way to protect that code is to wrap both calls in two synchronized blocks, acquiring first ObjectA's lock and then ObjectB's. That then makes your parallel loop serial, due to synchronization. Congratulations, you've used a bunch of memory and CPU creating new threads and a whole lot of lock contention and made your loop slower, while wasting resources on multiple CPUs.
If you don't always acquire locks in the same order in all threads, you risk deadlock.
Synchronization is also very slow when there is contention.
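A minimal sketch of the lock-ordering rule, using two plain lock objects as hypothetical stand-ins for ObjectA and ObjectB. Every code path takes lockA before lockB, so no thread can hold B while waiting for A.

```java
// Hypothetical example: lockA and lockB stand in for the monitors of ObjectA and ObjectB.
class LockOrderSketch {

    private final Object lockA = new Object();
    private final Object lockB = new Object();
    private int a;
    private int b;

    // Always A first, then B.
    void updateBoth() {
        synchronized (lockA) {
            synchronized (lockB) {
                a++;
                b++;
            }
        }
    }

    // Same order here too, even though this method only reads.
    int total() {
        synchronized (lockA) {
            synchronized (lockB) {
                return a + b;
            }
        }
    }

    // Four threads, 1,000 updates each; both counters end at 4,000.
    static int demo() {
        LockOrderSketch shared = new LockOrderSketch();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1_000; j++) {
                    shared.updateBoth();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return shared.total();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 8000
    }
}
```

It is correct, and it is also effectively serial: only one thread is ever inside updateBoth at a time.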
Re:Software needs to catch up by Bengie · 2015-11-16 13:17 · Score: 1

My first non-homework program ever was threaded, written in C# 2.0, and was for my first job. Threading is pretty easy. With .NET 4, and even more so with 4.5, they took a lot of the boring parts and let me simply connect my parts together like pre-made Legos.

I can understand how threading can be hard for systems that are very latency sensitive like 3D games, but anything that just needs scaling and throughput, threading is brain dead easy.

Re: 20 cores DOES matter by iamacat · 2015-11-15 04:48 · Score: 1

How is your SSD speed and thermal envelope doing with make -j20?

Wrong specs on Skylake by SeeManRun · 2015-11-15 04:49 · Score: 4, Informative

I have been seeing this a lot lately for some reason. The i7 6700K runs at a 4.0 GHz base clock and turbos up to 4.2. So it will be quite a performance beatdown by Skylake if clock speed instead of threads is important.

Re: Wrong specs on Skylake by Nemyst · 2015-11-15 05:16 · Score: 1

Yeah, which is pointless because comparing clock speeds between different manufacturers has never meant anything. Your A10 still tanks against most Intel processors in any but the most parallel of use cases. Oh, and you do need a separate video card if you actually care about having a GPU. APUs are still laughably underpowered.
Re: Wrong specs on Skylake by alvinrod · 2015-11-15 05:47 · Score: 1

The current AMD architecture has crummy IPC (instructions per clock) compared to Intel so you'd likely have to get it running close to 5.5 GHz for it to have competitive performance and the FP units are shared between each module's cores, which means for non-integer workloads you really have 6 cores.

However, for what AMD is selling it for, it's a better deal than the comparatively priced Intel chip, especially if you're doing anything that can make use of all of those cores.
Re: Wrong specs on Skylake by Blaskowicz · 2015-11-15 06:15 · Score: 1

Underpowered, but for most people that would be the most powerful GPU they ever had on a PC.
You need more for games but we're long past diminishing returns and with passing times you get to use uglier versions of Windows, also games are all tied to "app stores" like Steam and others so you can't sell or lend them and the store spies on you. Boring.
I can see myself upgrading to the next generation of APUs (better CPU, ddr4 support, supported by the new free driver + non-free blob architecture for linux) and after that there would be no need to upgrade ever.
Re: Wrong specs on Skylake by FlyHelicopters · 2015-11-15 11:45 · Score: 1

5.5? My 7850k replaced an Intel P4 @3.1. At 5Ghz I would need liquid nitrogen cooling. Not worth it imo. I'll take a high end xeon over any skylake or other I7 flagship any day all day. Thanks for pushing my envelope and opening my eyes.
You missed the point...
A 2.4GHz Core2Duo would crush that P4 outright, even in single thread work.
A P4 955 EE running at 3.64GHz (dual core) is outright crushed by a 3.3Ghz Core2Duo, sometimes by a factor of 4.
Clock speed is just one measure of performance. A Skylake 6700K is, even at stock speeds, generally faster than anything AMD makes, even overclocked.
Re: Wrong specs on Skylake by One+With+Whisp · 2015-11-16 02:08 · Score: 1

Really, they track my cash-only purchases do they?

Re:20 cores DOES matter by ArchieBunker · 2015-11-15 04:52 · Score: 1

You can pick up hex core boxes dirt cheap on eBay now. Look for the X5650 or W3680 Xeon boxes.

--
Only the State obtains its revenue by coercion. - Murray Rothbard

Re:20 cores DOES matter by fyngyrz · 2015-11-15 05:02 · Score: 4, Interesting

I was under the (admittedly vague) impression that was true only if the thread was using floating point.

CPUs that offer more cores and/or threads than they do FPUs is one of the reasons I write a lot of my multi-threaded stuff (image and baseband RF processing) utilizing appropriately scaled integer math.

I have 8 cores with 8 FPUs on my desk, but many of my users are stuck with some of the wheezier I5 variants.

--
I've fallen off your lawn, and I can't get up.

Re:20 cores DOES matter by Anonymous Coward · 2015-11-15 05:03 · Score: 1

Both of which have such slow single threaded performance that most people are better off with an i3.

I have a W3680 box that runs 8 VDI boxes through citrix xendesktop and the performance is pretty good for that purpose but for gaming it's 25-30% slower than my dual core i3 4160 in my cheapy shit gaming box.

400 Thread Count by Anonymous Coward · 2015-11-15 05:05 · Score: 1

C'mon Intel, everyone knows that 400 thread count is the minimum needed for a good night's sleep.

Re:400 Thread Count by U2xhc2hkb3QgU3Vja3M · 2015-11-15 05:25 · Score: 1

Nope, 400 still isn't enough. A lot of us are still waiting for a CPU with over 9000 threads.
Re:400 Thread Count by KGIII · 2015-11-15 05:33 · Score: 1

Pfft... 1500 count Egyptian, or I'm going home!
(No, not really. I don't actually know what the thread count is. One, I'm in a hotel STILL and, two, I don't actually buy my own bedding at home.)

--
"So long and thanks for all the fish."
Re:400 Thread Count by U2xhc2hkb3QgU3Vja3M · 2015-11-15 12:14 · Score: 1

That wasn't a supercomputer reference, it was a DBZ reference.

Re: 20 cores DOES matter by Anonymous Coward · 2015-11-15 05:08 · Score: 1

Good question - it might end up clamped by other things, but at least up to -j 8, it scales decently well on a current CPU. It's still a big win over -j 4, say.

Compilation has a lot of non-local data access that pays a heavy price for cache misses, so it's not so much memory bandwidth sensitive as it seems. During those cache miss latency periods, other cores can still be doing something.

It might be that 20 is too much to hope for, I don't know. I'm pretty sure it'll scale beyond 8 though.

AMD's response? by Khyber · 2015-11-15 05:08 · Score: 3, Interesting

Assuming Intel doesn't go Xeon-scale in pricing for this CPU (who am I kidding, of course they will) I wonder how AMD plans to respond to this.

For now, they've got the consoles holding them afloat. And while I am an AMD fan, I see they are rapidly losing out on the desktop space when it comes to performance (despite both companies having rather meager performance gains for the past several years.)

They'd better figure out what the fuck they're doing, and come up with some competing responses, quickly. Hell, I've got ideas for them, all involving that HBM tech.

1. Use a modified version of that HBM tech to stack their CPU cores and load it up with tons of cache memory (for their non-APU line.) And don't forget to drop a process node, for fuck's sake.
2. Use modified HBM tech to create stacked CPU/GPU/RAM/CACHE on the same die (for their APU line.)
3. Use modified HBM to create stacked single-die CrossFire GPUs that don't consume gobs of power (GPU line.)
4. Use modified HBM tech to create a true monolithic SOC package that integrates EVERYTHING, thus eliminating the need for motherboards - at that point in time, it just becomes a breakout board with a socket. They could probably do away with the interposer as well if they were clever enough in the design.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.

Re:AMD's response? by Nemyst · 2015-11-15 05:37 · Score: 4, Informative

HBM only works for stacking memory (hence why it's called High Bandwidth Memory). You can't stack CPU cores because they output waaaaaay too much heat. You can dissipate heat from memory passively, so stacking them and slapping an active cooler can work. Good luck stacking CPU cores in the same way.
Re:AMD's response? by Khyber · 2015-11-15 05:52 · Score: 1

"You can't stack CPU cores because they output waaaaaay too much heat."
Microfluidic cooling to an IHS. Easy-peasy.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Re:AMD's response? by Kjella · 2015-11-15 06:01 · Score: 1

Assuming Intel doesn't go Xeon-scale in pricing for this CPU (who am I kidding, of course they will) I wonder how AMD plans to respond to this.
Fighting the battles they can win, or are at least less likely to lose. This is a halo product of a server line of chips and getting Opterons back in the data center takes more time for validation and convincing conservative enterprises than AMD has. Zen will launch to compete with Intel's mainstream dual/quad-core chips, even if it pulls off a miracle I'm guessing it'd take at least a year or two until AMD is back to a full top-to-bottom stack.

--
Live today, because you never know what tomorrow brings
Re:AMD's response? by gman003 · 2015-11-15 06:06 · Score: 5, Informative

AMD has been developing a new microarchitecture, Zen, which will replace the horribly-designed Bulldozer. It's rumored to be made on a 14nm node, and they re-hired the guy who designed the K10 architecture (aka the last good CPUs AMD made), so I expect it to be reasonably competitive with Intel. I really hope it is, at least.
Your terminology is completely out of whack ("stacked single-die CrossFire GPU" is a phrase with more contradictions than whitespace characters), but I'll analyze what you were trying to say instead of what you actually said:
#1: Current chip-stacking tech doesn't allow for all that much bandwidth between chips, especially when going above two layers. CPU cores need a pretty hefty amount of bandwidth to their cache, so that's already problematic. Stacking dies also limits thermal performance - if you stack two dies, you have 2x the heat in 1x the heat-conducting surface area. For low-power stuff, that's fine, but CPU cores get pretty hot. Many high-performance dies are already performance-constrained by how much heat they can conduct to their cooler.
#2. This is a good idea. Or rather, the good idea is "APU on an interposer using HBM for main memory". You'd need bigger CPU caches - HBM is ridiculously high-latency even by VRAM standards, it will really hurt CPU performance otherwise. And it will limit upgradability - no way to just pop another DIMM of DDR3 in there. But the GPU gains should be worth it.
#3. Again, thermals will absolutely prevent you from stacking GPU dies. HBM and stacking doesn't do ANYTHING for the power efficiency of the chips you're stacking, so that's two 100W+ dies on top of each other. Not gonna happen. You could stack them side-by-side on an interposer, but at that point why not just fabricate them as one die?
#4. The cost of an interposer is significantly greater than that of a printed circuit board, and a lot of stuff won't benefit from the greater bandwidth to the CPU - stuff like a USB controller or audio chipset. Stacking the dies is also more expensive than just using a PCB - it's done in phones where space is REALLY constrained, but even the smallest desktops aren't that tight for space yet. So all that's left is putting everything onto one die - which runs into yield problems, because with bigger individual dies, a single defect will wipe out a lot more silicon. AMD actually *is* already doing this with their lowest-end laptop/desktop parts - look at Socket AM1, there's not much on the motherboard besides external connectors and power-delivery circuits. But they're also pretty low-end in performance.
Re:AMD's response? by Anonymous Coward · 2015-11-15 06:15 · Score: 1

And how many microfluidic systems are available on the consumer market? Oh right, None of importance. IBM is doing its damnedest with Microfluidics, and even Big Blue has problems making Microfluidics in bulk. Not to mention that attempting it would massively cut initial yields, because you're combining multiple bleeding edge pieces of tech, and hoping that you can contract that out to someone else to make it. Is it workable, yes. But could you make a million of them tomorrow, or even next year? Nope. AMD had trouble with traditional liquid cooling, even when using 3rd party supplied cooling systems. What exactly makes you think any of them are qualified to design Microfluidics, or that they could create good masks for it? Not to mention that a Process shrink is itself problematic. And at the lowest process sizes, the transistors themselves must become 3D (FinFET), making it even harder to stack them, while admittedly improving cooling in theory, as more surface area is available.
But AMD and GlobalFoundries have not had too many big advances recently of their own. You're asking for no less than 3 major innovations in a single chip, and mass manufacturing of same.
Re:AMD's response? by gman003 · 2015-11-15 06:21 · Score: 3, Insightful

Being able to describe something in five words does not make it easy.
Re:AMD's response? by Khyber · 2015-11-15 08:58 · Score: 1

It's that easy as we're doing it to make large linear high-power LEDs run very cool.
Protip for AMD: Start with a thicker wafer, do the underside with your required cooling channels, do the topside with your typical litho. You CAN wick heat from the underside.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Re:AMD's response? by Khyber · 2015-11-15 09:18 · Score: 2

We have microfluidics for stacking dies and removing heat. We do it on p-n junctions on some of the latest LEDs (which are fucking MASSIVE at nearly 7mm x 7mm on just the die alone, not including any mount, circuitry, etc.) to keep them very cool.
I don't speak of ideas unless I already know we've got the technology to handle it.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Re:AMD's response? by gman003 · 2015-11-15 09:32 · Score: 1

49mm^2 is "massive"? A high-end processor is 500-600mm^2. And even if microfluidics works to remove heat (how do you have a layer with both enough fluid channels to cool, and enough TSVs for communication?), that will increase your cost substantially. I would expect $1K+ for a quad-core CPU under this kind of design.
Re:AMD's response? by dbIII · 2015-11-15 15:56 · Score: 1

Low end desktops and high end cluster computing are keeping AMD going. It's not as if you can put four of these 3GHz 20 thread beasts on a board, and the Xeons that can do that are both slower and cost a fortune compared with four-way AMD CPUs.
Not very long ago I got a quote for an 80 thread Intel machine (4 Xeons) which turned out to be slightly less than ten times the price for a 64 core AMD machine with the same clock speed, memory capacity, disks etc. The Xeon machine would perform better than a single AMD one, but not better than nine of them! That's an example of why AMD are still selling a lot of gear at high performance computing end of town.

Re:20 cores DOES matter by Anonymous Coward · 2015-11-15 05:16 · Score: 1

As a side benefit, "appropriately scaled integer math" can be better in other ways too, depending on the nature of your problem. It doesn't pile all its precision up right near zero, but gives you an even distribution of the precision across the range. That might (or might not) be a much better fit for your problem, depending on what you're doing.
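A small sketch of what "appropriately scaled integer math" can look like, assuming a Q16.16 fixed-point format (16 integer bits, 16 fraction bits). That format is only an assumption for the example, and the helper names are made up. It also shows the precision point: the fixed-point step is the same everywhere, while the gap between adjacent floats grows with magnitude.

```java
// Hypothetical example: Q16.16 is an assumed format; real code would pick the scale to fit its data.
class FixedPointSketch {

    static final int FRACTION_BITS = 16;
    static final int ONE = 1 << FRACTION_BITS; // 1.0 in Q16.16

    static int toFixed(double x) {
        return (int) Math.round(x * ONE);
    }

    static double toDouble(int fixed) {
        return (double) fixed / ONE;
    }

    // Multiply in 64 bits so the intermediate product cannot overflow, then rescale to Q16.16.
    static int mul(int a, int b) {
        return (int) (((long) a * b) >> FRACTION_BITS);
    }

    // Fixed-point resolution: identical across the whole representable range.
    static double fixedStep() {
        return 1.0 / ONE;
    }

    // Float resolution: the distance to the next representable float near x.
    static float floatStepAt(float x) {
        return Math.ulp(x);
    }

    public static void main(String[] args) {
        System.out.println(toDouble(mul(toFixed(1.5), toFixed(2.25)))); // prints 3.375
        System.out.println(fixedStep());
        System.out.println(floatStepAt(1.0f) + " vs " + floatStepAt(1024.0f));
    }
}
```

The trade-off is range: Q16.16 tops out just below 32768, so the scaling has to be chosen with the data in mind.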

Re:20 cores DOES matter by beelsebob · 2015-11-15 05:17 · Score: 2

Apparently the last time you checked was in 2004. Hyper threading gets you a lot more than a 10% gain on modern CPUs.

Into the Wayback Machine Sherman! by Chas · 2015-11-15 05:19 · Score: 1

Imagine a Beowulf Cluster of these!

--

Chas - The one, the only.
THANK GOD!!!

Re:Into the Wayback Machine Sherman! by BarbaraHudson · 2015-11-15 05:30 · Score: 1

Imagine a Beowulf Cluster of these!
No need ... give it a while and people will be saying "64 Cores ought to be enough for anyone." Then again, GPUs passed that count long ago.

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.
Re:Into the Wayback Machine Sherman! by BarbaraHudson · 2015-11-15 16:33 · Score: 1

The nVidia Tesla CUDA GPU cards have thousands of cores.

Q: What is NVIDIA Tesla?
With the world’s first teraflop many-core processor, NVIDIA® Tesla computing solutions enable the necessary transition to energy efficient parallel computing power. With thousands of CUDA cores per processor, Tesla scales to solve the world’s most important computing challenges—quickly and accurately.
One example Tesla K40: 2880 CUDA cores. That's a LOT of cores.

--
"Transparent" is a shit show that trades on every stereotype going. A man in drag is NOT a transsexual.

Re: Sounds like fun by Anonymous Coward · 2015-11-15 05:19 · Score: 1

Thank you for sharing. I was lost without your insightful comment!

Next step: Send consultants to MySQL. by SuricouRaven · 2015-11-15 05:20 · Score: 2

"Hi. We're from Intel, and we'd like to take a look at your multithreading, such as it is."

Re:Next step: Send consultants to MySQL. by TFlan91 · 2015-11-15 09:27 · Score: 1

Waste of time...
Send them to pg, make it even better
Re:Next step: Send consultants to MySQL. by Anonymous Coward · 2015-11-15 10:20 · Score: 1

"Sorry, folks from Intel. We can't accept your help even for free because we would then be more competitive with the products of our parent company."

Re:20 cores DOES matter by U2xhc2hkb3QgU3Vja3M · 2015-11-15 05:24 · Score: 1

Me... I'll have to limp along on my existing 2 core 4 thread CPU.

You think your CPU isn't good enough? I'm still using a Core 2 Duo here!

Memory isn't necessarily the bottleneck. by fyngyrz · 2015-11-15 05:25 · Score: 1

It's not that straight-forward. Depends on how much time the threads spend in the cache, and how much time they spend waiting on the FPU.

--
I've fallen off your lawn, and I can't get up.

Re:20 cores DOES matter by fyngyrz · 2015-11-15 05:31 · Score: 2

Yep. Although I have to say, I really enjoy working with the domain 0.0-1.0; there are so many neat tricks that can be pulled. You can do them in integer too, but there is hoop-jumping involved.

--
I've fallen off your lawn, and I can't get up.

Re:What is a by Anonymous Coward · 2015-11-15 05:32 · Score: 1

20-thread is more normally seen as "20 TPI" or thread per inch. It is a standard of thread pitch for bolts. https://en.wikipedia.org/wiki/Screw_thread#Lead.2C_pitch.2C_and_starts

This is all well and good... by Type44Q · 2015-11-15 05:39 · Score: 1

This is all well and good but I have to wonder: is this thing still optimized for single-threaded performance??

Nice, but... by Brad1138 · 2015-11-15 05:52 · Score: 1

How much does it really matter anymore? For 99% of the population, any top end computer built in the last 5+ years is so damn fast, it will be fine for the next 10 years. Unless we see the 100x (or whatever) increase with quantum computing, these small incremental improvements are fairly pointless.

--
If you could reason with religious people, there would be no religious people

Re:Nice, but... by jeffb+(2.718) · 2015-11-15 06:14 · Score: 3, Funny

If you don't have at least 10 cores, how can you expect to run the ads, tracking software and gratuitous animations required to fully participate in the online society of the late 20-teens?
Re:Nice, but... by balbus000 · 2015-11-17 03:25 · Score: 1

Wow, what kind of utopian future are you predicting where the companies who write tracking software properly support multithreading? ;)

Really by wwalker · 2015-11-15 05:54 · Score: 1

Is it really 10 cores / 20 threads for real? Or will it reduce the core frequency even further, by like 50%, if you try to use all of them at once for more than a second, to stay within the TDP envelope and not melt?

Re:Really by Anonymous Coward · 2015-11-15 09:36 · Score: 1

Wrong socket. LGA2011 v3, not 1150.

Re: 20 cores DOES matter by m.dillon · 2015-11-15 06:02 · Score: 4, Interesting

Actually, parallel builds barely touch the storage subsystem. Everything is basically cached in ram and writes to files wind up being aggregated into relatively small bursts. So the drives are generally almost entirely idle the whole time.

It's almost a pure-cpu exercise and also does a pretty good job testing concurrency within the kernel due to the fork/exec/run/exit load (particularly for Makefile-based builds which use /bin/sh a lot). I've seen fork/exec rates in excess of 5000 forks/sec during poudriere runs, for example.

-Matt

Re:20 cores DOES matter by gman003 · 2015-11-15 06:09 · Score: 3, Informative

You likely have not checked for a while. I saw figures of 120% performance ("each core at 60% performance" as you put it) back under the Pentium 4 HT, 140% under Nehalem/Sandy Bridge, and 150% under Haswell.

Re:Seriously? by Opportunist · 2015-11-15 06:12 · Score: 1

Don't worry. Windows will bloat to fill the additional cores.

--
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.

Re:20 cores DOES matter by m.dillon · 2015-11-15 06:16 · Score: 4, Informative

Hyperthreading on Intel gives about a +30 to +50% performance improvement. So each core winds up being about 1.3 to 1.5 times the performance with two threads versus 1.0 with one. Quite significant. It depends on the type of load, of course.

The main reason for the improvement is of course due to one thread being able to make good use of execution units while the other thread is stalled on something (like memory or TLB, significant integer shifts, or dependent Integer or FPU multiply and divide operations).
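Back-of-the-envelope arithmetic for those figures, as a tiny Java sketch. The helper names are made up, and the 10-core count is borrowed from the chip in the headline. It assumes the gain applies uniformly to every core, which real workloads won't quite do.

```java
// Hypothetical example: simple arithmetic on the 1.3x to 1.5x per-core figures.
class SmtThroughputSketch {

    // Aggregate throughput, measured in "single-thread core equivalents".
    static double coreEquivalents(int cores, double smtFactor) {
        return cores * smtFactor;
    }

    // Average share of one core's single-thread performance that each of its two threads gets.
    static double perThreadShare(double smtFactor) {
        return smtFactor / 2.0;
    }

    public static void main(String[] args) {
        for (double factor : new double[] {1.3, 1.5}) {
            System.out.println(factor + "x per core: "
                    + coreEquivalents(10, factor) + " core equivalents, "
                    + perThreadShare(factor) + " per thread");
        }
    }
}
```

So a 10-core part behaves roughly like 13 to 15 single-threaded cores on a load that suits it, with each individual thread running at about 65% to 75% of what it would get alone on the core.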

-Matt

Intel often communicates poorly. by Futurepower(R) · 2015-11-15 06:20 · Score: 1

Why is Intel introducing a new Broadwell processor? Why not Skylake?

Broadwell was a "Tick". Skylake is the improvement called "Tock".

Re: Intel often communicates poorly. by Fwipp · 2015-11-15 06:35 · Score: 3, Interesting

"Intel has made a habit of launching enthusiast versions of previous-generation processors after it releases a new architecture."
Re:Intel often communicates poorly. by Nemyst · 2015-11-15 11:22 · Score: 1

Yields. When Intel releases a new processor line, yields are still pretty low, especially towards the high end. That's why you have binning and so many different processors - so they can recycle a top-end processor as a mid or high-end processor should parts of it end up subpar (though this is more popular in GPUs these days).

As the line ages, yields improve and they generally iterate over the design in smaller ways to obtain even better efficiency or iron out issues. It's at that point that it becomes very appealing to release an improved flagship series (the -E branding) using the better yields. They're branded the same as the current line (so Broadwell-E is 6000 like Skylake processors) because they're still considered superior.

Re:20 cores DOES matter by SolarAxix · 2015-11-15 06:43 · Score: 1

If you run at stock speed, you are right.

If you have the right motherboard and the right CPU, you can get some pretty good overclock on these CPUs (if you are into overclocking). I've got a few X5650s setup at home. I can get 4.2 GHz on one and 4.6 GHz @1.34 volts on the other 24/7. With 6 cores (12 threads w/ HT), water cooling is almost a requirement at these speeds (unless you are really lucky with the CPU). Max temps for both is ~75 C in a 20 C room. I got each CPU on eBay for $60 apiece and I already had the MBs.

These days the most expensive part could be the X58 motherboard unless you already have one. Some of them are selling for more used than their new MSRP. One drawback is most X58 boards don't have USB3 or SATA3. You could always get PCIe cards for those, but the bill of material (BOM) would increase.

If anyone is interested, there's an Anandtech forum thread that has a lot of information on it: http://forums.anandtech.com/sh...

Awesome by Early+Six+Digit+UID · 2015-11-15 07:04 · Score: 1

Not only will this thing be great for those of us who love to run PovRay and virtual machines, but hopefully we're well on our way to software designers actually getting comfortable with parallelizing workloads.

It seems like most major software packages finally support 64-bit processing (shaking my head at ArcMap), and even video games are being dragged along by the new wave of consoles. Maybe they'll be more agile in the future.

Do modern, development-focused CS degree programs talk about multiprocessing?

169784

Re:Awesome by dbIII · 2015-11-15 16:03 · Score: 1

Do modern, development-focused CS degree programs talk about multiprocessing?
They did in 1990. Oh wait, that subject was run by an electrical engineering department so it may take a while for others to catch up.
Re:Awesome by gl4ss · 2015-11-15 19:37 · Score: 1

You would think that they do, since it's essential to making anything in Java or Android, or in just about everything except JavaScript nowadays.
Anyhow, our multi-threaded parallel programming course was in Ada. It was in Ada so that we could use the language's built-in mechanisms and skip learning anything. I guess there was some rationale, but it was probably the same rationale that was used for teaching microcontroller programming with assembly for some Hitachi POS.

--
world was created 5 seconds before this post as it is.

Re: 20 cores DOES matter by m.dillon · 2015-11-15 07:14 · Score: 4, Informative

Urm. And you've investigated this and found that your drive is pegged because? Of What? Or you haven't investigated this and you have no idea why your drive is pegged. I'll take a guess... you are running out of memory and the disk activity you see is heavy paging.

Let me rephrase... we do bulk builds with poudriere of 20,000 applications. It takes a bit less than two days. We set the parallelism to roughly 2x the number of cpu threads available. There are usually several hundred processes active in various states at any given moment. The cpu load is pegged. Disk activity is zero for most of the time.

If I do something less strenuous, like a buildworld or buildkernel, almost the same result. Cpu is mostly pegged, disk activity is zero for the roughly 30 minutes the buildworld takes. However, smaller builds such as a buildworld or buildkernel, or a linux kernel build, regardless of the -j concurrency you specify, will certainly have bottlenecks in the build subsystem that have nothing to do with the cpu. A little work on the Makefiles will solve that problem. In our case there are always two or three ridiculously huge source files in the GCC build that the Make has to wait for before it can proceed with the link pass. Similarly with a kernel build there is a make depend step at the beginning which is not parallelized and the final link at the end which cannot be parallelized which actually take most of the time. Compiling the sources in the middle finishes in a flash.
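The effect described above, where a non-parallel depend step and final link dominate the wall time, is essentially Amdahl's law. Here is a minimal Python sketch of that bound. The 10% serial fraction is a purely hypothetical figure chosen for illustration, not a measurement of any build mentioned in this thread, and `amdahl_speedup` is a made-up helper name.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work cannot run in parallel."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Serial part runs at full cost; the rest is divided evenly among the workers.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    # Hypothetical build: 10% of single-threaded time is the depend step plus the link.
    for n in (1, 4, 10, 20):
        print(f"{n:2d} workers -> at most {amdahl_speedup(0.10, n):.2f}x")
```

With a 10% serial portion the speedup can never exceed 10x however many cores are added, which is why trimming the serial Makefile steps can matter more than raising -j.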

But your problem sounds a bit different... kinda sounds like you are running yourself out of memory. Parallel builds can run machines out of memory if the dev specifies more concurrency than his memory can handle. For example, when building packages there are many C++ source files which #include the kitchen sink and wind up with process run sizes north of 1GB. If someone only has 8GB of ram and tries a -j 8 build under those circumstances, that person will run out of memory and start to page heavily.

So it's a good idea to look at the footprint of the individual processes you are trying to parallelize, too.
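One way to put the footprint advice into practice is to cap the job count by memory as well as by CPU threads. The Python sketch below is only an illustration under stated assumptions. The 2x oversubscription factor comes from the post above, while the `reserve_gb` headroom for the OS and the helper name `safe_job_count` are my own hypothetical additions.

```python
import os
from typing import Optional


def safe_job_count(total_ram_gb: float, per_job_gb: float,
                   cpu_threads: Optional[int] = None,
                   oversubscribe: float = 2.0,
                   reserve_gb: float = 1.0) -> int:
    """Pick a -j value limited by both CPU threads and available memory."""
    if per_job_gb <= 0:
        raise ValueError("per_job_gb must be positive")
    if cpu_threads is None:
        cpu_threads = os.cpu_count() or 1
    # CPU-side limit: roughly 2x the hardware threads, as in the post above.
    cpu_limit = max(1, int(cpu_threads * oversubscribe))
    # Memory-side limit: how many worst-case jobs fit after reserving headroom
    # (reserve_gb is an assumed figure, not something from the thread).
    mem_limit = max(1, int((total_ram_gb - reserve_gb) // per_job_gb))
    return min(cpu_limit, mem_limit)


if __name__ == "__main__":
    # The 8GB machine with 1GB C++ compiles from the example above.
    print(safe_job_count(total_ram_gb=8, per_job_gb=1.0, cpu_threads=8))
```

For the 8GB, 8-thread case this picks a value below 8, because memory rather than CPU is the binding limit.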

Memory is cheap these days. Buy more. Even those tiny little BRIX one can get these days can hold 32G of ram. For a decent concurrent build on a decent cpu you want 8GB minimum, 16GB is better, or more.

-Matt

Re: 20 cores DOES matter by JoeyRox · 2015-11-15 07:33 · Score: 1

NVME SSDs can do 100k random IOPs. That's an I/O completing every 10us. That's plenty fast to keep a 20-thread compiling pipeline busy with data.
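The arithmetic behind that figure is a simple reciprocal, sketched here in Python. `mean_io_interval_us` is a hypothetical helper name. It gives the average spacing between completions; the latency of any single request can be longer, since many requests are typically in flight at once.

```python
def mean_io_interval_us(iops: float) -> float:
    """Average time between I/O completions, in microseconds, at a given IOPS rate."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return 1_000_000 / iops


if __name__ == "__main__":
    print(mean_io_interval_us(100_000))  # the 100k random IOPS figure quoted above
```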

Re:20 cores DOES matter by ArchieBunker · 2015-11-15 07:51 · Score: 1

Interesting link. My Lenovo box has an X58 motherboard but I don't think the stock BIOS will allow for much tweaking. This was still a huge speed boost from my old Q6600 box.

--
Only the State obtains its revenue by coercion. - Murray Rothbard

Re:20 cores DOES matter by nadaou · 2015-11-15 08:00 · Score: 3, Interesting

It depends on the task. For double precision FP calculations using MPI multi-processing (e.g. FORTRAN CFD), the overhead of the extra cores talking to each other mostly cancels out the gains.

For many many small short-lifetime processes you'll probably do better.

--
~.~
I'm a peripheral visionary.

Re: 20 cores DOES matter by PopeRatzo · 2015-11-15 08:16 · Score: 1

I do have my salt water fish shipped to me in thermal envelopes.

Where do you buy your salt water fish? I have a taste for some sea bass, broiled with olives and capers.

--
You are welcome on my lawn.

"a whopping 25MB of L3 cache"(!) by Anonymous Coward · 2015-11-15 08:31 · Score: 1

"a whopping 25MB of L3 cache" sounds like "a whopping 20Mb capacity" for a brand new hard disk in the early 1990's...

Re:"a whopping 25MB of L3 cache"(!) by MrKaos · 2015-11-16 00:34 · Score: 1

"a whopping 25MB of L3 cache" sounds like "a whopping 20Mb capacity" for a brand new hard disk in the early 1990's...
*Whooosh*

--
My ism, it's full of beliefs.

Re:20 cores DOES matter by BLKMGK · 2015-11-15 09:15 · Score: 1

Video compression. I have tested faster speed CPUs with fewer cores against slower speed with more cores - more cores won. I'm VERY interested in this but suspect the 8 core overclocked might be the fiscally responsible way to go. I'd use a Xeon but you cannot overclock them...

I do have an ESX server but I've never been able to find a good compression appliance to use. Something I could use with say a web front-end to upload and just settings for ffmpeg would rock. Anyone?

--
Build it, Drive it, Improve it! Hybridz.org

What does that say about Intel? by Futurepower(R) · 2015-11-15 09:23 · Score: 1

Yes, but why?

Re:What does that say about Intel? by Kjella · 2015-11-15 10:30 · Score: 1

Yes, but why?
The enthusiast desktop CPUs are a spin-off from the Xeon server CPUs. They spend much longer time validating those than mainstream laptop/desktop CPUs, so on any new architecture/process they're likely to arrive last. The upside and/or downside is that it might still introduce new features or standards like say DDR4 ahead of the consumer CPUs, but sometimes at a high cost.

--
Live today, because you never know what tomorrow brings

Re:20 cores DOES matter by Ambassador+Kosh · 2015-11-15 09:30 · Score: 1

Use MPI BETWEEN nodes and OpenMP WITHIN a node. MPI is always going to be slower on a single node than threads are and for HPC type loads it can easily be 100x slower to use processes instead of threads due to communications overhead.

--
Computer modeling for biotech drug manufacturing is HARD! :)

VMWare whitebox heaven by barc0001 · 2015-11-15 10:21 · Score: 2

This will be nice to pop into a whitebox VMWare ESXi machine. Definitely cheaper than a 2 x 6 core build.

Re:VMWare whitebox heaven by swb · 2015-11-15 11:29 · Score: 1

If only they would pair it with a desktop board that could take 256 GB RAM.
I find that I eat all my disk i/o and RAM way before my cpu.
Re:VMWare whitebox heaven by barc0001 · 2015-11-15 13:55 · Score: 1

Yeah that's part of the problem, but for some of our dev workloads we only use 2GB of RAM per VM but hammer the processor so this is a good niche fit. And Gigabyte's got some workstation boards that go to 64GB but also cost more so it's a trade off - and it's not a sure thing they'll support these chips. Obviously it's not for everyone.
Re:VMWare whitebox heaven by Blaskowicz · 2015-11-16 04:13 · Score: 1

There are desktop boards that support a theoretical 512GB or 768GB memory, if you go with registered ddr4.
Look for the "pro" chipset, C612.
Needs a Xeon E5-1xxx - the leading 1 says it works only in single CPU mode - which is about the same as an i7 anyway.

Interesting. by Futurepower(R) · 2015-11-15 14:03 · Score: 1

Interesting. I think Intel should do that kind of explaining.

Typical Intel confusion by Futurepower(R) · 2015-11-15 14:16 · Score: 1

Thanks for the explanation.

"(so Broadwell-E is 6000 like Skylake processors)"

That, to me, seems like Intel being typically Intel. That creates confusion, instead of communicating clearly.

A long time ago, I wanted to order some Intel motherboards. I needed the part numbers. It required 2 hours to get the numbers.

Several years ago, I mentioned an error in the Intel web site to an Intel customer service employee. He said, "Oh, we are re-doing our web site." A year later, I happened to get the same person on the phone. I mentioned the same error. He said the same thing, "Oh, we are re-doing our web site."

Re:20 cores DOES matter by Carewolf · 2015-11-15 23:18 · Score: 1

I was under the (admittedly vague) impression that was true only if the thread was using floating point.

No, that is the AMD Bulldozer design. Hyper-threading provides no extra CPU power for the additional threads. How much performance you get out of the extra threads depends on how often the threads are forced to stall on memory access; if they all do integer/FP instructions and memory accesses that hit the cache, you get only 50% of normal performance out of each hyper-thread.

Re: 20 cores DOES matter by MrKaos · 2015-11-16 00:30 · Score: 1

Let me rephrase... we do bulk builds with poudriere of 20,000 applications. It takes a bit less than two days. We set the parallelism to roughly 2x the number of cpu threads available. There are usually several hundred processes active in various states at any given moment. The cpu load is pegged. Disk activity is zero for most of the time.

Have you ever considered using sar to check the amount of minor page faulting going on? It would be interesting to measure the activity between L3 cache and memory, it's possible that memory is thrashing as the CPU scheduler attempts to divide time between logical cores, that are actually the same physical core.

My applications are messaging systems so they aren't transient processes, like a compile. Over 20,000 applications that is a lot of context switching, and from what you have described the amount of latency introduced by the CPU scheduler when it context switches would be high enough to consider tuning your build process.

In our case there are always two or three ridiculously huge source files in the GCC build that the Make has to wait for before it can proceed with the link pass.

They're probably a good candidate for CPU affinity; however, I don't think you can isolate a CPU at runtime, so unless you bump up its priority the CPU scheduler can still kick it off L3 and the core it is using. The point I'm making here is that if you dedicate a core to these bigger source files, the efficiencies you gain may be enough to allow the link to proceed sooner, if that is important to you.

It would also follow that when your processes start generating I/O to write the objects prior to your link phase the processes would all be vying for IO attention and as each gets priority I am almost certain you would see very high levels of minor page faults as the scheduler tries to balance them. You might find that being *unfair* about which process gets access to CPU resources actually reduces your overall compile time.

So it's a good idea to look at the footprint of the individual processes you are trying to parallelize, too.

Absolutely! Parallelism is great however you really have to know your processes and behaviour of the CPU and IO schedulers to get the most out of them.

Memory is cheap these days. Buy more. Even those tiny little BRIX one can get these days can hold 32G of ram. For a decent concurrent build on a decent cpu you want 8GB minimum, 16GB is better, or more.

Agreed! Even so it's so slow compared to CPU cache and the expense of throwing a process off a core when you do a context switch is worth investigation as you may still be able to yield good gains. Having more memory is great however having a bigger CPU cache is much better as the CPU scheduler has to context switch less often.

I'm actually more excited about the cache on these things than the cores as that has a greater potential for increasing system efficiency than just increasing the cores alone.

--
My ism, it's full of beliefs.

Re:20 cores DOES matter by DigiShaman · 2015-11-16 00:46 · Score: 1

Since the inception of HT, is there a reason CPU design hasn't advanced to the point of executing 4 threads per core rather than the 2 it always has been? Is it an L3 cache limitation or the law of diminishing returns in performance?

--
Life is not for the lazy.

Re:20 cores DOES matter by Big+Hairy+Ian · 2015-11-16 00:58 · Score: 1

This will kick in in Data Centers where you'll be able to run twice as many virtual machines per physical processor

--

Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.

Re:20 cores DOES matter by Bengie · 2015-11-16 01:20 · Score: 1

A quick Google returns many people asking this question and others showing Make having a 20%-33% improvement with hyperthreading. That's a decent improvement.

Re:20 cores DOES matter by Bengie · 2015-11-16 01:40 · Score: 1

Hyperthreading doesn't just benefit on memory stalls. Intel Haswell has 8 execution units and can retire 4 instructions per cycle. Any time there is a free execution unit and 4 instructions are not in flight, hyperthreading can schedule on the other thread. Intel put a lot of effort into increasing out-of-order execution performance, but not all workloads benefit a whole lot from OoO. This is where HT comes in. When a workload cannot fully utilize OoO, much of the CPU core will be idle, and HT allows these idle parts to be utilized. Of course these situations are pathological and each thread has to split some shared resources, which reduces performance a bit. HT is very situational, but works great when it works.

Re:20 cores DOES matter by TFloore · 2015-11-16 01:40 · Score: 1

Since the inception of HT, is there a reason CPU design hasn't advanced to the point of executing 4 threads per core rather than the 2 it always has been?

Workload and system balance, mostly.

If you look back several years (2008? earlier?) you'll see some Sun Sparc designs, and some IBM POWER designs, that supported 4 or 8 threads per core. They worked well for very specific workloads and applications.

The Sun Sparc designs with 8 threads per core were mostly tailored for "simple" highly-scalable web servers, where a thread is blocking on I/O most of its time, and a web server could spawn many many threads to support many simultaneous connections. Worked very well for that purpose.

IBM did stuff like that with their POWER architecture for terminal servers and financial transaction processing, where, again, the thread spends most of its time blocking on I/O.

You don't get that so much for Intel x86/x64 systems, because, on the desktop side, frankly, most users don't use 4 cores well, and the few that do aren't doing I/O-blocking tasks, they are doing CPU-bound tasks, video encoding, stuff that hits the SIMD units hard. HT doesn't benefit nearly as well for CPU-bound tasks, and that market is small enough not to be worth the extra architecture/development time. For x64 servers, there is a bit more of a market there, but Intel would much rather serve that market with their high-end Xeon 4-socket systems. 10 cores per CPU, 4 CPUs, you get 40 cores and 80 threads. Oh, and you pay about $4,000 per CPU that way. That also gets you ridiculous amounts of RAM, and better networking support too. Usually you want both of those on your 80-thread server system, anyway.

So I suppose the answer is, basically, it has, but only where it's worthwhile.

Tim

--
This is my sig. There are many like it but this one is... Oops. Frank, I've got your sig again! Where's mine?

Re:20 cores DOES matter by TheRaven64 · 2015-11-16 02:26 · Score: 1

SMT addresses two problems. The first is that, when a thread is completely stalled waiting for memory, you can try to run another thread. This doesn't happen that often with superscalar out-of-order cores, as you can also schedule other instructions from the same thread if there are no dependencies. The second is that a fetch granule from one thread may not be able to issue instructions to all of the execution units in a single cycle, giving you more to choose from.

The down side of SMT is that you increase contention on various resources. Cache is the biggest one, as sharing the cache between 4 or 8 threads increases contention a lot. Things like the cache and TLB depend on associativity and you can't arbitrarily scale associativity and still get single-cycle access, so you're going to hit diminishing returns quite quickly. Most of these things also increase power consumption a lot as you try to scale them.

In particular, on a modern system, register renaming takes a lot of the die area, and the more threads you have, the more rename registers you need to get the full throughput.

--
I am TheRaven on Soylent News

Re:20 cores DOES matter by TheRaven64 · 2015-11-16 02:29 · Score: 1

As someone who does -j32 on our existing systems and sees almost linear speedup, I disagree. If you've got a decent amount of RAM, you won't hit the storage at all - everything will be in the buffer cache. Memory bandwidth is not an issue either - compiling generally has good locality of reference and so the cache works well.

--
I am TheRaven on Soylent News

Re: 20 cores DOES matter by Coren22 · 2015-11-16 03:37 · Score: 1

Even those tiny little BRIX one can get these days can hold 32G of ram.

Do you have a link for what you are talking about? My desktop has 32 GB at home, and it has a pretty shitty MB, but I am not sure what you mean by a BRIX.

I found this on Google:
http://www.gigabyte.us/product...

But as far as I see, they only support 2 SoDIMM (DDR3L), which many of the specs pages list as 2x8GB max, so I don't know if this is what you are talking about.

--
APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?

Re: 20 cores DOES matter by Coren22 · 2015-11-16 03:40 · Score: 1

Is the CPU cooler able to keep up with the CPU doing that much work, or is the CPU forced to throttle back to prevent overheating?

--
APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?

Re:20 cores DOES matter by Coren22 · 2015-11-16 03:50 · Score: 1

I would love a video conversion appliance. I want to pull movies off the Tivo and autoconvert them to MP4, and do the same for any movies in my collection in random video formats. It would be very nice to have something that I could throw on the ESX box to do all that work with some minimal configuration.

--
APK likes to ask for responses to the same things over and over. Maybe he just likes the responses?

Something fishy about this by fyngyrz · 2015-11-16 05:14 · Score: 1

I generally buy them from liveaquaria.com. For my aquariums, not for my dinner. :)

--
I've fallen off your lawn, and I can't get up.

Re:Something fishy about this by PopeRatzo · 2015-11-16 05:35 · Score: 1

I generally buy them from liveaquaria.com. For my aquariums, not for my dinner. :)
And they show up alive? I'm always a little amazed when people get live animals shipped to them. I have a friend who's an urban beekeeper who gets live bees fed-exed to them. What a country.

--
You are welcome on my lawn.
Re:Something fishy about this by Muad'Dave · 2015-11-16 05:54 · Score: 1

You can get day-old chicks thru the post office. Your friend's bees are listed at the top of that same page.

--
Tiller's Rule: Never use a word in written form that you've only heard and never read. You will end up looking foolish.
Re:Something fishy about this by fyngyrz · 2015-11-16 11:06 · Score: 1

Yes, they show up alive and doing quite well. They pack them in plastic bags, some of which have black light-shields for the species that are prone to shock from sudden changes in light intensity, all inside said thermal envelope (a Styrofoam cooler, essentially.) They put a heating or cooling chemical packet in there with them, depending on the season, and then ship them overnight by FedEx or UPS. I unpack them immediately upon receipt, gradually acclimate them to the water and temperature they'll be living in, then pop them in the tanks. They have always survived, except when something else in there decides they'd make a good meal, or the corals get into a chemical war, which can be pretty severe and requires re-homing one of the species so engaged.
Where I live (rural Montana), this is about the only practical way to build and maintain a saltwater aquarium. I don't mind. The selection online for live rock, plants, corals, fish and invertebrates -- and gear -- is better than one could ever hope for in a storefront operation. It's a little dear, cost-wise, but I'm old and have a few bucks I can apply to my interests. My oldest tank has been running for about a decade; my most recent, an unusual configuration with a custom sump system I designed located well above the tank, about six months.

--
I've fallen off your lawn, and I can't get up.
Re:Something fishy about this by PopeRatzo · 2015-11-16 11:30 · Score: 1

Very cool.

--
You are welcome on my lawn.

Re: 20 cores DOES matter by m.dillon · 2015-11-16 05:49 · Score: 1

If we're talking about bulk builds, for any language, there is going to be a huge amount of locality of reference that matches well against caches: shared text (RO), lots of shared files (RO), localized stack use (RW), relatively localized process data (RW), and independent file writeouts. Plus any decent scheduler will recognize the batch-like nature of the compile jobs and use relatively large switch ticks. For a bulk build the scheduler doesn't have to be very smart; it just needs to avoid moving processes around between cpus excessively and be somewhat HW cache aware.

Data and stack will be different, but one nice thing about bulk builds is that there is a huge amount of sharing of the text (code) space. Here's an example of a bulk build relatively early in its cycle (so the C++ compiles aren't eating 1GB each like they do later in the cycle when the larger packages are being built):

http://apollo.backplane.com/DF...

Notice that nothing is blocked on storage accesses. The processes are either in a pure run state or are waiting for a child process to exit.

I've never come close to maxing out the memory BW on an Intel system, at least not with bulk builds. I have maxed out the memory BW on opteron systems but even there one still gets an incremental improvement with more cores.

The real bottleneck for something like the above is not the scheduler or the pegged cpus. The real bottleneck is the operating system which is having to deal with hundreds of fork/exec/run/exit sequences per second and often more than a million VM faults per second (across the whole system)... almost all on shared resources BTW, so it isn't an easy nut for the kernel to crack (think of what it means to the kernel to fork/exec/run/exit something like /bin/sh hundreds of times per second across many cpus all at the same time).

Another big issue for the kernel, for concurrent compiles, is the massive number of shared namecache resources which are getting hit all at once, particularly negative cache hits for files which don't exist (think about compiler include path searches).

These issues tend to trump basic memory BW issues. Memory bandwidth can become an issue, but it will mainly be with jobs which are more memory-centric (access memory more and do less processing / execute fewer instructions per memory access due to the nature of the job). Bulk compiles do not fit into that category.

-Matt

Re: 20 cores DOES matter by doublebackslash · 2015-11-16 05:58 · Score: 1

Replying to undo incorrect moderation. Sorry!

--
md5sum /boot/vmlinuz
d41d8cd98f00b204e9800998ecf8427e /boot/vmlinuz

Speed fallacies. by stoatwblr · 2015-11-16 06:13 · Score: 1

"it could potentially trail behind the Core i7-6700K, a quad-core Skylake processor clocked at 3.4GHz (base) to 4GHz (Turbo)."

Not by much.

If you want to see the true speed of any CPU, look at the memory speed. Internal multipliers make some steps run faster but the overall effect isn't high enough to justify the cost deltas on the higher-clockrate CPUs. In general the sweetspot is 2-4 steps below the top step.

If you have a proper multitasking operating system it will take as much advantage of extra processors even if individual programs don't. For that reason I always bias toward more CPUs than higher clockrate when specifying servers for the datacentre, whilst aiming for maximum possible memory speed (That used to mean trying to keep to one bank on DDR3, but it's a bit easier with LRDIMMS and DDR4)

We don't run much virtualisation, as the kind of loads being run invariably max out the raw systems so there's no point, however in a virtualised environment "More CPUs" always beats "faster ones" as long as hyperthreading is disabled (if a virtualised box gets assigned a HT "CPU" then it will crawl).

Higher CPU clocking is mainly good for willy-waving, other than in quite specific tasks where you can keep everything important in L1/L2.

Re: 20 cores DOES matter by stoatwblr · 2015-11-16 06:35 · Score: 1

"Memory bandwidth can become an issue"

Bandwidth is almost never an issue. _Latency_ is another matter.

There are a lot of tricks and bits to optimise things regarding locality (mainly around row based and lookahead accessing. CPUs aren't the only devices trying to predict what will be read next) and controller optimisation, but the underlying dynamic ram itself hasn't actually improved much over the last 20 years in terms of time between addressing a random cell and getting an answer back from it. The big improvements have been around the number of requests you can make while waiting for that answer instead of being in request-answer lockstep and there is only so far that can be taken.

_true_ 1GHz ram would have 1ns latency, not 12-30ns - and the reason that L1/L2/L3 is so important is because of that poor response time.
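To put those latencies in CPU terms, here is a small Python sketch that converts a memory latency into the number of core clock cycles that pass while waiting. The 12-30ns range is from the post above and the 4GHz clock is the Turbo figure quoted earlier for the i7-6700K. `stall_cycles` is a hypothetical helper, and the result is only a rough figure, since out-of-order execution and prefetching can hide part of the wait.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Core clock cycles that elapse during a memory access of `latency_ns`."""
    if latency_ns < 0 or clock_ghz <= 0:
        raise ValueError("latency must be non-negative and clock must be positive")
    # One cycle lasts 1/clock_ghz nanoseconds, so cycles = latency * clock.
    return latency_ns * clock_ghz


if __name__ == "__main__":
    for latency in (1, 12, 30):
        print(f"{latency:2d} ns at 4 GHz ~ {stall_cycles(latency, 4.0):.0f} cycles")
```

At 1GHz a cycle is exactly 1ns, which is the "_true_ 1GHz" case; at multi-GHz core clocks the same DRAM latency costs proportionally more cycles, hence the weight placed on L1/L2/L3.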

Gillette by DarthVain · 2015-11-16 06:55 · Score: 1

I hear it is used to power the new Gillette razor with 6 blades...

Re:20 cores DOES matter by gman003 · 2015-11-16 15:15 · Score: 1

It is a complicated subject. Some tasks do not benefit from HT - those whose memory access fits entirely within cache, and who make use of operations that cannot be spread among execution units in a core (or where the pattern of operations is superscalar with a single thread).

Simultaneous multithreading (the non-trademark name for HT) offers benefits in certain situations. First, where the memory access pattern is unpredictable and/or uncachable - it essentially lets one thread keep the core working while the other thread waits on the memory access (this breaks down when the task is purely memory-bound - it can actually hurt performance due to cache fighting). The second is when the two threads use different execution units - one is doing mainly compares and branches, while the other is crunching on floating-point. This puts more of the core to work at once. The third is when the two threads use the same execution units, but the core has multiple copies (a Haswell core has like three integer ALUs).

I'm not surprised it failed with computational chemistry. That's about as memory-bound a task as you can find. You probably would see 0% performance improvement from doubling your actual, physical cores, unless you upgraded the memory controller alongside it.

I cited the anecdotal averages I've seen. Some tasks saw 200% speedup as long ago as Nehalem. Others are actually hurt by HT even under Haswell. It's a complicated feature and the variance is substantial.

Re: 20 cores DOES matter by MrKaos · 2015-11-17 03:01 · Score: 1

"Memory bandwidth can become an issue"

Bandwidth is almost never an issue. _Latency_ is another matter.

Thanks stoatwblr, that's exactly what I was talking about.

There are a lot of tricks and bits to optimise things regarding locality (mainly around row based and lookahead accessing. CPUs aren't the only devices trying to predict what will be read next) and controller optimisation, but the underlying dynamic ram itself hasn't actually improved much over the last 20 years in terms of time between addressing a random cell and getting an answer back from it.

Which is why I'm always trying to tune the amount of application latency so the CPU cycles can be used for actual work. Obviously everyone's workloads are different but it makes sense to provide a bit of a helping hand to the machine (usually so I can go home)

The big improvements have been around the number of requests you can make while waiting for that answer instead of being in request-answer lockstep and there is only so far that can be taken.

That's an interesting development I hadn't heard of. That would have a major impact on reducing application latency, though I think I'll have to get my head around how it will affect CPU scheduler behaviour.

In a similar vein, I've heard of new methods to move CPU scheduler functions to the cpu and arrange the on-chip memory so that it can load tasks physically closer to the actual core processing them so perhaps the two developments are somehow related. I appreciate you completing the picture of what's coming.

In the context of what I'm doing, I'm trying to get more work out of the machine.

_true_ 1GHz ram would have 1ns latency, not 12-30ns - and the reason that L1/L2/L3 is so important is because of that poor response time.

Precisely, if my calculations are correct, it's about the amount of time an instruction on a 1GHz CPU would take. When I see a lot of context switching and minor page faulting my application is really exposed to that, and the CPUs are doing a whole lot of nothing useful waiting for L3 to write (mainly) to RAM or to fill.

I've seen memory thrashing on production systems; it's never pretty being in a PIR.

--
My ism, it's full of beliefs.
