Intel's Knights Landing — 72 Cores, 3 Teraflops

Posted by Soulskill on Saturday January 4, 2014 @11:07AM from the go-big-or-go-home dept.

New submitter asliarun writes "David Kanter of Realworldtech recently posted his take on Intel's upcoming Knights Landing chip. The technical specs are massive, showing Intel's new-found focus on throughput processing (and possibly graphics). 72 Silvermont cores with beefy FP and vector units, mesh fabric with tile based architecture, DDR4 support with a 384-bit memory controller, QPI connectivity instead of PCIe, and 16GB on-package eDRAM (yes, 16GB). All this should ensure throughput of 3 teraflop/s double precision. Many of the architectural elements would also be the same as Intel's future CPU chips — so this is also a peek into Intel's vision of the future. Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics? Or will this be another Larrabee? Or just an exotic HPC product like Knights Corner?"

Imagine by Konster · 2014-01-04 11:14 · Score: 3, Funny

Imagine a Beowulf cluster of these!
1. Re:Imagine by Anonymous Coward · 2014-01-04 19:06 · Score: 3, Funny
  
  Forget about Linux! With this baby, I can finally run Crysis.
Programmability? by gentryx · 2014-01-04 11:20 · Score: 4, Informative

I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke: to get any serious performance out of the current generation of MICs you have to wrestle with vector intrinsics and that stupid in-order architecture. At least the latter will apparently be dropped in Knights Landing.
For what it's worth: I'll be looking forward to NVIDIA's Maxwell. At least CUDA got the vectorization problem sorted out. And no: not even the Intel compiler handles vectorization well.

--
Computer simulation made easy -- LibGeoDecomp
1. Re:Programmability? by PhrostyMcByte · 2014-01-04 13:42 · Score: 2
  
  Intel's AVX-512 is really friggin cool, and a huge departure from their SIMD of the past. It adds some important features -- most notably mask registers to optimally support complex branching -- which make it nearly identical to GPU coding so that compilers will have a dramatically easier time targeting it. I doubt it will kill discrete GPUs any time soon, but it's a big step in that long-term direction.
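The mask-register idea mentioned above can be sketched in a few lines. This is only an illustration of predication in general, not AVX-512 intrinsics: plain Python lists stand in for vector lanes, and the names `masked_select`, `branchy_scalar` and `predicated_vector` are made up for the example.

```python
# Sketch of mask-based predication; plain lists stand in for vector lanes.
# All names here are illustrative and are not part of any Intel API.

def masked_select(mask, if_true, if_false):
    """Per-lane blend: take if_true where the mask is set, else if_false."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]

def branchy_scalar(xs):
    """Reference version with an ordinary per-element branch."""
    out = []
    for x in xs:
        if x < 0:
            out.append(-x)
        else:
            out.append(x * 2)
    return out

def predicated_vector(xs):
    """Same result without a branch: evaluate both sides, then blend by mask."""
    mask = [x < 0 for x in xs]      # the comparison produces the mask
    negated = [-x for x in xs]      # "then" side, computed for every lane
    doubled = [x * 2 for x in xs]   # "else" side, computed for every lane
    return masked_select(mask, negated, doubled)
```

Both functions return the same list for the same input; the second form is the shape a compiler needs in order to turn a branchy loop into straight-line vector code.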
2. Re:Programmability? by Arakageeta · 2014-01-04 13:47 · Score: 2
  
  It's not entirely syntactical. Local shared memory is exposed to the CUDA programmer (e.g., __syncthreads()). CUDA programmers also have to be mindful of register pressure and the L1 cache. These issues directly affect the algorithms used by CUDA programmers. CUDA programmers have control over very fast local memory---I believe that this level of control is missing from MIC's available programming models. Being closer to the metal usually means a harder time programming, but higher performance potential. However, I believe NVIDIA has made CUDA pretty programmer friendly, given the architectural constraints. I'd like to hear the opinions of MIC programmers, since I have no direct experience with MIC.
3. Re:Programmability? by godrik · 2014-01-04 14:07 · Score: 2
  
  I don't understand. MIC is your regular cache-based architecture. Accessing L1 cache on MIC is very fast (3-cycle latency, if my memory is correct). You have similar register constraints on MIC, with 32 512-bit vector registers per thread (or maybe per core). Both architectures overlap memory latency by using hardware threading.
  I have programmed both MIC and GPU, mainly on sparse algebra and graph kernels. Quite frankly, there are differences, but I find them much more alike than most people acknowledge. The main difference, in my opinion, is the programming model: GPUs are used with millions of threads, while MIC is better used with fewer threads and more of a work pool. Atomics are really fast on GPUs and not so fast on MIC. But you also have much finer-grained thread synchronization opportunities on MIC, which somewhat reduces the need for fast atomics.
4. Re:Programmability? by KonoWatakushi · 2014-01-04 22:37 · Score: 2
  
  The recently revealed Mill architecture is far more interesting, and also offers a much more attractive programming model. It is a highly orthogonal architecture naturally capable of wide MIMD and SIMD. Vectorization and software pipelining of loops is discussed in the "metadata" talk, and is very clever and elegant. Those who have personally experienced the tedium of typical vector extensions will appreciate it all the more.
  Based on sim, the creators expect an order of magnitude improvement of performance/W/$ over conventional architectures. (That being a very conservative estimate, with the goal being DSP-like efficiency on general purpose code.) How they propose to achieve that is fascinating, but I'm even more excited about the potential impact on software development. The architecture described will vastly simplify the OS and compilers, and remove or greatly reduce a number of typical inefficiencies.
  Knights Landing leaves me with the usual impression of Intel using brute force and process superiority to retain the edge, and the Mill may offer enough of an architectural improvement to finally put an end to that. It would still be a long road, but it is a nice thought.
Re:Yay more cores that I won't be using much of! by Anonymous Coward · 2014-01-04 11:23 · Score: 3, Insightful

Yes, it's too hard. The future is in concurrency. The actor model will probably take off since it's easy to pick up and use.
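For readers who have not met the actor model: a minimal sketch, assuming a thread plus a queue as the mailbox. The class name `CounterActor` and its methods are invented for this illustration and do not come from any actor library.

```python
import queue
import threading

class CounterActor:
    """Toy actor: private state, a mailbox, and one thread draining it."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._total = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, amount):
        """Other threads never touch the state; they only post messages."""
        self._mailbox.put(amount)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:  # sentinel: shut down
                break
            self._total += message

    def stop(self):
        """Ask the actor to finish, wait for it, and return its final state."""
        self._mailbox.put(None)
        self._thread.join()
        return self._total
```

Because only the actor's thread ever reads or writes `_total`, no lock is needed around it; the mailbox is the single synchronization point.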
Re:Yay more cores that I won't be using much of! by icebike · 2014-01-04 11:35 · Score: 2, Insightful

Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Pretty sure it wasn't meant for you (or me).

--
Sig Battery depleted. Reverting to safe mode.
Requires parallelism by tepples · 2014-01-04 11:42 · Score: 5, Informative

Multicore implies more speed only if your process is parallelized. Not all interactive processes on a single-user computer can be, wrote Amdahl.
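Amdahl's law can be written down directly. A minimal sketch; the function name `amdahl_speedup` and the example fractions are assumptions chosen for illustration, not measurements of any real program.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can be parallelized.

    parallel_fraction: share of the single-core run time that parallelizes (0..1)
    cores: number of cores the parallel part is spread across
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

if __name__ == "__main__":
    for fraction in (0.5, 0.95):
        print(fraction, round(amdahl_speedup(fraction, 72), 2))
```

With half the work parallel, 72 cores give a bit under 2x; even at 95% parallel the bound is roughly 15.8x, far short of 72x.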
1. Re:Requires parallelism by Morpf · 2014-01-04 12:19 · Score: 2
  
  I think you'd be surprised how many real world day to day tasks can be and are parallelized: almost everything concerning audio and video (images or movies), searching, analyzing, rendering web pages, compiling, computing physics and AI for games.
  I can't think of one computing intensive day to day action that is not parallelized or wouldn't be easy to do so.
Re:Yay more cores that I won't be using much of! by H0p313ss · 2014-01-04 11:49 · Score: 4, Insightful

Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Pretty sure it wasn't meant for you (or me).
However, for servers, including hypervisors, it would be very interesting. There are lots of client/server products that scale better with more cores.

--
XML is known as a key material required to create SMD: Software of Mass Destruction
Re:No it cannot compete with nVidia and AMD/ATI by rsmith-mac · 2014-01-04 12:04 · Score: 5, Informative

"eDRAM" in this article is almost certainly an error for that reason.
eDRAM isn't very well defined, but it basically boils down to "DRAM manufactured on a modified logic process," allowing it to be placed on-die alongside logic, or at the very least built using the same tools if you're a logic house (Intel, TSMC, etc). This is as opposed to traditional DRAM, which is made on dedicated processes that are optimized for space (capacitors) and follow their own development cadence.
The article notes that this is on-package as opposed to on-die memory, which under most circumstances would mean regular DRAM would work just fine. The biggest example of on-package RAM would be SoCs, where the DRAM is regularly placed in the same package for size/convenience and then wire-bonded to the processor die (although alternative connections do exist). Conversely eDRAM is almost exclusively used on-die with logic - this being its designed use - chiefly as a higher density/lower performance alternative to SRAM. You can do off-die eDRAM, which is what Intel does for Crystalwell, but that's almost entirely down to Intel using spare fab capacity and keeping production in house (they don't make DRAM) as opposed to technical requirements. Which is why you don't see off-die eDRAM regularly used.
Or to put it bluntly, just because DRAM is on-package doesn't mean it's eDRAM. There are further qualifications to making it eDRAM than moving the DRAM die closer to the CPU.
But ultimately as you note cost would be an issue. Even taking into account process advantages between now and the Knight's Landing launch, 16GB of eDRAM would be huge. Mind bogglingly huge. Many thousands of square millimeters huge. Based on space constraints alone it can't be eDRAM; it has to be DRAM to make that aspect work, and even then 16GB of DRAM wouldn't be small.
Unobtainium by Anonymous Coward · 2014-01-04 12:12 · Score: 3, Insightful

This is another one of those IBM things made from the most rare element in the universe: unobtainium. You can't get it here. You can't get it there either. At one point I would have argued otherwise, but no. Cuda cores I can get. This crap I can't get. It's just like the Cell Broadband engine. Remember that? If you bought a PS3, then it had a (slightly crippled) one of those in it. Except that it had no branch prediction. And one of the main cores was disabled. And you couldn't do anything with the integrated graphics. And if you wanted to actually use the co-processor functions, you had to re-write your applications. And you needed to let IBM drill into your teeth and then do a rectal probe before you could get any of the software to make it work. And it only had 256MB of ram. And you couldn't upgrade or expand that. With IBM's new wonder, we get the promise of 72 cores. If you have a dual-xeon processor. And give IBM a million dollars. And you sign a bunch of papers letting them hook up the high voltage rectal probes. Or you could buy a Kepler NVIDIA card which you can install into the system you already own, and it costs about the same as a half-decent monitor. And NVIDIA's software is publicly downloadable. So is this useful to me or 99.999% of the people on /.? No. It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B.
1. Re:Unobtainium by radarskiy · 2014-01-04 15:22 · Score: 2
  
  "Its news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B."
  I would rather have that market than all of the rest.
Embarrassingly parallel by tepples · 2014-01-04 12:20 · Score: 4, Informative

You saw a speed-up because video and 3D are in a class of problems that are very easy to parallelize. So is decompressing all the images in an HTML document. Laying out the document, on the other hand, isn't so easy to parallelize, if only because every floating box theoretically affects all the boxes that follow it.
Re:No it cannot compete with nVidia and AMD/ATI by im_thatoneguy · 2014-01-04 12:28 · Score: 2

An Nvidia Quadro card costs $8,000 for an 8GB card. I would consider $8,000 "many thousands of dollars". Nobody is suggesting Knights ____ is competing with any consumer chips CPU or GPU. I have a $1,500 Raytracing card in my system along with a $1,000 GPU as well as a $1,000 CPU. If this could replace the CPU and GPU but compete with a dual CPU system for rendering performance I would be a happy camper even if it cost $3-4k.
How does the intercommunication work? by Animats · 2014-01-04 12:33 · Score: 4, Informative

OK, we have yet another mesh of processors, an idea that comes back again and again. The details of how processors communicate really matter. Is this a totally non-shared-memory machine? Is there some shared memory, but it's slow? If there's shared memory, what are the cache consistency rules?
Historically, meshes of processors without shared memory have been painful to program. There's a long line of machines, from the nCube to the Cell, where the hardware worked but the thing was too much of a pain to program. Most designs have suffered from having too little local memory per CPU. If there's enough memory per CPU to, well, run at least a minimal OS and some jobs, then the mesh can be treated as a cluster of intercommunicating peers. That's something for which useful software exists. If all the CPUs have to be treated as slaves of a control machine, then you need all-new software architectures to handle them. This usually results in one-off software that never becomes mature.
Basic truth: we only have three successful multiprocessor architectures that are general purpose - shared-memory multiprocessors, clusters, and GPUs. Everything other than that has been almost useless except for very specialized problems fitted to the hardware. Yet this problem needs to be cracked - single CPUs are not getting much faster.
1. Re:How does the intercommunication work? by joib · 2014-01-04 19:47 · Score: 4, Informative
  
  The mesh replaces the ring bus used in the current generation MIC as well as mainstream Intel x86 CPUs. Each node in the mesh is 2 CPU cores and L2 cache. The mesh is used for connecting to the DRAM controllers, external interfaces, L3 cache, and of course, for cache coherency. The memory consistency model is the standard x86 one. So from a programmability point of view, it's a multi-core x86 processor, albeit with slow serial performance and beefy vector units.
I fail to see parallelism in CSS flow by tepples · 2014-01-04 12:44 · Score: 3, Insightful

I think you'd be surprised how many real world day to day tasks can be and are parallelized: [...] searching
I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?

rendering web pages
I don't see how rendering a web page can be fully parallelized. Decoding images, yes. Compositing, yes. Parsing and reflow, no. The size of one box affects every box below it, especially when float: is involved. And JavaScript is still single-threaded unless a script is 1. being displayed from a web server (Chrome doesn't support Web Workers in file:// for security reasons), 2. being displayed on a browser other than IE on XP, IE on Vista, and Android Browser <= 4.3 (which don't support Web Workers at all), and 3. not accessing the DOM.

compiling
True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.
1. Re:I fail to see parallelism in CSS flow by Anonymous Coward · 2014-01-04 13:15 · Score: 2, Funny
  
  Are you in Colorado?
Re:Yay more cores that I won't be using much of! by morcego · 2014-01-04 12:59 · Score: 5, Funny

Because you can never have too many cores that you aren't using most of the time.
Install McAfee Antivírus, and problem solved: no more unused cores.

--
morcego
Re:Amdahl's Law by sjames · 2014-01-04 13:35 · Score: 2

Keep in mind, Amdahl's law can be expanded to all processes that make up a system. Even if you are using a single process program, it can benefit from not having to share its core with the various system processes.
If the program uses async I/O, that counts as parallelism.
Re:No it cannot compete with nVidia and AMD/ATI by Anonymous Coward · 2014-01-04 13:50 · Score: 3, Informative

It may not be eDRAM, but I'm not sure what else Intel would easily package with the chip. We know the 128 MB of eDRAM on 22 nm is ~80 mm^2 of silicon, currently Intel is selling ~100 mm^2 of N-1 node silicon for ~$10 or less (See all the ultra cheap 32 nm clover trail+ tablets where they're winning sockets against allwinner, rockchip, etc., indicating that they must be selling them for equivalent or better prices than these companies.) By the time this product comes out 22 nm will be the N-1 node. In addition, a dedicated eDRAM chip is probably cheaper than a typical SoC/logic chip due to the smaller number of metal levels that are needed. Assuming N-1 node prices hold for a given area of silicon, 16 GB will need 12000 mm^2 of silicon (likely less as the current 128 MB die likely uses a not insignificant area for readout circuitry and PHY interface), coming out to around $1200. Add an extra $1000 for your actual processor and you have the current price of a low end Xeon Phi.
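The estimate above is easy to re-run. A rough sketch using only the figures quoted in the comment (128 MB in about 80 mm^2, about $10 per 100 mm^2 of N-1 silicon) and assuming strictly linear scaling; the function name `edram_estimate` and its parameters are hypothetical.

```python
def edram_estimate(target_gb, ref_mb=128, ref_area_mm2=80, usd_per_100mm2=10):
    """Scale a reference eDRAM die linearly to a target capacity.

    Returns (total_area_mm2, total_cost_usd). Linear scaling is an assumption;
    it ignores yield, packaging, and per-die interface overhead.
    """
    dies = target_gb * 1024 / ref_mb          # number of reference-sized dies
    area_mm2 = dies * ref_area_mm2            # total silicon area
    cost_usd = area_mm2 * usd_per_100mm2 / 100
    return area_mm2, cost_usd

if __name__ == "__main__":
    area, cost = edram_estimate(16)
    print(f"{area:.0f} mm^2, about ${cost:.0f}")
```

Straight scaling gives 10,240 mm^2 and about $1,024 for 16 GB, somewhat below the 12,000 mm^2 and $1,200 quoted above but in the same range.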
Intel's version of a IBM/Sony Cell CPU by Required+Snark · 2014-01-04 13:51 · Score: 2

This will have the same usability as the CELL CPU. From TFA:

Second, while Knights Landing can act as a bootable CPU, many applications will demand greater single threaded performance due to Amdahl’s Law. For these workloads, the optimal configuration is a Knights Landing (which provides high throughput) coupled to a mainstream Xeon server (which provides single threaded performance). In this scenario, latency is critical for communicating results between the Xeon and Knights Landing.

So there will be a useful mainstream CPU closely coupled with a bunch of vector oriented processors that will be hard to use effectively. (Also from TFA).

The rumors also state that the KNL core will replace each of the floating point pipelines in Silvermont with a full blown 512-bit AVX3 vector unit, doubling the FLOPs/clock to 32.

So unless there is a very high compute to memory access ratio this monster will spend most of its time waiting for memory and converting electrical energy to heat. Plus writing software that uses 72 cores is such a walk in the park...
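The headline numbers can be tied together with simple arithmetic. A sketch that takes 72 cores and 32 FLOPs/clock per core from the summary and TFA; no clock speed is given here, so the value this computes is only what those two figures would imply for 3 teraflops, and the function names are invented for the example.

```python
def peak_flops(cores, flops_per_clock, clock_hz):
    """Theoretical peak: every core retires its maximum FLOPs every cycle."""
    return cores * flops_per_clock * clock_hz

def clock_needed(target_flops, cores, flops_per_clock):
    """Clock frequency at which the theoretical peak reaches target_flops."""
    return target_flops / (cores * flops_per_clock)

if __name__ == "__main__":
    hz = clock_needed(3e12, cores=72, flops_per_clock=32)
    print(f"implied clock: {hz / 1e9:.2f} GHz")
```

Under these assumptions the implied clock is about 1.3 GHz. It is a peak figure, which is the point of the comment: it is reached only if memory can keep all 72 vector units fed.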

--
Why is Snark Required?
Imagine, 2 by Futurepower(R) · 2014-01-04 14:28 · Score: 2, Funny

Imagine having one of those in your smartphone. You could answer text messages 1 microsecond faster. The battery life wouldn't be good.
Re:Then a dual core should be plenty by sjames · 2014-01-04 14:36 · Score: 2

Not necessarily. A process could be CPU bound and prefer not to make it worse by also waiting for I/O completion. Let another core drive the filesystem and talk to the block device (which might be a soft RAID).
My system frequently enough is busy compressing video or doing large compiles in the background while I work in the foreground.
If all you're doing is word processing, single thread speed isn't all that important either since it's mostly waiting for you to press a key.
Re:Bitcoin/Litecoin Performance by InvalidError · 2014-01-04 15:01 · Score: 4, Interesting

Bitcoin has ASIC miners with ~10X the mining power per watt of most programmable alternatives such as GPGPU and FPGA. Anything less efficient than that is or soon will become cost-prohibitive to run.
The newer Bitcoin alternatives use memory-bound algorithms to prevent such a steep mining power escalation since memory capacity and bandwidth scale much more slowly than processing power but much more quickly on costs: with Bitcoin, increasing throughput by 10X simply required 10X the processing power but with the memory-bound alternatives, you also need 10X the RAM and 10X the memory bandwidth.
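To make the memory-bound point concrete, here is a sketch that assumes scrypt as the memory-hard function, with the parameters Litecoin uses (N=1024, r=1, p=1); the comment itself does not name an algorithm. `hashlib.scrypt` is in the Python standard library when Python is built against OpenSSL 1.1 or later; the helper names are made up.

```python
import hashlib

def scrypt_memory_bytes(n, r):
    """Approximate size of scrypt's main scratch buffer: 128 * N * r bytes."""
    return 128 * n * r

def total_memory_bytes(parallel_hashes, n, r):
    """Each hash in flight needs its own buffer, so memory grows with throughput."""
    return parallel_hashes * scrypt_memory_bytes(n, r)

def memory_hard_hash(data, salt, n=1024, r=1, p=1):
    """One scrypt evaluation with the assumed Litecoin-style parameters."""
    return hashlib.scrypt(data, salt=salt, n=n, r=r, p=p, dklen=32)

if __name__ == "__main__":
    print(scrypt_memory_bytes(1024, 1) // 1024, "KiB per hash")
    print(total_memory_bytes(10, 1024, 1) // 1024, "KiB for 10 hashes in flight")
    print(memory_hard_hash(b"example header", b"example salt").hex())
```

With these parameters each hash needs about 128 KiB of scratch memory, so running ten hashes at once needs ten times that, which is the scaling the comment describes.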
Re:Then a dual core should be plenty by tepples · 2014-01-04 15:03 · Score: 2

To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do)
Let another core drive the filesystem and talk to the block device (which might be a soft RAID).

My system frequently enough is busy compressing video or doing large compiles in the background while I work in the foreground.
Then you're not most users. I was under the impression that most users tend not to use soft RAID 5/6 or CPU-intensive file systems, compress large videos, or do large compiles. I too compress video and do compiles, but geeks such as you and myself are edge cases.
ipad by goombah99 · 2014-01-04 16:43 · Score: 4, Funny

They tested this for the next iPad. While Apple felt the 5 second battery life was too short to be practical, the beta testers were more concerned about the Apple-shaped 3rd degree burns imprinted on their thighs and palms.

--
Some drink at the fountain of knowledge. Others just gargle.
1. Re: ipad by Macthorpe · 2014-01-05 01:56 · Score: 5, Funny
  
  To be fair, Apple are very committed to branding.
  
  --
  "It does not do to leave a live dragon out of your calculations, if you live near him." - Tolkien
Re:Yay more cores that I won't be using much of! by Krishnoid · 2014-01-04 17:40 · Score: 2

Pretty sure it wasn't meant for you (or me).
Obviously -- 64 cores should be enough for any one person.
Re:Yay more cores that I won't be using much of! by Guy+Harris · 2014-01-04 19:57 · Score: 3, Informative

Where are you getting Atom cores from?
From this Extremetech article, which has a slide speaking of the Knights Landing processor architecture having "up to 72 Intel Architecture cores based on Silvermont (Intel(R) Atom processor)"?
Re:Yay more cores that I won't be using much of! by Bengie · 2014-01-05 07:33 · Score: 3, Informative

You're both correct. The original Atom CPU was built separately and started before the i7 arch. The new Silvermont "Atom" is based a lot on the i7 arch. It is a huge upgrade to the Atom line. It's like the original i7 fine-tuned for power and running on 22nm. Very strong OoO pipeline design. The low power usage is great for a many-core design because efficiency is more important than single-threaded performance.