Intel's Knights Landing — 72 Cores, 3 Teraflops
New submitter asliarun writes "David Kanter of Realworldtech recently posted his take on Intel's upcoming Knights Landing chip. The technical specs are massive, showing Intel's new-found focus on throughput processing (and possibly graphics). 72 Silvermont cores with beefy FP and vector units, mesh fabric with tile based architecture, DDR4 support with a 384-bit memory controller, QPI connectivity instead of PCIe, and 16GB on-package eDRAM (yes, 16GB). All this should ensure throughput of 3 teraflop/s double precision. Many of the architectural elements would also be the same as Intel's future CPU chips — so this is also a peek into Intel's vision of the future. Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics? Or will this be another Larrabee? Or just an exotic HPC product like Knights Corner?"
Imagine a Beowulf cluster of these!
Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Sent from my ENIAC
Summary asks:
...but first it says it has 16GB of eDRAM. The 128MB of eDRAM in their "Iris Pro" adds almost $200 to the price tag.
Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics?
This chip is going to cost MANY THOUSANDS OF DOLLARS.
"His name was James Damore."
I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke: to get any serious performance out of the current generation of MICs you have to wrestle with vector intrinsics and that stupid in-order architecture. At least the latter will apparently be dropped in Knights Landing.
For what it's worth: I'll be looking forward to NVIDIA's Maxwell. At least CUDA got the vectorization problem sorted out. And no: not even the Intel compiler handles vectorization well.
Computer simulation made easy -- LibGeoDecomp
In my opinion, the point of using x86 here is to reuse units from desktop/server CPUs; that is the basis of these experiments. The downside is having to deal with the x86 mess everywhere. This seems a desperate reaction to AMD's CPU+GPGPU, which also has drawbacks. I bet that both Intel and AMD prefer to keep the memory controller as simple as possible, giving themselves a comfortable long run without burning their ships too early. E.g. a CPU+GPGPU on the same die, with 8 x 128-bit separate memory controllers configured as NUMA (i.e. without channel interleaving/bonding) would be much better, but it would imply expensive chips, motherboards, and more DRAM chips. So I bet we'll have same-die CPU+GPU plus a simple memory controller (even with embedded RAM in a 3D package) for the next 20 years (consumer-grade products).
You aren't ever going to see this at Newegg.
Help stamp out iliturcy.
Too bad most Intel CPUs don't have it and just about all 2011 boards don't use it. The ones that do use it for dual CPU.
Too bad the Apple Mac Pro does not have this and is not likely to use it any time soon.
Will this be any better on Bitcoin/Litecoin mining than anything else?
Does this by any chance look like either the chip from the Terminator films, or a Borg Cube?
I mean, they've been working on 3d chips for years, right?
Multicore implies more speed only if your process is parallelized. Not all interactive processes on a single-user computer can be, wrote Amdahl.
How long before it's a tiny little square I can drop into my mobo's cpu slot?
Actually Amdahl described the theoretical speed-up given the percentage of a process that can be parallelized and the number of processors. Let the latter go towards infinity and you get the maximum theoretical speed-up / minimum theoretical run-time.
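To put numbers on that, here is a minimal sketch of Amdahl's formula in Python. The function names and the 90% parallel fraction are just illustrative assumptions; only the 72-core count comes from the summary.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Theoretical speed-up when `parallel_fraction` of the run time
    can be spread evenly over `processors` and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction):
    """Speed-up as the processor count goes towards infinity:
    only the serial part of the run time remains."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    p = 0.9  # assumed parallel fraction, purely for illustration
    for n in (2, 4, 72):
        print(n, "processors:", round(amdahl_speedup(p, n), 2))
    print("upper bound:", round(amdahl_limit(p), 2))
```

With any fixed serial part, the curve flattens out quickly, which is why piling on cores helps only as far as the workload is actually parallel.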
This is another one of those IBM things made from the most rare element in the universe: unobtainium. You can't get it here. You can't get it there either. At one point I would have argued otherwise, but no. CUDA cores I can get. This crap I can't get. It's just like the Cell Broadband Engine. Remember that? If you bought a PS3, then it had a (slightly crippled) one of those in it. Except that it had no branch prediction. And one of the main cores was disabled. And you couldn't do anything with the integrated graphics. And if you wanted to actually use the co-processor functions, you had to re-write your applications. And you needed to let IBM drill into your teeth and then do a rectal probe before you could get any of the software to make it work. And it only had 256MB of RAM. And you couldn't upgrade or expand that. With IBM's new wonder, we get the promise of 72 cores. If you have a dual-Xeon processor. And give IBM a million dollars. And you sign a bunch of papers letting them hook up the high voltage rectal probes. Or you could buy a Kepler NVIDIA card which you can install into the system you already own, and it costs about the same as a half-decent monitor. And NVIDIA's software is publicly downloadable. So is this useful to me or 99.999% of the people on /.? No. It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B.
You saw a speed-up because video and 3D are in a class of problems that are very easy to parallelize. So is decompressing all the images in an HTML document. Laying out the document, on the other hand, isn't so easy to parallelize, if only because every floating box theoretically affects all the boxes that follow it.
implies it, sure.
but a stacked CPU cube built only for watercooling (coolant would travel throughout the cube) would be damn spiffy.
read IBM was pondering somewhere along these lines a while back.
That chip is probably 3 times the size of Knights Landing. Seriously it might be time for a new naming scheme.
OK, we have yet another mesh of processors, an idea that comes back again and again. The details of how processors communicate really matter. Is this a totally non-shared-memory machine? Is there some shared memory, but it's slow? If there's shared memory, what are the cache consistency rules?
Historically, meshes of processors without shared memory have been painful to program. There's a long line of machines, from the nCube to the Cell, where the hardware worked but the thing was too much of a pain to program. Most designs have suffered from having too little local memory per CPU. If there's enough memory per CPU to, well, run at least a minimal OS and some jobs, then the mesh can be treated as a cluster of intercommunicating peers. That's something for which useful software exists. If all the CPUs have to be treated as slaves of a control machine, then you need all-new software architectures to handle them. This usually results in one-off software that never becomes mature.
Basic truth: we only have three successful multiprocessor architectures that are general purpose - shared-memory multiprocessors, clusters, and GPUs. Everything other than that has been almost useless except for very specialized problems fitted to the hardware. Yet this problem needs to be cracked - single CPUs are not getting much faster.
The article appears to be about rumors mangled together. Socketed chip versus PCIe board, fabric over the PCIe versus over the host processor, while forgetting the chip-integrated possibility. If the DDR4 rumor is correct, it simply suggests that the coming Xeon sockets Intel use for serving HPC and low to mid-end server markets (E5) utilize 6 directly connected memory channels with maximum DIMM size of 64 GB.
I think you'd be surprised how many real world day to day task can be and are parallelized: [...] searching
I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?
rendering web pages
I don't see how rendering a web page can be fully parallelized. Decoding images, yes. Compositing, yes. Parsing and reflow, no. The size of one box affects every box below it, especially when float: is involved. And JavaScript is still single-threaded unless a script is 1. being displayed from a web server (Chrome doesn't support Web Workers in file:// for security reasons), 2. being displayed on a browser other than IE on XP, IE on Vista, and Android Browser <= 4.3 (which don't support Web Workers at all), and 3. not accessing the DOM.
compiling
True, each translation unit can be combined in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.
So there will be a useful mainstream CPU closely coupled with a bunch of vector oriented processors that will be hard to use effectively. (Also from TFA).
So unless there is a very high compute to memory access ratio this monster will spend most of its time waiting for memory and converting electrical energy to heat. Plus writing software that uses 72 cores is such a walk in the park...
Why is Snark Required?
PCIe bandwidth is often the major performance bottleneck to GPGPU applications (i.e., NVIDIA's CUDA). If we suppose that Knights Landing's compute performance is always _worse_ than what NVIDIA can offer, Intel, with QPI, could still push NVIDIA out of HPC for bandwidth-constrained applications. That is, NVIDIA could have the greatest GPU, but if their GPU is stuck on pokey PCIe, Knight's Landing + QPI could still offer better performance. It would be neat if Intel could let NVIDIA on the QPI bus, but that might not even be technically feasible even if Intel were willing to license the technology (as if that would ever happen). AMD also has HyperTransport/Fusion to leverage for their GPGPU solutions. NVIDIA needs to find something better than PCIe.
Even if you are using a single-process program, it can benefit from not having to share its core with the various system processes.
Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other. To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do), or run more than one user at once using dual monitors, dual keyboards, and dual mice (which most desktop PC operating systems tend not to support).
If the program uses async I/O, that counts as parallelism.
That counts as being I/O bound, and if all your processes are I/O bound, even a single core with simultaneous multithreading is enough.
Why stop at a meager 72 cores!
Why not, A Core For Every Instruction!
Then the "nasty" moves to the "central message passing unit parallelized on vectors Ernestine" what a laugh-n!
This of course ignites the "Core Arms Race"!
Then the "USA" vs "World" launch into a "Core Per Syllable Arms Race"!
The butt-zillions of oil-dollars will be spent wildly at an Above Fast N Furious Pace to the ultimate CORE THOUGHT paradigm shift.
Yet, there is the "NSA Wild Card."
[Dealer] What's your bet?
[High Roller] Hold! [;-)]
[Dealer] "Pervert fucker!" ;-)
Imagine having one of those in your smartphone. You could answer text messages 1 microsecond faster. The battery life wouldn't be good.
is still gonna be horrible except in very specific, and highly multithreaded scenarios... some shared parts between cores on a tile.. simply looks like intel is copying amd's current architecture (intel 'tile' = amd 'module'), mixing in an advance of their own (the on-die 'edram'), and throwing more cores at it than even amd has tried.
to keep thermals in check for a single socket package, you're looking at something that draws less than 2w per core under load.. that's a little less than what a current silvermont n2805 mobile celeron draws (4.3w for 2 cores, 1 thread per core), which delivers a whopping ..wait for it.. 228 passmarks per core. the original atom 230 scores a 301 passmarks, and we know how fast that piece of shit was. compare to a current low end desktop chip that runs win vista or newer with 'usable' performance when paired with 4gb ram that scores about 2000 passmarks on 2 cores
In my experience, most cases where compilation takes a long time involve multiple compilation units. I have a fair bit of experience with compiling linux distros professionally...when you're building glibc and the kernel and five hundred other packages it'll use as many cores as you can throw at it.
True, each translation unit can be combined in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.
In my experience, most cases where compilation takes a long time involve multiple compilation units.
That's what I said. But a lot of times nowadays, the compiler is set to perform whole-program optimization on release builds to try to save cycles even in calls from a function in one translation unit of a program to a function in another. Mozilla's Firefox web browser, for example, is so big that it can't be compiled with profile-guided whole-program optimization on 32-bit machines. But I'll grant that a multi-core CPU speeds up debug builds.
when you're building glibc and the kernel and five hundred other packages
Not many people are maintainers of an operating system distribution.
when is this new CPU going to be available at Frys
As I wrote elsewhere: laying out a web page that includes float-styled elements. That fits 1) and 2), and it fits 3) on a netbook or tablet with an ARM or Atom processor. Or repaginating a document in a word processor, which happens every time the user enters enough text to make the current paragraph one line longer, deletes enough to make it one line shorter, or changes the styling of any span of text. Repagination may affect figures, references to page numbers elsewhere in the document, etc. Repaginating text after the visible page can be deferred unless there's a "See page n" elsewhere in the document, which may even end up triggering repagination of text before the edit if the new page number has more or fewer digits than the old page number.
Also the PBKDF2 key stretching used to connect to a WPA2 access point, when run on a similarly slow machine.
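For anyone wondering why that one resists extra cores, here is a rough sketch of the WPA2 passphrase-to-key derivation (PBKDF2 with HMAC-SHA1, the SSID as salt, 4096 iterations, 256-bit output). The helper names `pbkdf2_block` and `wpa2_psk` and the example passphrase/SSID are made up for illustration; in real code `hashlib.pbkdf2_hmac` does all of this in one call.

```python
import hashlib
import hmac


def pbkdf2_block(password, salt, iterations, block_index):
    """One PBKDF2 output block. Every HMAC takes the previous HMAC's
    output as its input, so the loop cannot be split across cores."""
    u = hmac.new(password, salt + block_index.to_bytes(4, "big"),
                 hashlib.sha1).digest()
    t = u
    for _ in range(iterations - 1):
        u = hmac.new(password, u, hashlib.sha1).digest()  # depends on prior u
        t = bytes(a ^ b for a, b in zip(t, u))
    return t


def wpa2_psk(passphrase, ssid):
    """256-bit pre-shared key: two 20-byte SHA-1 blocks, cut to 32 bytes."""
    pw, salt = passphrase.encode(), ssid.encode()
    blocks = pbkdf2_block(pw, salt, 4096, 1) + pbkdf2_block(pw, salt, 4096, 2)
    return blocks[:32]


if __name__ == "__main__":
    # Hypothetical passphrase and SSID, for illustration only.
    print(wpa2_psk("correct horse", "HomeNetwork").hex())
```

The two output blocks are independent of each other, so at best two cores help; inside each block the 4096 HMAC steps form a strict chain.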
Also compressing a large still image. I don't see how the DEFLATE codec used by, say, PNG can be parallelized.
I just read up on QPI on the wiki, and it's a point-to-point processor interconnect, which replaces the front side bus in Xeon and certain desktop platforms - presumably the Core i7. PCIe, OTOH, is a serial computer expansion bus standard, which can take in things like graphics cards, SSDs, network cards and other such peripheral controllers. I just don't see how QPI is any sort of a replacement for PCIe. That would almost be like arguing for PCIe being superseded by USB4 or something.
Essentially, QPI is Intel's equivalent of the HyperTransport that AMD uses. The PCIe part of it is completely separate - I doubt one will have QPI graphics cards or SSDs
They tested this for the next ipad. While apple felt the 5 second battery life was too short to be practical, the beta testers were more concerned about the apple shaped 3rd degree burns imprinted on their thighs and palms
Some drink at the fountain of knowledge. Others just gargle.
This will be awesome for animating all those tiles on the windows 8 home screen. Might not even be enough cores for that.
My slow ass typing in MS Word will be FASTER than ever!
Shoes for Industry. Shoes for the Dead.
you aren't doing much on your computer. Try doing special effects graphics, or stock market analysis. Or even just start up an Android emulator - it's excruciatingly slow.
Sent from my ENIAC
How would one of these do for software synths and audio processing? From my past use, these appear to use up any and everything you can throw in a PC; some of them need fat GPUs as well, on top of as much CPU/RAM as possible.
Thanks.
Computer simulation made easy -- LibGeoDecomp
Now, that's a name I've not heard in a long time. A long time.
You make a good point about use of a distributed index. But implementing a distributed index on separate machines will probably lead to far less RAM contention than implementing it on several cores that share one memory.
Parsing and reflow can be efficiently parallelized if sufficient parents have their heights determined by something other than their contents
Good luck determining the height of, say, a Slashdot comment (or, worse, a Slashdot page's entire comment section) other than by its contents. No, heights can't practically be fixed server-side because different machines have different viewport widths, different fonts installed, and different hinting algorithms that affect letter spacing. All of these affect how many lines a paragraph uses.
Even without that, couldn't the children each be processed in parallel for a good portion of them, but possibly needing updating for properties that have dependencies outside of themselves?
Only for documents that don't have floats and declare explicit heights for everything, which I don't think includes the majority of documents.
Say your parser has been parsing several kilobytes of a document, and it hits a quotation mark character (U+0022 or U+0027). Is it an open quote, starting a string, or a close quote, which means throw out everything it has parsed so far and treat it as the end of a string?
Nothing unreal exists, by definition.
Except, of course, for Unreal and other games using Unreal Engine.
No, anytime you are doing multiple things at once you are better off with more cores. If you are watching a video
Foreground application. I'll grant that multiple cores help with decoding really big (1080p or bigger) video, but so does a specialized H.264-specific core or moving half of the decoder to OpenCL.
or playing music
Background application. On an Intel Core CPU, decoding music uses so little CPU power nowadays that it stays within single digit percent utilization of a core. Even on the puny little Atom N450 CPU (1 core, 2x SMT) in my four-year-old Dell netbook, I just measured VLC playing an ogg file at 15% of one half-core.
or encoding/decoding content like running a media server
You said the S word. When a "server" enters the picture, I agree that larger core counts become easier to justify, as background processing begins to dominate. But I'd like to see statistics on how popular PC-based home media servers are in the first place.
You are trying to generalize it for all use cases but not all use cases are the same
I'm trying to find what use cases are most common because economies of scale benefit the most common use cases.
Good luck getting both browser makers and web site publishers to adopt a PNG variant using lz4. The biggest thing that led to PNG adoption in the first place was Unisys's LZW patent assertion. Besides, even after decompression, PNG decoding has a filtering phase where each line depends on the line above it. That can be parallelized by adding unfiltered lines at compression time at the cost of compression ratio.
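Here is a rough sketch of that last idea, handling only PNG filter type 0 (None) and type 2 (Up). The function names are hypothetical, and the thread pool is there only to show that the bands are independent; in CPython the GIL means it won't actually run faster.

```python
from concurrent.futures import ThreadPoolExecutor


def unfilter_rows(rows):
    """Undo PNG filters 0 (None) and 2 (Up) for consecutive scanlines.
    `rows` is a list of (filter_type, filtered_bytes) pairs."""
    out = []
    prior = None
    for ftype, data in rows:
        if ftype == 0:            # None: row stands on its own
            recon = bytes(data)
        elif ftype == 2:          # Up: add the reconstructed row above
            if prior is None:     # the row above the first one counts as zeros
                prior = bytes(len(data))
            recon = bytes((f + p) & 0xFF for f, p in zip(data, prior))
        else:
            raise ValueError("only filter types 0 and 2 are sketched here")
        out.append(recon)
        prior = recon
    return out


def split_at_unfiltered_rows(rows):
    """Start a new band at every type-0 row; such a row needs nothing above it."""
    bands = []
    for row in rows:
        if row[0] == 0 or not bands:
            bands.append([])
        bands[-1].append(row)
    return bands


def unfilter_in_bands(rows):
    """Unfilter each band separately, then stitch the results back together."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(unfilter_rows, split_at_unfiltered_rows(rows))
    return [row for band in results for row in band]
```

Every type-0 row the encoder inserts is a restart point for the decoder, and also a row that compresses worse, which is the trade-off.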
Generalizing only slightly: a single processor chasing pointers will have a hard time maxing out the DDR throughput, although it will definitely be memory bottlenecked due to latency. Multiple processors all doing the same thing on the same memory will not, as a result, compete for bandwidth. Instead, their requests will execute in turn in the DDR
Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight? To take an analogy between DDR and that other DDR, are the requests like a column of arrows on the screen, where I see each arrow a measure before I have to hit it?
Besides, in a RAM latency-bound situation, there's little benefit of multiple full hardware cores over the virtual cores in a simultaneous multithreading architecture such as Intel's Hyper-Threading Technology or the "modules" that AMD introduced with Bulldozer. Furthermore, keeping all these requests in flight requires some sort of synchronization among threads, which when implemented wrong introduces plenty of locking overhead.
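To make the pipelining question concrete, here is a back-of-the-envelope sketch based on Little's law (throughput = requests in flight / latency). The 100-cycle latency and the one-request-per-10-cycles issue rate are the hypothetical figures from the question above, not measured numbers for any real DDR part, and the function name is made up.

```python
def sustained_requests_per_cycle(in_flight, latency_cycles, issue_interval_cycles):
    """Requests completed per cycle when `in_flight` requests are kept queued.
    Limited either by latency (Little's law) or by how often the memory
    can accept a new request, whichever is smaller."""
    latency_bound = in_flight / latency_cycles
    issue_bound = 1.0 / issue_interval_cycles
    return min(latency_bound, issue_bound)


if __name__ == "__main__":
    # Assumed figures: 100-cycle latency, one new request every 10 cycles.
    for queued in (1, 4, 10, 72):
        rate = sustained_requests_per_cycle(queued, 100, 10)
        print(queued, "in flight:", rate, "requests per cycle")
```

Under those assumptions a lone pointer chaser with one request outstanding gets a tenth of what the memory could deliver, and anything past ten requests in flight just waits in the queue.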
http://venturebeat.com/2014/01/05/nvidia-announces-tegra-k1-a-super-mobile-chip-with-192-cores/
as to how many NSA backdoors this will feature.
I can't wait to see what SGI do with this chip :)
Max.
Cray made a brand name of being THE supercomputer. Now it is out of the news, at least... Are these processors equivalent, better or still behind? Things started getting confusing after the Pentium and the sudden increase in clock speeds, but it was clear those were not supercomputers at all. In the eternal struggle to TRULY dedicate a computer to do a single thing...