Intel's Knights Landing — 72 Cores, 3 Teraflops
New submitter asliarun writes "David Kanter of Realworldtech recently posted his take on Intel's upcoming Knights Landing chip. The technical specs are massive, showing Intel's new-found focus on throughput processing (and possibly graphics). 72 Silvermont cores with beefy FP and vector units, mesh fabric with tile based architecture, DDR4 support with a 384-bit memory controller, QPI connectivity instead of PCIe, and 16GB on-package eDRAM (yes, 16GB). All this should ensure throughput of 3 teraflop/s double precision. Many of the architectural elements would also be the same as Intel's future CPU chips — so this is also a peek into Intel's vision of the future. Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics? Or will this be another Larrabee? Or just an exotic HPC product like Knights Corner?"
Imagine a Beowulf cluster of these!
Summary asks:
...but first it says it has 16GB of eDRAM. The 128MB of eDRAM in their "Iris Pro" adds almost $200 to the price tag.
Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics?
This chip is going to cost MANY THOUSANDS OF DOLLARS.
"His name was James Damore."
Because you can never have too many cores that you aren't using most of the time.
Ask the NSA, they might have a (SECRET) opinion on that.
If you want news from today, you have to come back tomorrow.
I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke: to get any serious performance out of the current generation of MICs you have to wrestle with vector intrinsics and that stupid in-order architecture. At least the latter will apparently be dropped in Knights Landing.
For what it's worth: I'll be looking forward to NVIDIA's Maxwell. At least CUDA got the vectorization problem sorted out. And no: not even the Intel compiler handles vectorization well.
Computer simulation made easy -- LibGeoDecomp
Yes, it's too hard. The future is in concurrency. The actor model will probably take off since it's easy to pick up and use.
In my opinion, the point of using x86 is to reuse units from desktop/server CPUs; that is the basis of these experiments. The downside is having to deal with the x86 mess everywhere. This seems a desperate reaction to AMD's CPU+GPGPU, which also has drawbacks. I bet that both Intel and AMD prefer to keep the memory controller as simple as possible, giving themselves a comfortable long run without burning their ships too early. E.g. a CPU+GPGPU on the same die, with 8 x 128-bit separate memory controllers configured as NUMA (i.e. without channel interleaving/bonding) would be much better, but it would imply expensive chips, motherboards, and more DRAM chips. So I bet we'll have same-die CPU+GPU plus a simple memory controller (even with embedded RAM in a 3D package) for the next 20 years (consumer-grade products).
You aren't ever going to see this at Newegg.
Help stamp out iliturcy.
Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Pretty sure it wasn't meant for you (or me).
Sig Battery depleted. Reverting to safe mode.
Too bad most Intel CPUs don't have it and just about all 2011 boards don't use it. The ones that do use it for dual CPU.
Too bad the Apple Mac Pro does not have this and is not likely to use it any time soon.
Will this be any better on Bitcoin/Litecoin mining than anything else?
Doesn't multi core imply more speed, if not by clock then by efficiency?
"If any question why we died, Tell them because our fathers lied."
That's an hpc processor. You are unlikely to deploy that on classical desktop/laptop for a while. Think about it as a classical coprocessor.
Multicore implies more speed only if your process is parallelized. Not all interactive processes on a single-user computer can be, wrote Amdahl.
Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Pretty sure it wasn't meant for you (or me).
However, for servers, including hypervisors, it would be very interesting. There are lots of client/server products that scale better with more cores.
XML is known as a key material required to create SMD: Software of Mass Destruction
It depends on the use case. There are many applications where this would shine. Sure if you want to play Quake 3 Arena it's not going to give you much at all, but if you're doing parallel processing for scientific or engineering applications this would rock.
This is another one of those IBM things made from the most rare element in the universe: unobtainium. You can't get it here. You can't get it there either. At one point I would have argued otherwise, but no. CUDA cores I can get. This crap I can't get. It's just like the Cell Broadband Engine. Remember that? If you bought a PS3, then it had a (slightly crippled) one of those in it. Except that it had no branch prediction. And one of the main cores was disabled. And you couldn't do anything with the integrated graphics. And if you wanted to actually use the co-processor functions, you had to rewrite your applications. And you needed to let IBM drill into your teeth and then do a rectal probe before you could get any of the software to make it work. And it only had 256MB of RAM. And you couldn't upgrade or expand that. With IBM's new wonder, we get the promise of 72 cores. If you have a dual-Xeon processor. And give IBM a million dollars. And you sign a bunch of papers letting them hook up the high voltage rectal probes. Or you could buy a Kepler NVIDIA card which you can install into the system you already own, and it costs about the same as a half-decent monitor. And NVIDIA's software is publicly downloadable. So is this useful to me or 99.999% of the people on /.? No. It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B.
In practice, the percentage of a process on a single-user system that can be parallelized is rarely 100 percent. If one holds the performance of a core constant, even a 1000 core system will still run as slowly as a 1 core system on the fraction that cannot.
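To put rough numbers on that, here is a minimal sketch of Amdahl's law in Python. The function name and the sample fractions are just illustrations, not measurements of any real workload.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales.

    Assumes per-core performance is held constant and ignores all
    synchronization and memory overhead, so real results will be lower.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only; pick your own for a real program.
    for p in (0.50, 0.90, 0.99):
        for n in (2, 72, 1000):
            print(f"parallel={p:.2f} cores={n:4d} speedup<={amdahl_speedup(p, n):6.2f}")
```

Even with 90 percent of the work parallelized, the bound stays below 10x no matter how many cores you add, which is the point being made above.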
You saw a speed-up because video and 3D are in a class of problems that are very easy to parallelize. So is decompressing all the images in an HTML document. Laying out the document, on the other hand, isn't so easy to parallelize, if only because every floating box theoretically affects all the boxes that follow it.
OK, we have yet another mesh of processors, an idea that comes back again and again. The details of how processors communicate really matter. Is this a totally non-shared-memory machine? Is there some shared memory, but it's slow? If there's shared memory, what are the cache consistency rules?
Historically, meshes of processors without shared memory have been painful to program. There's a long line of machines, from the nCube to the Cell, where the hardware worked but the thing was too much of a pain to program. Most designs have suffered from having too little local memory per CPU. If there's enough memory per CPU to, well, run at least a minimal OS and some jobs, then the mesh can be treated as a cluster of intercommunicating peers. That's something for which useful software exists. If all the CPUs have to be treated as slaves of a control machine, then you need all-new software architectures to handle them. This usually results in one-off software that never becomes mature.
Basic truth: we only have three successful multiprocessor architectures that are general purpose - shared-memory multiprocessors, clusters, and GPUs. Everything other than that has been almost useless except for very specialized problems fitted to the hardware. Yet this problem needs to be cracked - single CPUs are not getting much faster.
I think you'd be surprised how many real world day to day tasks can be and are parallelized: [...] searching
I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?
rendering web pages
I don't see how rendering a web page can be fully parallelized. Decoding images, yes. Compositing, yes. Parsing and reflow, no. The size of one box affects every box below it, especially when float: is involved. And JavaScript is still single-threaded unless a script is 1. being displayed from a web server (Chrome doesn't support Web Workers in file:// for security reasons), 2. being displayed on a browser other than IE on XP, IE on Vista, and Android Browser <= 4.3 (which don't support Web Workers at all), and 3. not accessing the DOM.
compiling
True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.
Because you can never have too many cores that you aren't using most of the time.
Install McAfee Antivírus, and problem solved: no more unused cores.
morcego
Keep in mind, Amdahl's law can be expanded to all processes that make up a system. Even if you are using a single process program, it can benefit from not having to share its core with the various system processes.
If the program uses async I/O, that counts as parallelism.
So there will be a useful mainstream CPU closely coupled with a bunch of vector oriented processors that will be hard to use effectively. (Also from TFA).
So unless there is a very high compute to memory access ratio this monster will spend most of its time waiting for memory and converting electrical energy to heat. Plus writing software that uses 72 cores is such a walk in the park...
Why is Snark Required?
Even if you are using a single process program, it can benefit from not having to share its core with the various system processes.
Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other. To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do), or run more than one user at once using dual monitors, dual keyboards, and dual mice (which most desktop PC operating systems tend not to support).
If the program uses async I/O, that counts as parallelism.
That counts as being I/O bound, and if all your processes are I/O bound, even a single core with simultaneous multithreading is enough.
Imagine having one of those in your smartphone. You could answer text messages 1 microsecond faster. The battery life wouldn't be good.
In my experience, most cases where compilation takes a long time involve multiple compilation units. I have a fair bit of experience with compiling linux distros professionally...when you're building glibc and the kernel and five hundred other packages it'll use as many cores as you can throw at it.
This isn't intended for you if you can't think of what to do with all those cores.
This is for the high performance physics folks to whom the difference between 16 cores, 256 cores, and maybe even 8192 cores is a line in a config file.
It's also for the folks developing 24 megapixel RAW files (which Nikon's cheapest SLR spits out these days), where splitting the image into 64 sectors is no more difficult than splitting it into four, or for the folks doing video encoding which is pretty trivially parallelizable.
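As a rough illustration of why the sector count is just a parameter, here is a small Python sketch. The image is a plain list of rows and `brighten` is a made-up stand-in for a real RAW development step; threads are used only to keep the example short, and pure-Python work like this would need processes or native code to actually scale in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(height, tiles):
    """Return (start, stop) row ranges covering [0, height) in nearly equal bands."""
    tiles = max(1, min(tiles, height))  # never more bands than rows
    base, extra = divmod(height, tiles)
    bounds, start = [], 0
    for i in range(tiles):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def brighten_band(image, start, stop, gain):
    """Hypothetical per-pixel operation: scale and clamp to 8 bits."""
    return [[min(255, int(px * gain)) for px in row] for row in image[start:stop]]


def brighten(image, tiles, gain=1.5):
    """Process the image in `tiles` independent bands and stitch them back together."""
    bands = split_rows(len(image), tiles)
    with ThreadPoolExecutor(max_workers=len(bands)) as pool:
        parts = pool.map(lambda b: brighten_band(image, b[0], b[1], gain), bands)
        return [row for part in parts for row in part]


if __name__ == "__main__":
    image = [[(r * 7 + c) % 256 for c in range(8)] for r in range(100)]
    # Same result whether the work is cut into 4 bands or 64.
    print(brighten(image, 4) == brighten(image, 64))
```

Because each pixel depends only on itself, the band boundaries never matter; operations with neighborhoods (sharpening, demosaicing) would need overlapping borders.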
Most of the times that I can think of where I'm truly waiting on my computer to do something that's limited by the number of flops that can be brought to bear, more cores is just as good as more speed.
True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.
In my experience, most cases where compilation takes a long time involve multiple compilation units.
That's what I said. But a lot of times nowadays, the compiler is set to perform whole-program optimization on release builds to try to save cycles even in calls from a function in one translation unit of a program to a function in another. Mozilla's Firefox web browser, for example, is so big that it can't be compiled with profile-guided whole-program optimization on 32-bit machines. But I'll grant that a multi-core CPU speeds up debug builds.
when you're building glibc and the kernel and five hundred other packages
Not many people are maintainers of an operating system distribution.
As I wrote elsewhere: laying out a web page that includes float-styled elements. That fits 1) and 2), and it fits 3) on a netbook or tablet with an ARM or Atom processor. Or repaginating a document in a word processor, which happens every time the user enters enough text to make the current paragraph one line longer, deletes enough to make it one line shorter, or changes the styling of any span of text. Repagination may affect figures, references to page numbers elsewhere in the document, etc. Repaginating text after the visible page can be deferred unless there's a "See page n" elsewhere in the document, which may even end up triggering repagination of text before the edit if the new page number has more or fewer digits than the old page number.
Also the PBKDF2 key stretching used to connect to a WPA2 access point, when run on a similarly slow machine.
Also compressing a large still image. I don't see how the DEFLATE codec used by, say, PNG can be parallelized.
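For what it's worth, one known workaround (tools such as pigz do a variant of this) is to cut the input into chunks, deflate each chunk independently, and concatenate the results into one valid stream. Below is a minimal Python sketch using the standard `zlib` module; the function names and chunk size are my own, and it emits a raw DEFLATE stream rather than the zlib wrapper PNG expects, which would also need a header and an Adler-32 over the whole input. The price is some compression ratio, since no chunk can reference data from the previous one.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def _deflate_chunk(chunk, is_last, level=6):
    """Compress one chunk as raw DEFLATE with no shared history."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # -15 = raw stream, no header
    body = comp.compress(chunk)
    # Z_SYNC_FLUSH ends on a byte boundary without marking the stream final,
    # so the next chunk's output can simply be appended. Only the last chunk
    # is finished with Z_FINISH.
    return body + comp.flush(zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH)


def parallel_deflate(data, chunk_size=1 << 16, workers=4):
    """Deflate `data` in independent chunks and join them into one raw stream."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    last_flags = [i == len(chunks) - 1 for i in range(len(chunks))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(_deflate_chunk, chunks, last_flags))
    return b"".join(parts)


def inflate(raw):
    """Decompress a raw DEFLATE stream with an ordinary single-threaded decoder."""
    return zlib.decompressobj(-15).decompress(raw)


if __name__ == "__main__":
    data = b"knights landing " * 50000
    packed = parallel_deflate(data, chunk_size=4096)
    print(len(data), len(packed), inflate(packed) == data)
```

The output decodes with any ordinary inflater, so only the compressing side has to know about the chunking.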
I just read up on QPI on the wiki, and it's a point-to-point processor interconnect, which replaces the front side bus in Xeon and certain desktop platforms - presumably the Core i7. PCIe, OTOH, is a serial computer expansion bus standard, which can take in things like graphics cards, SSDs, network cards and other such peripheral controllers. I just don't see how QPI is any sort of a replacement for PCIe. That would almost be like arguing for PCIe being superseded by USB4 or something.
Essentially, QPI is Intel's equivalent of the HyperTransport that AMD uses. The PCIe part of it is completely separate - I doubt one will have QPI graphics cards or SSDs
Where are you getting Atom cores from? I read up on QPI, which this design will be using, and that is used only w/ Xeons and i7s. So this chip looks pretty much like the successor to Xeons and i7s, and will probably be seen either in servers, or in Mac Pros, but not likely in your average laptop, much less tablet.
They tested this for the next ipad. While apple felt the 5 second battery life was too short to be practical, the beta testers were more concerned about the apple shaped 3rd degree burns imprinted on their thighs and palms
Some drink at the fountain of knowledge. Others just gargle.
Because you can never have too many cores that you aren't using most of the time.
Sure you can, and 640 is obviously the threshold. Nobody would ever need more than that.
Pretty sure it wasn't meant for you (or me).
Obviously -- 64 cores should be enough for any one person.
My slow ass typing in MS Word will be FASTER than ever!
Shoes for Industry. Shoes for the Dead.
you aren't doing much on your computer. Try doing special effects graphics, or stock market analysis. Or even just start up an Android emulator - it's excruciatingly slow.
Sent from my ENIAC
Did you miss the part in the article about 512-bit AVX and being able to do 32 double precision floating point operations per clock? Or the other part about running four-way SMT to hide memory system latency? Or the other, other part about 128 byte (1024-bit) L1D to CPU bandwidth?
These ain't plain ol' Atom processors.
For HPC workloads, these seem to be right up the alley of "heavy lifting."
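A quick back-of-the-envelope check ties those figures to the 3 teraflop/s in the summary. The 72 cores, 32 double-precision operations per clock, and 3 teraflop/s come from the thread; the clock speed is not stated anywhere here, so the sketch below only works out what clock those numbers would imply. The breakdown of 32 as 2 vector units x 8 doubles x 2 for fused multiply-add is my assumption, not something the comment says.

```python
CORES = 72                  # from the summary
DP_FLOPS_PER_CLOCK = 32     # per core, from the comment above
TARGET_FLOPS = 3e12         # 3 teraflop/s double precision, from the summary

# Assumed breakdown of the 32 (not stated in the thread):
# 2 vector units * 8 doubles per 512-bit register * 2 ops per fused multiply-add.
assert 2 * (512 // 64) * 2 == DP_FLOPS_PER_CLOCK


def peak_flops(cores, flops_per_clock, clock_hz):
    """Theoretical peak: every core retires its maximum every cycle."""
    return cores * flops_per_clock * clock_hz


def implied_clock_hz(target=TARGET_FLOPS, cores=CORES, flops_per_clock=DP_FLOPS_PER_CLOCK):
    """Clock frequency needed for the quoted peak, given the other two figures."""
    return target / (cores * flops_per_clock)


if __name__ == "__main__":
    clock = implied_clock_hz()
    print(f"implied clock: {clock / 1e9:.2f} GHz")
    print(f"peak at that clock: {peak_flops(CORES, DP_FLOPS_PER_CLOCK, clock) / 1e12:.2f} TFLOP/s")
```

So the headline number is consistent with a fairly modest clock, which fits the idea that the throughput comes from wide vectors and many cores rather than from fast single threads.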
Program Intellivision!
For GPUs, until we have one core per pixel for ray-tracing, we're nowhere near the number of cores we could use without even trying too hard.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Where are you getting Atom cores from?
From this Extremetech article, which has a slide speaking of the Knights Landing processor architecture having "up to 72 Intel Architecture cores based on Silvermont (Intel(R) Atom processor)"?
Winterfell?
I have something in common with Stephen Hawking...
How would one of these do for software synths and audio processing? From my past use, these appear to use up anything and everything you can throw at a PC; some of them need fat GPUs as well, on top of as much CPU/RAM as possible.
Couple things:
1) The 22nm Silvermont Atom cores are a complete redesign over the badly aging Atom cores. They are much more powerful, and much more powerful per watt.
2) These chips aren't going to replace CPUs, they are most likely going to compete with Nvidia Tesla - a PCIe card that highly parallel workloads can be offloaded to. One CUDA core isn't very powerful but stick 2688 active ones on a chip and for certain tasks you have a lot of power. The K20X Tesla is capable of 1.3 trillion double-precision FLOPS so if Knights Landing can actually do 3 trillion it's no weak chip.
Thanks.
Computer simulation made easy -- LibGeoDecomp
Actually, if the price is right, it *is* meant for a project I'm working on. I don't need it this year, so it may actually be a possibility.
"Not this August ... but the year after that, or the year after that..."
(Well that's just free association. I left out the context of application, because it didn't apply.)
I think we've pushed this "anyone can grow up to be president" thing too far.
You're both correct. The original Atom CPU was built separately and started before the i7 arch. The new Silvermont "Atom" is based a lot on the i7 arch. It is a huge upgrade to the Atom line. It's like the original i7 fine tuned for power and running on 22nm. Very strong OoO pipeline design. The low power usage is great for a many core design because efficiency is more important than single-threaded performance.
But this is just a bunch of Atom cores....who wants that?
Saying they are "Atom cores" is meaningless, in this case they are Silvermont Atom cores which are based on the current i7 CPU architecture.
I really REALLY wouldn't want to do any heavy lifting with 'em.
And why wouldn't you want to do any "heavy lifting" with a package that puts 72 of them together to provide 3 teraflops of computing power?
Quake 3 raytraced though would shine.
You make a good point about use of a distributed index. But implementing a distributed index on separate machines will probably lead to far less RAM contention than implementing it on several cores that share one memory.
Parsing and reflow can be efficiently parallelized if sufficient parents have their heights determined by something other than their contents
Good luck determining the height of, say, a Slashdot comment (or, worse, a Slashdot page's entire comment section) other than by its contents. No, heights can't practically be fixed server-side because different machines have different viewport widths, different fonts installed, and different hinting algorithms that affect letter spacing. All of these affect how many lines a paragraph uses.
Even without that, couldn't the children each be processed in parallel for a good portion of them, but possibly needing updating for properties that have dependencies outside of themselves?
Only for documents that don't have floats and declare explicit heights for everything, which I don't think includes the majority of documents.
Say your parser has been parsing several kilobytes of a document, and it hits a quotation mark character (U+0022 or U+0027). Is it an open quote, starting a string, or a close quote, which means throw out everything it has parsed so far and treat it as the end of a string?
Nothing unreal exists, by definition.
Except, of course, for Unreal and other games using Unreal Engine.
No, anytime you are doing multiple things at once you are better off with more cores. If you are watching a video
Foreground application. I'll grant that multiple cores help with decoding really big (1080p or bigger) video, but so does a specialized H.264-specific core or moving half of the decoder to OpenCL.
or playing music
Background application. On an Intel Core CPU, decoding music uses so little CPU power nowadays that it stays within single digit percent utilization of a core. Even on the puny little Atom N450 CPU (1 core, 2x SMT) in my four-year-old Dell netbook, I just measured VLC playing an ogg file at 15% of one half-core.
or encoding/decoding content like running a media server
You said the S word. When a "server" enters the picture, I agree that larger core counts become easier to justify, as background processing begins to dominate. But I'd like to see statistics on how popular PC-based home media servers are in the first place.
You are trying to generalize it for all use cases but not all use cases are the same
I'm trying to find what use cases are most common because economies of scale benefit the most common use cases.
Good luck getting both browser makers and web site publishers to adopt a PNG variant using lz4. The biggest thing that led to PNG adoption in the first place was Unisys's LZW patent assertion. Besides, even after decompression, PNG decoding has a filtering phase where each line depends on the line above it. That can be parallelized by adding unfiltered lines at compression time at the cost of compression ratio.
Generalizing only slightly: a single processor chasing pointers will have a hard time maxing out the DDR throughput, although it will definitely be memory bottlenecked due to latency. Multiple processors all doing the same thing on the same memory will not, as a result, compete for bandwidth. Instead, their requests will execute in turn in the DDR
Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight? To take an analogy between DDR and that other DDR, are the requests like a column of arrows on the screen, where I see each arrow a measure before I have to hit it?
Besides, in a RAM latency-bound situation, there's little benefit of multiple full hardware cores over the virtual cores in a simultaneous multithreading architecture such as Intel's Hyper-Threading Technology or the "modules" that AMD introduced with Bulldozer. Furthermore, keeping all these requests in flight requires some sort of synchronization among threads, which when implemented wrong introduces plenty of locking overhead.
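The pipelining guess above can be put into numbers with Little's law: sustained bandwidth is roughly the number of requests in flight times the bytes per request, divided by the latency. The 64-byte line size, 100 ns latency, and 50 GB/s target below are placeholder assumptions for illustration, not figures for Knights Landing or any particular DDR part.

```python
def sustained_bandwidth(in_flight, bytes_per_request, latency_s):
    """Little's law: throughput = concurrency * request size / latency."""
    return in_flight * bytes_per_request / latency_s


def requests_needed(target_bytes_per_s, bytes_per_request, latency_s):
    """How many requests must be outstanding to sustain a target bandwidth."""
    return target_bytes_per_s * latency_s / bytes_per_request


if __name__ == "__main__":
    LINE_BYTES = 64       # assumed cache line size
    LATENCY_S = 100e-9    # assumed round-trip memory latency
    TARGET = 50e9         # assumed target bandwidth in bytes per second

    one = sustained_bandwidth(1, LINE_BYTES, LATENCY_S)
    print(f"one pointer chaser: {one / 1e9:.2f} GB/s")
    print(f"requests in flight for {TARGET / 1e9:.0f} GB/s: "
          f"{requests_needed(TARGET, LINE_BYTES, LATENCY_S):.0f}")
```

Under these assumptions a single dependent chain of loads gets well under 1 GB/s, so it takes dozens of independent requests in flight to approach the memory system's limit, whether they come from more cores, more SMT threads, or prefetching.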
as to how many NSA backdoors this will feature.
Looks like the intention is to make more efficient supercomputers.
While only tangentially referenced in that article, Knights Landing may be orders of magnitude more power efficient than current supercomputer cores.
XML is known as a key material required to create SMD: Software of Mass Destruction
Informative? Really? Because I have to throw a "Citation needed" here as from what I've seen Silvermont is merely Saltwell with some OoO bolted on to try to fix how long certain macro-ops took to go through the pipeline.
I've checked a dozen articles and NOTHING about the new Atom being based on i7, in fact if true this would reverse almost 30 years of history as Intel has always been VERY protective of its top o' the line chips and sells low end chips highly crippled.
ACs don't waste your time replying, your posts are never seen by me.
I can't wait to see what SGI do with this chip :)
Max.