Intel's Knights Landing — 72 Cores, 3 Teraflops

Imagine by Konster · 2014-01-04 11:14 · Score: 3, Funny

Imagine a Beowulf cluster of these!

Re:Imagine by Anonymous Coward · 2014-01-04 11:26 · Score: 0

Only if this time Intel manages to produce more than one CES sample.
Re:Imagine by Anonymous Coward · 2014-01-04 11:41 · Score: 1

That would be... one unit?
Re:Imagine by imevil · 2014-01-04 11:56 · Score: 1

It wouldn't be very different than the most powerful supercomputer in the world: http://top500.org/system/177999
Re:Imagine by dreamchaser · 2014-01-04 12:03 · Score: 0

But does it run Linux?
Re:Imagine by Anonymous Coward · 2014-01-04 13:32 · Score: 0

I'd rather imagine one screaming "I'll bite your legs off!!".
Re:Imagine by Anonymous Coward · 2014-01-04 19:06 · Score: 3, Funny

Forget about Linux! With this baby, I can finally run Crysis.
Re:Imagine by Anonymous Coward · 2014-01-05 00:35 · Score: 0

A decent cluster would have at least a couple of racks. A few hundred thousand cores or so.
Re:Imagine by Anonymous Coward · 2014-01-05 06:53 · Score: 0

Is that a promise name? Or*s*laughter that it did NOT provide in time? DJB
Re:Imagine by Anonymous Coward · 2014-01-05 06:57 · Score: 0

mesh, really?! OK
Re:Imagine by Anonymous Coward · 2014-01-05 09:40 · Score: 0

It's "very different TO", you stupid American...
How did I know you were American?
Re:Imagine by Anonymous Coward · 2014-01-05 13:14 · Score: 0

It's "very different TO", you stupid American... How did I know you were American?
No, that's "SO different to" You stupid AC (and I think I'll join the fold there..)
Re:Imagine by doccus · 2014-01-05 13:18 · Score: 1

It's "very different TO", you stupid American... How did I know you were American?
No, that's "SO different to" You stupid AC (and I think I'll join the fold there..)
..or "that different from", which is in fact the best correction.. and I'll take the high road here, in opposition to the first AC respondent..
Re:Imagine by Anonymous Coward · 2014-01-05 14:34 · Score: 0

Knight s _Korner_? It was not a knight unless you gave him such advantage erroneously.

Yay more cores that I won't be using much of! by Press2ToContinue · 2014-01-04 11:18 · Score: 0, Troll

Because you can never have too many cores that you aren't using most of the time.

How about more speed? Or is that too hard?

--
Sent from my ENIAC

Re:Yay more cores that I won't be using much of! by Frosty+Piss · 2014-01-04 11:20 · Score: 1, Insightful

Because you can never have too many cores that you aren't using most of the time.
Ask the NSA, they might have a (SECRET) opinion on that.

--
If you want news from today, you have to come back tomorrow.
Re:Yay more cores that I won't be using much of! by Anonymous Coward · 2014-01-04 11:23 · Score: 3, Insightful

Yes, it's too hard. The future is in concurrency. The actor model will probably take off since it's easy to pick up and use.
Re:Yay more cores that I won't be using much of! by icebike · 2014-01-04 11:35 · Score: 2, Insightful

Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Pretty sure it wasn't meant for you (or me).

--
Sig Battery depleted. Reverting to safe mode.
Re:Yay more cores that I won't be using much of! by koan · 2014-01-04 11:38 · Score: 1

Doesn't multi core imply more speed, if not by clock then by efficiency?

--
"If any question why we died, Tell them because our fathers lied."
Re:Yay more cores that I won't be using much of! by godrik · 2014-01-04 11:41 · Score: 1

That's an hpc processor. You are unlikely to deploy that on classical desktop/laptop for a while. Think about it as a classical coprocessor.
Re:Yay more cores that I won't be using much of! by H0p313ss · 2014-01-04 11:49 · Score: 4, Insightful

Because you can never have too many cores that you aren't using most of the time.
How about more speed? Or is that too hard?
Pretty sure it wasn't meant for you (or me).
However, for servers, including hypervisors, it would be very interesting. There are lots of client/server products that scale better with more cores.

--
XML is a known as a key material required to create SMD: Software of Mass Destruction
Re:Yay more cores that I won't be using much of! by Anonymous Coward · 2014-01-04 11:52 · Score: 0

Yes, that really sucks that you will be forced to buy one of these things.
Re:Yay more cores that I won't be using much of! by dreamchaser · 2014-01-04 12:05 · Score: 1

It depends on the use case. There are many applications where this would shine. Sure if you want to play Quake 3 Arena it's not going to give you much at all, but if you're doing parallel processing for scientific or engineering applications this would rock.
Re:Yay more cores that I won't be using much of! by morcego · 2014-01-04 12:59 · Score: 5, Funny

Because you can never have too many cores that you aren't using most of the time.
Install McAfee Antivírus, and problem solved: no more unused cores.

--
morcego
Re:Yay more cores that I won't be using much of! by Entropius · 2014-01-04 15:12 · Score: 1

This isn't intended for you if you can't think of what to do with all those cores.
This is for the high performance physics folks to whom the difference between 16 cores, 256 cores, and maybe even 8192 cores is a line in a config file.
It's also for the folks developing 24 megapixel RAW files (which Nikon's cheapest SLR spits out these days), where splitting the image into 64 sectors is no more difficult than splitting it into four, or for the folks doing video encoding which is pretty trivially parallelizable.
Most of the times that I can think of where I'm truly waiting on my computer to do something that's limited by the number of flops that can be brought to bear, more cores is just as good as more speed.
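To make the "64 sectors is no harder than four" point concrete, here is a minimal sketch in which the sector count is just a parameter. The names split_rows, brighten and process are made up for illustration, the "image" is a plain list of rows, and a thread pool stands in for whatever a real RAW converter would use (in CPython, real CPU-bound work would want processes or native code instead).

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(height, sectors):
    """Return (start, stop) row ranges covering `height` rows in `sectors` chunks."""
    base, extra = divmod(height, sectors)
    ranges, start = [], 0
    for i in range(sectors):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def brighten(rows):
    """Stand-in for per-sector work: add 10 to every pixel, clamped at 255."""
    return [[min(p + 10, 255) for p in row] for row in rows]

def process(image, sectors):
    """Process each sector independently, then stitch the results back in order."""
    ranges = split_rows(len(image), sectors)
    with ThreadPoolExecutor(max_workers=sectors) as pool:
        parts = list(pool.map(brighten, [image[a:b] for a, b in ranges]))
    return [row for part in parts for row in part]

# Hypothetical 64-row test image; the result does not depend on the sector count.
image = [[(x + y) % 256 for x in range(8)] for y in range(64)]
assert process(image, 4) == process(image, 64)
```

Going from 4 to 64 sectors changes one argument, which is the "line in a config file" situation described above.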
Re:Yay more cores that I won't be using much of! by hairyfeet · 2014-01-04 15:53 · Score: 0

But this is just a bunch of Atom cores....who wants that? I've had plenty of Intel Atoms and AMD Bobcats go through the shop and...while they are good at your basic websurfing and the Bobcats make good media tanks thanks to the better GPUs....sigh...I really REALLY wouldn't want to do any heavy lifting with 'em.
Maybe things have changed since i was working SMB but back in my day the servers were doing heavy number crunching, something I wouldn't want to do on an Intel Atom, i don't care how many of them you jam together. Now maybe if you had a Xeon to pass off the heavy jobs to...but in that case an ARM chip to do the light work would make more sense. Sigh, sounds like another "tech demo for shit you'll never see IRL" like their Laredo or whatever it was called.

--
ACs don't waste your time replying, your posts are never seen by me.
Re:Yay more cores that I won't be using much of! by unixisc · 2014-01-04 16:30 · Score: 1

Where are you getting Atom cores from? I read up QPI, which this design will be using, and that is used only w/ Xeons and i7s. So this chip looks pretty much like the successor to Xeons and i7s, and will probably be seen either in servers, or in Mac Pros, but not likely in your average laptop, much less tablet.
Re:Yay more cores that I won't be using much of! by The+Grim+Reefer · 2014-01-04 17:01 · Score: 1

Because you can never have too many cores that you aren't using most of the time.
Sure you can, and 640 is obviously the threshold. Nobody would ever need more than that.
Re:Yay more cores that I won't be using much of! by Krishnoid · 2014-01-04 17:40 · Score: 2

Pretty sure it wasn't meant for you (or me).
Obviously -- 64 cores should be enough for any one person.
Re:Yay more cores that I won't be using much of! by Mr+Z · 2014-01-04 18:09 · Score: 1

Did you miss the part in the article about 512-bit AVX and being able to do 32 double precision floating point operations per clock? Or the other part about running four-way SMT to hide memory system latency? Or the other, other part about 128 byte (1024-bit) L1D to CPU bandwidth?
These ain't plain ol' Atom processors.
For HPC workloads, these seem to be right up the alley of "heavy lifting."
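A quick back-of-the-envelope check of those figures, as a sketch. Only the 512-bit width, the 32 double-precision operations per clock, the 72 cores and the 3 teraflops come from the article and headline; how the 32 is reached per core (for example, two vector units each doing fused multiply-adds) is a guess and not stated here.

```python
VECTOR_BITS = 512
DOUBLE_BITS = 64
lanes = VECTOR_BITS // DOUBLE_BITS           # doubles per 512-bit vector

flops_per_clock = 32                         # per core, figure quoted from the article
cores = 72                                   # from the headline
peak_flops = 3e12                            # "3 teraflops" from the headline

# How many lane-wide operations per clock the 32 figure implies.
vector_ops_per_clock = flops_per_clock / lanes

# Clock rate needed for 72 cores at 32 flops/clock to reach 3 TFLOPS.
implied_clock_hz = peak_flops / (cores * flops_per_clock)

print(f"{lanes} lanes per vector, {vector_ops_per_clock:.0f} lane-wide ops per clock")
print(f"implied clock: {implied_clock_hz / 1e9:.2f} GHz")
```

So the headline number only needs a clock of roughly 1.3 GHz, which is plausible for a low-power core.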

--
Program Intellivision!
Re:Yay more cores that I won't be using much of! by smash · 2014-01-04 18:59 · Score: 1

For GPUs, until we have one core per pixel for ray-tracing, we're nowhere near the number of cores we could use without even trying too hard.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Yay more cores that I won't be using much of! by Guy+Harris · 2014-01-04 19:57 · Score: 3, Informative

Where are you getting Atom cores from?
From this Extremetech article, which has a slide speaking of the Knights Landing processor architecture having "up to 72 Intel Architecture cores based on Silvermont (Intel(R) Atom processor)"?
Re:Yay more cores that I won't be using much of! by lagomorpha2 · 2014-01-05 03:14 · Score: 1

Couple things:
1) The 22nm Silvermont Atom cores are a complete redesign over the badly aging atom cores. They are much more powerful, and much more powerful per watt.
2) These chips aren't going to replace CPUs, they are most likely going to compete with Nvidia Tesla - a PCIe card that highly parallel workloads can be offloaded to. One CUDA core isn't very powerful but stick 2688 active ones on a chip and for certain tasks you have a lot of power. The K20X Tesla is capable of 1.3 trillion double-precision FLOPS so if Knights Landing can actually do 3 trillion it's no weak chip.
Re:Yay more cores that I won't be using much of! by HiThere · 2014-01-05 07:11 · Score: 1

Actually, if the price is right, it *is* meant for a project I'm working on. I don't need it this year, so it may actually be a possibility.
"Not this August ... but the year after that, or the year after that..."
(Well that's just free association. I left out the context of application, because it didn't apply.)

--

I think we've pushed this "anyone can grow up to be president" thing too far.
Re:Yay more cores that I won't be using much of! by Bengie · 2014-01-05 07:33 · Score: 3, Informative

You're both correct. The original Atom cpu was built separately and started before the i7 arch. The new Silvermont "Atom" is based a lot on the i7 arch. It is a huge upgrade to the Atom line. It's like the original i7 fine tuned for power and running on 22nm. Very strong OoO pipeline design. The low power usage is great for a many core design because efficiency is more important than single-threaded performance.
Re:Yay more cores that I won't be using much of! by exomondo · 2014-01-05 11:19 · Score: 1

But this is just a bunch of Atom cores....who wants that?
Saying they are "Atom cores" is meaningless, in this case they are Silvermont Atom cores which are based on the current i7 CPU architecture.

I really REALLY wouldn't want to do any heavy lifting with 'em.
And why wouldn't you want to do any "heavy lifting" with a package that puts 72 of them together to provide 3 teraflops of computing power?
Re:Yay more cores that I won't be using much of! by im_thatoneguy · 2014-01-05 14:32 · Score: 1

Quake 3 raytraced though would shine.
Re:Yay more cores that I won't be using much of! by H0p313ss · 2014-01-06 02:41 · Score: 1

Looks like the intention is to make more efficient supercomputers.
While only tangentially referenced in that article, Knights Landing may be orders of magnitude more power efficient than current supercomputer cores.

--
XML is a known as a key material required to create SMD: Software of Mass Destruction
Re:Yay more cores that I won't be using much of! by Anonymous Coward · 2014-01-06 16:50 · Score: 0

"Very strong OoO pipeline design" is a little overstating it when the core uses in-order execution in places where it is the right thing to do for power reasons.
Re:Yay more cores that I won't be using much of! by hairyfeet · 2014-01-06 22:56 · Score: 1

Informative? Really? Because I have to throw a "Citation needed" here as from what I've seen Silvermont is merely Saltwell with some OoO bolted on to try to fix how long certain macro-ops took to go through the pipeline.
I've checked a dozen articles and NOTHING about the new Atom being based on i7, in fact if true this would reverse almost 30 years of history as Intel has always been VERY protective of its top o' the line chips and sells low end chips highly crippled.

--
ACs don't waste your time replying, your posts are never seen by me.

No it cannot compete with nVidia and AMD/ATI by Rockoon · 2014-01-04 11:19 · Score: 1

Summary asks:

Will Intel use this as a platform to compete with nVidia and AMD/ATI on graphics?

...but first it says it has 16GB of eDRAM. The 128MB of eDRAM in their "Iris Pro" adds almost $200 to the price tag.

This chip is going to cost MANY THOUSANDS OF DOLLARS.

--
"His name was James Damore."

Re:No it cannot compete with nVidia and AMD/ATI by rsmith-mac · 2014-01-04 12:04 · Score: 5, Informative

"eDRAM" in this article is almost certainly an error for that reason.
eDRAM isn't very well defined, but it basically boils down to "DRAM manufactured on a modified logic process," allowing it to be placed on-die alongside logic, or at the very least built using the same tools if you're a logic house (Intel, TSMC, etc). This is as opposed to traditional DRAM, which is made on dedicated processes that are optimized for space (capacitors) and follow their own development cadence.
The article notes that this is on-package as opposed to on-die memory, which under most circumstances would mean regular DRAM would work just fine. The biggest example of on-package RAM would be SoCs, where the DRAM is regularly placed in the same package for size/convenience and then wire-bonded to the processor die (although alternative connections do exist). Conversely eDRAM is almost exclusively used on-die with logic - this being its designed use - chiefly as a higher density/lower performance alternative to SRAM. You can do off-die eDRAM, which is what Intel does for Crystalwell, but that's almost entirely down to Intel using spare fab capacity and keeping production in house (they don't make DRAM) as opposed to technical requirements. Which is why you don't see off-die eDRAM regularly used.
Or to put it bluntly, just because DRAM is on-package doesn't mean it's eDRAM. There are further qualifications to making it eDRAM than moving the DRAM die closer to the CPU.
But ultimately as you note cost would be an issue. Even taking into account process advantages between now and the Knight's Landing launch, 16GB of eDRAM would be huge. Mind bogglingly huge. Many thousands of square millimeters huge. Based on space constraints alone it can't be eDRAM; it has to be DRAM to make that aspect work, and even then 16GB of DRAM wouldn't be small.
Re:No it cannot compete with nVidia and AMD/ATI by im_thatoneguy · 2014-01-04 12:28 · Score: 2

An Nvidia Quadro card costs $8,000 for an 8GB card. I would consider $8,000 "many thousands of dollars". Nobody is suggesting Knights ____ is competing with any consumer chips CPU or GPU. I have a $1,500 Raytracing card in my system along with a $1,000 GPU as well as a $1,000 CPU. If this could replace the CPU and GPU but compete with a dual CPU system for rendering performance I would be a happy camper even if it cost $3-4k.
Re:No it cannot compete with nVidia and AMD/ATI by Anonymous Coward · 2014-01-04 13:50 · Score: 3, Informative

It may not be eDRAM, but I'm not sure what else Intel would easily package with the chip. We know the 128 MB of eDRAM on 22 nm is ~80 mm^2 of silicon, currently Intel is selling ~100 mm^2 of N-1 node silicon for ~$10 or less (See all the ultra cheap 32 nm clover trail+ tablets where they're winning sockets against allwinner, rockchip, etc., indicating that they must be selling them for equivalent or better prices than these companies.) By the time this product comes out 22 nm will be the N-1 node. In addition, a dedicated eDRAM chip is probably cheaper than a typical SoC/logic chip due to the smaller number of metal levels that are needed. Assuming N-1 node prices hold for a given area of silicon, 16 GB will need 12000 mm^2 of silicon (likely less as the current 128 MB die likely uses a not insignificant area for readout circuitry and PHY interface), coming out to around $1200. Add an extra $1000 for your actual processor and you have the current price of a low end Xeon Phi.
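The scaling in this estimate can be reproduced in a few lines. This is only a sketch using the round numbers quoted above (~80 mm^2 per 128 MB, ~$10 per ~100 mm^2); straight multiplication lands a little under the 12000 mm^2 and $1200 figures, which appear to be rounded up.

```python
EDRAM_MB = 128              # capacity of the existing eDRAM die
EDRAM_AREA_MM2 = 80         # "~80 mm^2" quoted above
PRICE_PER_100MM2 = 10       # "~$10 or less" per ~100 mm^2 quoted above
TARGET_GB = 16

dies = TARGET_GB * 1024 // EDRAM_MB             # equivalent number of 128 MB dies
area_mm2 = dies * EDRAM_AREA_MM2                # total silicon area
cost_usd = area_mm2 / 100 * PRICE_PER_100MM2    # at N-1 node pricing
side_mm = area_mm2 ** 0.5                       # edge length if it were one square die

print(f"{dies} dies' worth, {area_mm2} mm^2, about ${cost_usd:.0f}")
print(f"as a single square: {side_mm:.0f} mm on a side")
```

With these inputs that is 128 dies' worth, 10240 mm^2 and about $1024, i.e. the same ballpark as the numbers in the comment.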
Re:No it cannot compete with nVidia and AMD/ATI by Anonymous Coward · 2014-01-04 16:28 · Score: 0

16 GB will need 12000 mm^2 of silicon
Nice theory.
But there is no way Intel is making a 10.9x10.9cm (4.3x4.3 inches) chip. The yield would be terrible unless the whole thing is redundant, not to mention the thermal and bonding issues with a chip that large (or even fitting it in a case).
More likely it is stacked silicon with TSVs on top or beside the CPU die in a MCM or wafer-bonded (2.5D) package (like Xilinx flipchip on metal carrier microbumping process). Either that or they've figured out how to get the density down significantly on eDRAM.
Re:No it cannot compete with nVidia and AMD/ATI by EngineeringStudent · 2014-01-04 16:32 · Score: 1

Disclaimer: these are my personal thoughts, I'm not qualified to have an actual opinion, and even if I did this wouldn't be it.
Shared interposer can be very good for memory. It is likely less good than shared die, but much better than shared on the motherboard through a bus. A typical die can have 30k solderballs on it. It goes through the interposer and a few thousand are exposed to the motherboard. This means the die talks to itself through the interposer. This also leads to a suggestion that a shared-interposer memory could be close in performance to on-die memory. It doesn't have to run through the interposer-motherboard solderballs, motherboard routing, or time-shares of an on-motherboard bus. It might even not be exposed to the motherboard - on interposer dedicated bandwidth could be extensively optimized.
This is, of course, conjecture on my part. Personally I like the idea of a (very good sized) cluster, optimized, sharing a single piece of silicon. I like that someone else worked out the kinks, and it will run pretty-much plug and play with my existing compiler/development environment. Personally, I think this thing looks like it has warp drive. :D

Programmability? by gentryx · 2014-01-04 11:20 · Score: 4, Informative

I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke: to get any serious performance out of the current generation of MICs you have to wrestle with vector intrinsics and that stupid in-order architecture. At least the latter will apparently be dropped in Knights Landing.

For what it's worth: I'll be looking forward to NVIDIA's Maxwell. At least CUDA got the vectorization problem sorted out. And no: not even the Intel compiler handles vectorization well.

--
Computer simulation made easy -- LibGeoDecomp

Re:Programmability? by godrik · 2014-01-04 11:46 · Score: 1

Actually the in-order execution isn't so much of a problem in my experience. The vectorization is a real problem. But you essentially have the same problem on GPUs, except it is hidden in the programming model. The performance problems are there as well.
Anybody that understands gpu architecture enough to write efficient code there won't have much problem using the mic architecture. The programming model is different but the key difficulties are essentially the same. If you think about a mic simd element as a CUDA thread, the programming differences are mostly syntactical.
Re:Programmability? by imevil · 2014-01-04 11:59 · Score: 1

I wonder how nice these will be to program. The "just recompile and run" promise for Knights Corner was little more than a cruel joke
I tried recompiling and running some OpenCL code (that previously was running on GPUs). It was "just recompile and run" and the promises about performances were kept. But still, OpenCL is not what most people consider "nice to program".
Re:Programmability? by Anonymous Coward · 2014-01-04 13:16 · Score: 1

nVidia didn't sort out any vectorization problem, they dodged it completely.
They use a kind of super-scalar architecture (that they term SIMT) - many many (warp size is 32 for most consumer cards) dumb ALUs sharing a single instruction scheduler, and half as many FPUs as ALUs per warp - very much like AMD's current CPU architecture but with far more ALU/FPUs to schedulers - and a slightly different approach to VLIW superscalar architectures.
The downside with superscalar architectures though is of course, branching... if 'threads' within a superscalar architecture hit a divergent branch (eg: some go into if(a){} others go into else{}) you end up evaluating both branches sequentially (ouch!).
So while there's no vectorization problem, there certainly are others. Although I admit, I find instruction/branch divergence and gather/scatter coalescing to be easier to deal with mentally than manually dealing with alignment, swizzling/shuffling, etc.
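A toy model of the divergence cost, for anyone who hasn't hit it. This is a sketch, not how any real GPU is implemented: run_warp is a made-up name, a Python list plays the role of the lanes, and the branch bodies (halve if even, 3n+1 if odd) are arbitrary.

```python
def run_warp(values):
    """Simulate one warp on an if/else: a branch side is issued once if ANY lane needs it."""
    mask = [v % 2 == 0 for v in values]        # per-lane predicate for if(a)
    out = list(values)
    passes = 0
    if any(mask):                              # "if" side, lanes with mask False sit idle
        passes += 1
        out = [v // 2 if m else o for v, m, o in zip(values, mask, out)]
    if not all(mask):                          # "else" side, lanes with mask True sit idle
        passes += 1
        out = [o if m else 3 * v + 1 for v, m, o in zip(values, mask, out)]
    return out, passes

print(run_warp([2, 4, 6, 8]))   # every lane agrees: one pass
print(run_warp([1, 2, 3, 4]))   # lanes disagree: both sides run back to back
```

When every lane takes the same side only one pass is issued; as soon as a single lane disagrees, the whole warp pays for both sides.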
Re:Programmability? by PhrostyMcByte · 2014-01-04 13:42 · Score: 2

Intel's AVX-512 is really friggin cool, and a huge departure from their SIMD of the past. It adds some important features -- most notably mask registers to optimally support complex branching -- which make it nearly identical to GPU coding so that compilers will have a dramatically easier time targeting it. I doubt it will kill discrete GPUs any time soon, but it's a big step in that long-term direction.
Re:Programmability? by Arakageeta · 2014-01-04 13:47 · Score: 2

It's not entirely syntactical. Local shared memory is exposed to the CUDA programmer (e.g., __syncthreads()). CUDA programmers also have to be mindful of register pressure and the L1 cache. These issues directly affect the algorithms used by CUDA programmers. CUDA programmers have control over very fast local memory---I believe that this level of control is missing from MIC's available programming models. Being closer to the metal usually means a harder time programming, but higher performance potential. However, I believe NVIDIA has made CUDA pretty programmer friendly, given the architectural constraints. I'd like to hear the opinions of MIC programmers, since I have no direct experience with MIC.
Re:Programmability? by godrik · 2014-01-04 14:07 · Score: 2

I don't understand. Mic is your regular cache based architecture. Accessing L1 cache in mic is very fast (3 cycle latency if my memory is correct). You have similar register constraints on mic with 32 512-bit vectors per thread(core maybe). Both architectures overlap memory latency by using hardware threading.
I programmed both mic and gpu, mainly on sparse algebra and graph kernels. And quite frankly there are differences but i find them much more alike than most people acknowledge. The main difference in my opinion being the programming model where gpus are used with millions of threads while mic is better used with fewer threads and more of a work pool. Atomics are really fast in gpus and not so fast in mic. But you also have much finer thread synchronization opportunities in mic which somewhat remove the interest of fast atomics.
Re:Programmability? by KonoWatakushi · 2014-01-04 22:37 · Score: 2

The recently revealed Mill architecture is far more interesting, and also offers a much more attractive programming model. It is a highly orthogonal architecture naturally capable of wide MIMD and SIMD. Vectorization and software pipelining of loops is discussed in the "metadata" talk, and is very clever and elegant. Those who have personally experienced the tedium of typical vector extensions will appreciate it all the more.
Based on sim, the creators expect an order of magnitude improvement of performance/W/$ over conventional architectures. (That being a very conservative estimate, with the goal being DSP like efficiency on general purpose code.) How they propose to achieve that is fascinating, but I'm even more excited about the potential impact on software development. The architecture described will vastly simplify the OS and compilers, and remove or greatly reduce a number of typical inefficiencies.
Knight's Landing leaves me with the usual impression of Intel using brute force and process superiority to retain the edge, and the Mill may offer enough of an architectural improvement to finally put an end to that. It would still be a long road, but it is a nice thought.
Re:Programmability? by gentryx · 2014-01-05 00:28 · Score: 1

Yeah, OpenCL is a different thing. But if you talk to laymen, they will often repeat the marketing speak that you take your OpenMP(!) code written for traditional multi-cores, recompile and enjoy... Not true, in my experience.

--
Computer simulation made easy -- LibGeoDecomp
Re:Programmability? by Anonymous Coward · 2014-01-05 00:34 · Score: 1

That IS vectorisation. CUDA just shifts the burden back to the programmer by requiring he write SPMD code to represent the vector operations. That is more convenient than writing vector intrinsics, though, and scales better. It's really just a wide vector unit like a beefed up SSE unit, though (beefed up because it deals better with alignment, gather, scatter and so on), it's not really "a lot of little cores". It's definitely not VLIW, and even when AMD's GPUs were VLIW it was a 4- or 5-way instruction issue of 64-way vector instructions, the vector unit was still quite clearly there.
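The difference in who carries the burden can be shown without any GPU at all. A rough sketch in plain Python follows: axpy_kernel, launch and axpy_vector are invented names, the loop in launch merely stands in for hardware running threads in lockstep groups, and the width of 8 is an arbitrary stand-in for a vector register.

```python
def axpy_kernel(tid, a, x, y, out):
    """SPMD style: written for ONE thread; the launcher runs it for every tid."""
    if tid < len(x):                       # bounds guard, as in a typical CUDA kernel
        out[tid] = a * x[tid] + y[tid]

def launch(kernel, n_threads, *args):
    """Stand-in for a kernel launch over n_threads thread ids."""
    for tid in range(n_threads):
        kernel(tid, *args)

def axpy_vector(a, x, y, width=8):
    """Intrinsics style: the programmer walks the data one vector at a time."""
    out = []
    for i in range(0, len(x), width):
        xs, ys = x[i:i + width], y[i:i + width]    # one "vector load" each
        out.extend(a * xv + yv for xv, yv in zip(xs, ys))
    return out

x = list(range(10))
y = [1] * 10
out = [0] * 10
launch(axpy_kernel, 16, 2, x, y, out)      # 16 threads for 10 elements, guard handles the rest
assert out == axpy_vector(2, x, y)
```

Both versions compute the same thing; in the first the vector shape is implied by the launch, in the second it is spelled out in the loop.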
Re:Programmability? by Anonymous Coward · 2014-01-06 11:35 · Score: 0

In CUDA, code is written at the thread-level (single program multiple data). With Intel, you're essentially coding at the warp or block level (single instruction multiple data). The models are pretty different, no?
In CUDA, barring any low-level hacks, you can only synchronize threads at the block level. This is because threads only have a coherent view of memory if they share a streaming multiprocessor (SM). This is essentially because they share an L1. The L1s are incoherent among the SMs. Contrast this to MIC, where all caches are snooped. This is easier to program. However, all that snooping costs both silicon and cycles. Because of these trade-offs, I suspect that the MIC will be easier to program but offer less peak performance.
You hinted at these costs with respect to slower atomics on MIC. I'd like to hear more of what you mean by finer synchronization opportunities lessening the need for fast atomics. This seems counter-intuitive. What synchronization methods are faster than an atomic test-and-set?

Not going to work by faragon · 2014-01-04 11:27 · Score: 1

In my opinion, the point of using x86 in order to reuse units from desktop/server CPUs is the base of these experiments. The counterpart is to deal with the x86-mess everywhere. This seems a desperate reaction to AMD's CPU+GPGPU, which also has drawbacks. I bet that both Intel and AMD prefer to keep memory controller as simple as possible, having a comfortable long-run, without burning their ships too early. E.g. a CPU+GPGPU in the same die, with 8 x 128 bit separate memory controllers configured as NUMA (i.e. without channel interleaving/bonding) would be much better, but it would imply expensive chips, motherboards, and more DRAM chips. So I bet we'll have same-die CPU+GPU plus simple memory controller (even with embedded RAM in 3D package) for the next 20 years (consumer-grade products).

Re:Not going to work by cnettel · 2014-01-04 11:32 · Score: 1

20 years? I would be very doubtful regarding any prediction beyond the point where current process scaling trends finally break. Note, they might break the other way. Switching to a non-silicon material might allow higher frequencies which will again shift the tradeoff between locality, energy, and production cost. But there is no reason, no reason at all, to expect the current style to last for more than ten years, while you could be quite right that it could stay much the same for the next five years or so.
Re:Not going to work by viperidaenz · 2014-01-04 15:44 · Score: 1

8 128bit memory controllers? 1024 pins just for the memory bus? you've got to be kidding.
Re:Not going to work by Anonymous Coward · 2014-01-04 16:38 · Score: 1

DRAM is going serial. I doubt DDR4 will ship, or at least in any real quantity. HMC and other serial DRAM technologies wipe the floor with it, and DDR4 is not even in the market yet. 1024 signals (2048 for DS/CML + redundant links) across a wafer carrier to 8 HPCMOS transceivers sounds practical. It's not even far removed from what Oracle is already doing with SPARC M6 and IBM is doing with POWER8 (hang on, POWER8 ALREADY HAS* 8x128bit memory controllers per MCM).
*Hasn't shipped yet, but announced at Hotchips last year.

Calm down by symbolset · 2014-01-04 11:34 · Score: 1

You aren't ever going to see this at Newegg.

--
Help stamp out iliturcy.

QPI? by Joe_Dragon · 2014-01-04 11:35 · Score: 1

Too bad most Intel cpus don't have it and just about all 2011 boards don't use it. The ones that do use it for dual cpu.

Too bad apple mac pro does not have this and is not likely to use any time soon.

Bitcoin/Litecoin Performance by Anonymous Coward · 2014-01-04 11:37 · Score: 1

Will this be any better on Bitcoin/Litecoin mining than anything else?

Re:Bitcoin/Litecoin Performance by Anonymous Coward · 2014-01-04 11:54 · Score: 0

Nope. Might be nice for some of the CPU intensive Altcoins, but not BTC or LTC.
Re:Bitcoin/Litecoin Performance by Anonymous Coward · 2014-01-04 12:40 · Score: 0

LTC can be ASIC mined just like BTC. It's only a matter of time. scrypt, like cryptocurrency in general relies on technology to prevent "attack". Yeah, that's like saying "hack in to my computer and you can have all the money in the world". Someone, somewhere, will do it because it's worth it. Cryptocurrencies are flawed from the start.
Re:Bitcoin/Litecoin Performance by Jeremy+Erwin · 2014-01-04 14:10 · Score: 1

Btc and ltc are best run on ASICs or perhaps AMD GPUs
btc hardware
ltc hardware
Perhaps this chip will change things, but for now, cpu mining is pretty inefficient
Re:Bitcoin/Litecoin Performance by InvalidError · 2014-01-04 15:01 · Score: 4, Interesting

BitCoin has ASIC miners with ~10X the mining power per watt than most programmable alternatives such as GPGPU and FPGA. Anything less efficient than that is or soon will become cost-prohibitive to run.
The newer Bitcoin alternatives use memory-bound algorithms to prevent such a steep mining power escalation since memory capacity and bandwidth scale much more slowly than processing power but much more quickly on costs: with Bitcoin, increasing throughput by 10X simply required 10X the processing power but with the memory-bound alternatives, you also need 10X the RAM and 10X the memory bandwidth.
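To put rough numbers on the memory side, here is a sketch for scrypt, the memory-hard function mentioned earlier in this thread for LTC. The parameters N=1024, r=1 are the ones commonly cited for Litecoin and are an assumption here, not something stated in the thread. The 128*N*r working-set size comes from the scrypt definition, and the model ignores time-memory trade-offs that a real miner might exploit.

```python
def scrypt_scratch_bytes(n, r):
    """Approximate scrypt working set: N blocks of 128*r bytes each."""
    return 128 * n * r

def memory_for(parallel_hashes, n=1024, r=1):
    """Scratch memory needed to keep `parallel_hashes` hashes in flight at once."""
    return parallel_hashes * scrypt_scratch_bytes(n, r)

per_hash = scrypt_scratch_bytes(1024, 1)      # assumed Litecoin-style parameters
base = memory_for(1_000)
scaled = memory_for(10_000)                   # 10X the hashes in flight

print(f"{per_hash // 1024} KiB per hash in flight")
print(f"1,000 in flight: {base / 2**20:.0f} MiB, 10,000 in flight: {scaled / 2**20:.0f} MiB")
assert scaled == 10 * base                    # memory scales linearly with throughput
```

A SHA-256 pipeline needs essentially no scratch memory per hash, which is why Bitcoin ASICs could scale on compute alone.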
Re:Bitcoin/Litecoin Performance by dbIII · 2014-01-04 15:45 · Score: 1

Just do what the other people at the top end of the bitcoin scam are doing - infect PCs with malware and let those do the mining for you.

Any day now I'm expecting media reports about how there is this nefarious web site called slashdot that is the hub of the bitcoin scam. Let's not have that please.
Re:Bitcoin/Litecoin Performance by blackraven14250 · 2014-01-04 18:18 · Score: 1

Those Nvidia GPU numbers are outdated for CUDAMiner. There's been a substantial speedup for newer architectures recently - a 780, for example, can typically run substantially over 500 khash. At 250 watt TDP, that puts it in the ballpark of AMD cards for KHash per watt, even though the hardware investment per khash is substantially higher. It means that people who were buying one of the Nvidia cards anyway will still be on the profitable side of things for as long as ATI will be, but you wouldn't want to build mining rigs with Nvidia cards.

Form Factor by Anonymous Coward · 2014-01-04 11:38 · Score: 0

Does this by any chance look like either the chip from the Terminator films, or a Borg Cube?

I mean, they've been working on 3d chips for years, right?

Requires parallelism by tepples · 2014-01-04 11:42 · Score: 5, Informative

Multicore implies more speed only if your process is parallelized. Not all interactive processes on a single-user computer can be, wrote Amdahl.

Re: Requires parallelism by Anonymous Coward · 2014-01-04 11:55 · Score: 0

Good to know, I went from a 3 GHz dual to quad at the same frequency and it was faster at rendering video and 3D.
Re:Requires parallelism by Max+Threshold · 2014-01-04 12:17 · Score: 1

More to the point, even if they can be, there's no guarantee that they are. Most existing desktop software won't benefit much from multiple cores.
Re:Requires parallelism by Morpf · 2014-01-04 12:19 · Score: 2

I think you'd be surprised how many real world day to day tasks can be and are parallelized: almost everything concerning audio and video (images or movies), searching, analyzing, rendering web pages, compiling, computing physics and AI for games.
I can't think of one computing intensive day to day action that is not parallelized or couldn't easily be.
Re:Requires parallelism by Anonymous Coward · 2014-01-04 14:54 · Score: 0

Parsing cannot efficiently be done in parallel.
Re:Requires parallelism by Entropius · 2014-01-04 15:25 · Score: 1

Are there really that many interactive processes on a single-user computer that are
1) CPU-bound
2) not parallelizable
3) take long enough that waiting on them gets annoying?
I ask out of genuine curiosity; I can't think of many times when I wind up waiting on my computer to do anything that fits.
Re:Requires parallelism by Entropius · 2014-01-04 15:26 · Score: 1

edit: Compiling is one, definitely. Forgot about that.
Re:Requires parallelism by dbIII · 2014-01-04 15:26 · Score: 1

However there are plenty that are. Geophysics, biochemistry, engineering and even editing home movies.
I'd love some of these if they come off with better price/performance than an AMD system or even if they just beat it a lot on performance without being ten times the cost (sad state of the very top end of Xeons now).
Re:Requires parallelism by tepples · 2014-01-04 16:02 · Score: 1

If you have one HTML document, another HTML document in an IFRAME, four CSS style sheets, and four JavaScript programs, you can parse each on a core.
Re:Requires parallelism by Darinbob · 2014-01-04 16:28 · Score: 1

That many cores implies a big fat pipe to memory as well. Sure they have local cache but memory is going to be the bottleneck here even with parallelized computation.
Re:Requires parallelism by SuricouRaven · 2014-01-04 18:39 · Score: 1

This isn't for general-purpose use. See those floating-point specs? Those tell you exactly where this is going, because there is one class of user that just can't get enough floating point performance. Scientific HPC. Protein folding, molecular biology modeling, cosmological simulations, higher resolution seismic analysis, neural network simulation, quantum system modeling. All things that thrive on processing power. A chip like this could have a lot of scientific applications.
Re:Requires parallelism by smash · 2014-01-04 19:37 · Score: 1

Not only that, if this goes in a server, it is going to be handling multiple tasks for multiple users so more threads are an easy win.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Requires parallelism by smash · 2014-01-04 19:39 · Score: 1

Encryption, video transcoding and other compression, compilation (inc. just in time) and other stuff we don't do yet because it is not really feasible (e.g., real time ray-tracing, etc.).

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Requires parallelism by HiThere · 2014-01-05 07:27 · Score: 1

I'm guessing that they see the main market as being GM, Toyota, and Ford. I'm not saying they aren't eager for all the other uses, of course.

--

I think we've pushed this "anyone can grow up to be president" thing too far.
Re:Requires parallelism by Anonymous Coward · 2014-01-05 13:45 · Score: 0

Don't forget: Run ALL the unit tests. That's a problem that this chip could be highly suited to.
I've worked on projects with thousands of unit tests/integration tests/performance tests that can take hours to run. Also currently devs spend time making their unit tests run quickly, because we have to wait for them to run every 20-30 seconds. The faster your processor is, the faster your feedback loop. If you've got 2000 tests to run and it takes even 5 seconds to crunch through them all, human nature is going to either limit the test suite to only run a subset, or only run the tests at longer intervals than is optimal.
And that folks is how you continue to sell more cores to ITO's. And the best part is that it can scale up to one core per test, or more for parallel code under test. I'd say we can keep recycling this post well into the tens of thousands of cores range.

Yea, but can it run Crysis? by Anonymous Coward · 2014-01-04 11:58 · Score: 0

How long before it's a tiny little square I can drop into my mobo's cpu slot?

Amdahl's Law by Anonymous Coward · 2014-01-04 12:07 · Score: 0

Actually Amdahl described the theoretical speed-up given the fraction of a process that can be parallelized and the number of processors. Let the latter go towards infinity and you get the maximum theoretical speed-up / minimum theoretical run-time.
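
For anyone who wants to plug in numbers, here is a minimal Python sketch of the usual statement of Amdahl's law, speedup = 1 / ((1 - p) + p / n), with p the parallelizable fraction and n the number of processors. The function names and the 95% figure are illustrative assumptions, not anything from the article.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Theoretical speed-up for parallel fraction p (0..1) on n processors."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound on the speed-up as n goes to infinity: 1 / (1 - p)."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Assumed example: a job that is 95% parallelizable.
    for cores in (1, 4, 72, 1000):
        print(cores, "cores:", round(amdahl_speedup(0.95, cores), 2))
    print("limit:", round(amdahl_limit(0.95), 2))
```

The limit function is the "let the number of processors go towards infinity" case: the serial fraction alone caps the speed-up.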

Re:Amdahl's Law by tepples · 2014-01-04 12:16 · Score: 1

In practice, the percentage of a process on a single-user system that can be parallelized is rarely 100 percent. If one holds the performance of a core constant, even a 1000 core system will still run as slowly as a 1 core system on the fraction that cannot.
Re:Amdahl's Law by sjames · 2014-01-04 13:35 · Score: 2

Keep in mind, Amdahl's law can be expanded to all processes that make up a system. Even if you are using a single process program, it can benefit from not having to share its core with the various system processes.
If the program uses async I/O, that counts as parallelism.

Unobtainium by Anonymous Coward · 2014-01-04 12:12 · Score: 3, Insightful

This is another one of those IBM things made from the most rare element in the universe: unobtainium. You can't get it here. You can't get it there either. At one point I would have argued otherwise, but no. CUDA cores I can get. This crap I can't get. It's just like the Cell Broadband engine. Remember that? If you bought a PS3, then it had a (slightly crippled) one of those in it. Except that it had no branch prediction. And one of the main cores was disabled. And you couldn't do anything with the integrated graphics. And if you wanted to actually use the co-processor functions, you had to re-write your applications. And you needed to let IBM drill into your teeth and then do a rectal probe before you could get any of the software to make it work. And it only had 256MB of RAM. And you couldn't upgrade or expand that. With IBM's new wonder, we get the promise of 72 cores. If you have a dual-xeon processor. And give IBM a million dollars. And you sign a bunch of papers letting them hook up the high voltage rectal probes. Or you could buy a Kepler NVIDIA card which you can install into the system you already own, and it costs about the same as a half-decent monitor. And NVIDIA's software is publicly downloadable. So is this useful to me or 99.999% of the people on /.? No. It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B.

Re:Unobtainium by Guy+Harris · 2014-01-04 12:42 · Score: 1

This is another one of those IBM things made from the most rare element in the universe: unobtainium
Presumably meaning "this is like those IBM things", given that, while the first word of the title begins with "I", it doesn't have "B" or "M" following it, it has "n", "t", "e", and "l", instead.
Re:Unobtainium by im_thatoneguy · 2014-01-04 14:34 · Score: 1

This is x86. Theoretically your program already runs on this. You don't have to rewrite your entire application to run on CUDA.
Re:Unobtainium by radarskiy · 2014-01-04 15:22 · Score: 2

"It's news for nerds, but only four guys can afford it: Bill G., Mark Z., Larry P. and Sergey B."
I would rather have that market than all of the rest.
Re:Unobtainium by aliquis · 2014-01-04 19:17 · Score: 1

It's Intel.
Re:Unobtainium by Anonymous Coward · 2014-01-05 08:19 · Score: 0

He forgot the salient advice from the New York historical archives: don't cross the streams, at least without locking first.
Re:Unobtainium by Arakageeta · 2014-01-05 15:51 · Score: 1

What made Cell a nightmare to program for was the SPU's local store. The local store is great for performance, but a pain to program since the programmer had to explicitly move data back and forth between main memory and the local store (hardware designers back then all thought compilers could solve their problems for them--see Itanium). MIC is cache coherent. All memory references are snooped on the bus(es). MIC programmers don't have to worry about what's loaded in memory and what is not. An instruction merely has to dereference a memory address, and the MIC hardware will be happy to go fetch the needed data for you, automagically. It was not so with Cell.
Re:Unobtainium by Anonymous Coward · 2014-01-09 09:45 · Score: 0

Imagine... if I could just go into my email account and SEARCH for Knight's Landing and HAVE THE SLASHDOT EMAIL THAT BROUGHT ME HERE SHOW UP???!!!! I kept the page open, otherwise... the Slashdot news email was no longer there and this page would become just a figment of my imagination. It is the problem with actually NEEDING KNIGHTS TO LAND IN NEW YORK CITY to bring the place back to the XXI century and the USA that these eastern egg promise titles just, well, vanish with less noise than 9/11 did.
Re:Unobtainium by kry73n · 2014-01-10 00:20 · Score: 1

hardware designers back then all thought compilers could solve their problems for them--see Itanium
Can you provide a source for this claim? Who thought this can be solved by compilers and what was their take on this?

Embarrassingly parallel by tepples · 2014-01-04 12:20 · Score: 4, Informative

You saw a speed-up because video and 3D are in a class of problems that are very easy to parallelize. So is decompressing all the images in an HTML document. Laying out the document, on the other hand, isn't so easy to parallelize, if only because every floating box theoretically affects all the boxes that follow it.

Re:Embarrassingly parallel by gnasher719 · 2014-01-05 10:22 · Score: 1

You saw a speed-up because video and 3D are in a class of problems that are very easy to parallelize [wikipedia.org]. So is decompressing all the images in an HTML document. Laying out the document, on the other hand, isn't so easy to parallelize, if only because every floating box theoretically affects all the boxes that follow it.
But there is a lot of work to be done in parallel for every box before you can actually start the layout. Get the text, get the fonts, do all kinds of height and width calculations, find possible break points and so on.
Re:Embarrassingly parallel by tepples · 2014-01-05 16:05 · Score: 1

But there is a lot of work to be done in parallel for every box before you can actually start the layout. Get the text, get the fonts, do all kinds of height and width calculations, find possible break points and so on.
Height and width calculations depend on the width of surrounding boxes and the height and width of any floating elements that happen to precede a particular element. Even the possible break points depend on the floating elements, as hyphenation needs to be more aggressive on text lines that have been narrowed by floating elements.

mhmm by Anonymous Coward · 2014-01-04 12:21 · Score: 0

implies it, sure.

fuck crysis by Anonymous Coward · 2014-01-04 12:28 · Score: 0

but a stacked CPU cube built only for watercooling (coolant would travel throughout the cube) would be damn spiffy.

read IBM was pondering somewhere along these lines a while back.

Bigger than the town by Anonymous Coward · 2014-01-04 12:30 · Score: 0

That chip is probably 3 times the size of Knights Landing. Seriously it might be time for a new naming scheme.

Re:Bigger than the town by Penguinshit · 2014-01-04 20:36 · Score: 1

Winterfell?

--
I have something in common with Stephen Hawking...

How does the intercommunication work? by Animats · 2014-01-04 12:33 · Score: 4, Informative

OK, we have yet another mesh of processors, an idea that comes back again and again. The details of how processors communicate really matter. Is this a totally non-shared-memory machine? Is there some shared memory, but it's slow? If there's shared memory, what are the cache consistency rules?

Historically, meshes of processors without shared memory have been painful to program. There's a long line of machines, from the nCube to the Cell, where the hardware worked but the thing was too much of a pain to program. Most designs have suffered from having too little local memory per CPU. If there's enough memory per CPU to, well, run at least a minimal OS and some jobs, then the mesh can be treated as a cluster of intercommunicating peers. That's something for which useful software exists. If all the CPUs have to be treated as slaves of a control machine, then you need all-new software architectures to handle them. This usually results in one-off software that never becomes mature.

Basic truth: we only have three successful multiprocessor architectures that are general purpose - shared-memory multiprocessors, clusters, and GPUs. Everything other than that has been almost useless except for very specialized problems fitted to the hardware. Yet this problem needs to be cracked - single CPUs are not getting much faster.

Re:How does the intercommunication work? by dbIII · 2014-01-04 15:49 · Score: 1

Historically, meshes of processors without shared memory have been painful to program
Which is why we don't see those GPU cards in absolutely every place where there is a massively parallel problem to solve. Even 8GB is not enough for some stuff and you spend so much time trying to keep the things fed that the problem could already be solved on the parent machine.
Re:How does the intercommunication work? by Anonymous Coward · 2014-01-04 16:05 · Score: 0

Well that's a lovely comment, but this *is* in fact a shared-memory multiprocessor machine; all 72 cores on the die share the same 16GB on-chip RAM (and any other external RAM hooked up to it), so fundamentally it's just programming a 72-way cache-coherent SMP (well ok, NUMA) machine, similar to programming your regular SMP Intel machine, like a common-or-garden laptop or server. There are some quirks, like there only being 36 L2 caches on the chip, but your compiler should take care of figuring out the best strategy to deal with that once the appropriate support is added.
Re:How does the intercommunication work? by joib · 2014-01-04 19:47 · Score: 4, Informative

The mesh replaces the ring bus used in the current generation MIC as well as mainstream Intel x86 CPU's. Each node in the mesh is 2 CPU cores and L2 cache. The mesh is used for connecting to the DRAM controllers, external interfaces, L3 cache, and of course, for cache coherency. The memory consistency model is the standard x86 one. So from a programmability point of view, it's a multi-core x86 processor, albeit with slow serial performance and beefy vector units.
Re:How does the intercommunication work? by Animats · 2014-01-06 07:11 · Score: 1

Mod parent up. He's right. I thought this was one of Intel's experimental exotic machines. (Whatever happened to Intel's non-shared memory part built for academic research a few years ago?) But it's more vanilla, a big collection of x86 Atom CPUs on one chip.
Re:How does the intercommunication work? by dwater · 2014-01-07 00:38 · Score: 1

A quick scan of the second page of the linked article suggests to my "I know very little about this" mind that it is a CC-NUMA like design akin to that which SGI has been using for decades in its systems that scale to quite large numbers - iirc 2048 cores (hyperthreaded) with 64TB RAM. I guess it's a problem that is solved to some degree - ie, yes it can still be an issue that some applications are sensitive to and can require some specialist programming, especially to get the best performance from, but it can also present a massive amount of RAM to a single application. I imagine it works well for applications that process in a pipeline too with data passed through the system in an orderly manner, though I can't think of anything that has quite *that* many steps.
http://www.sgi.com/products/servers/uv/configs.html

--
Max.

Confusingly many items by Anonymous Coward · 2014-01-04 12:36 · Score: 0

The article appears to be about rumors mangled together. Socketed chip versus PCIe board, fabric over the PCIe versus over the host processor, while forgetting the chip-integrated possibility. If the DDR4 rumor is correct, it simply suggests that the coming Xeon sockets Intel use for serving HPC and low to mid-end server markets (E5) utilize 6 directly connected memory channels with maximum DIMM size of 64 GB.

I fail to see parallelism in CSS flow by tepples · 2014-01-04 12:44 · Score: 3, Insightful

I think you'd be surprised how many real world day to day tasks can be and are parallelized: [...] searching

I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?

rendering web pages

I don't see how rendering a web page can be fully parallelized. Decoding images, yes. Compositing, yes. Parsing and reflow, no. The size of one box affects every box below it, especially when float: is involved. And JavaScript is still single-threaded unless a script is 1. being displayed from a web server (Chrome doesn't support Web Workers in file:// for security reasons), 2. being displayed on a browser other than IE on XP, IE on Vista, and Android Browser <= 4.3 (which don't support Web Workers at all), and 3. not accessing the DOM.

compiling

True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.

Re:I fail to see parallelism in CSS flow by msobkow · 2014-01-04 12:48 · Score: 1

High performance RDBMS indexes do indeed parallelize scans and index searches.

--
I do not fail; I succeed at finding out what does not work.
Re:I fail to see parallelism in CSS flow by msobkow · 2014-01-04 12:49 · Score: 1

WTF is going on here? I typed "engines", not "indexes".
Is slashdot now EDITING posts before publishing them, or is Firefox screwing with me?

--
I do not fail; I succeed at finding out what does not work.
Re:I fail to see parallelism in CSS flow by Anonymous Coward · 2014-01-04 13:15 · Score: 2, Funny

Are you in Colorado?
Re:I fail to see parallelism in CSS flow by Anonymous Coward · 2014-01-04 15:00 · Score: 0

Actually, I remember seeing at least one talk on research into parallel rendering of web pages. I don't remember exactly how it worked, but they had to accept that in the worst case, parallelism would gain them nothing for the reasons you state. Web browser rendering is just a high value enough application that dealing with the awful lack of parallelizability in general is worth the trouble.
Re:I fail to see parallelism in CSS flow by Morpf · 2014-01-04 15:21 · Score: 1

I think you'd be surprised how many real world day to day tasks can be and are parallelized: [...] searching
I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?
Searching a large collection of non-indexed documents from disk is likely disk-bound, yes - unless you somehow formulated a very complex search or stream from multiple disks at a time - but maybe you are searching data already in RAM. Traversing an index isn't necessarily a serial process, depending on your data structure. There are parallel implementations for binary and red-black trees, as far as I know. Or one could simply use a forest of as many trees as one has searching threads (which will get worse performance when using fewer threads than trees). If you only have a sorted list or array you could use a parallel search. If your data is not indexed you are likely to be faster with multiple threads (if there is no other bottleneck like, for example, disk throughput). Maybe you are searching multiple things at the same time (like a string in authors and contents of e-mails) or you are searching with multiple parameters (filetype [type], last access after [date], string in content [foo]) where not all parameters are indexed.
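
A minimal Python sketch of the unindexed case, assuming the documents are already in RAM as a list of strings. The names scan_chunk and parallel_scan are made up for illustration, and in CPython the GIL means threads show the structure rather than a real speed-up; a process pool or a GIL-free runtime would be needed for that.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(docs, lo, hi, needle):
    """Linear scan of docs[lo:hi]; return indices of documents containing needle."""
    return [i for i in range(lo, hi) if needle in docs[i]]


def parallel_scan(docs, needle, workers=4):
    """Split the collection into roughly equal slices and scan them concurrently."""
    n = len(docs)
    step = max(1, -(-n // workers))  # ceiling division, at least 1
    bounds = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: scan_chunk(docs, b[0], b[1], needle), bounds)
        # pool.map yields results in slice order, so indices stay sorted.
        return [i for part in parts for i in part]
```

Each slice is independent, so the only serial work is splitting the range and concatenating the hits.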

rendering web pages
I don't see how rendering a web page can be fully parallelized. Decoding images, yes. Compositing, yes. Parsing and reflow, no. The size of one box affects every box below it, especially when float: is involved. And JavaScript is still single-threaded unless a script is 1. being displayed from a web server (Chrome doesn't support Web Workers in file:// for security reasons), 2. being displayed on a browser other than IE on XP, IE on Vista, and Android Browser <= 4.3 (which don't support Web Workers at all), and 3. not accessing the DOM.
I never stated that my problems are 100% parallelizable. ;) Parsing: Why not? Reflow: And if I have multiple boxes at the same layer? At least as long as the dimensions are fixed or bounded, some parallel processing could be possible; whether it would be a benefit, I can't tell.
Often enough there is more than one page opened at a time. With every open page the likelihood of executing multiple JavaScripts rises and with multiple pages getting rendered at the same time you can use parallelism, too.

compiling
True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.
Many steps can be parallelized, not all, as you pointed out. And even then I am not sure there wouldn't be a solution for whole-program / link-time optimization, but I'm no professional concerning compiler building. And even then: I happen to compile multiple binary files with one run of make most of the time, so using multiple threads is for free (there is a reason make has the -j option).
Re:I fail to see parallelism in CSS flow by Anonymous Coward · 2014-01-04 15:25 · Score: 0

Confronted with the possibility that either you mistyped a word or what is essentially malicious magic happened, you chose malicious magic. That's amazingly indicative of your thought processes.
Re:I fail to see parallelism in CSS flow by tepples · 2014-01-04 15:47 · Score: 1

If your data is not indexed you are likely to be faster with multiple threads (if there is no other bottleneck like, for example, disk throughput).
Or RAM throughput.

Parsing: Why not?
Sure, the browser can parse multiple CSS files or multiple HTML files or multiple JavaScript files at once, just as the browser can decode multiple images at once. But the parser for a single file is a state machine. In order to "drop the needle" halfway into the byte stream and start parsing the second half on the second core, the parser would first have to know what state the state machine was in as of halfway into the stream. What parallelization were you thinking of?

And if I have multiple boxes at the same layer?
Once the browser finishes loading stylesheets, one or more of the stylesheets can alter the "same layer" status. And even then, web browsers are a huge target for exploits, and it might be harder to prove correctness and thread safety of parallel layout code than of serial layout code, especially when it has to fall back and start over serially should a float end up discovered.

Often enough there is more than one page opened at a time.
But only one page at once is the visible tab in the frontmost window. True, it's possible to have multiple pages visible in multiple browser windows on a desktop operating system, but browsers for mobile devices have only one window. And if people who post comments to Slashdot stories about Windows 8 or web site styling are to be believed, "most" users maximize all browser windows even on desktop operating systems.

And even then: I happen to compile multiple binary files with one run of make most of the time, so using multiple threads is for free (there is a reason make has the -j option).
So do I, and this works well for projects that don't use whole-program optimization.
Re:I fail to see parallelism in CSS flow by msobkow · 2014-01-04 16:24 · Score: 1

Confronted with the fact that I proof-read my post, hit submit, and the comment posted was different than what I'd just proof-read, yes, I do presume something is fucking with the system.

--
I do not fail; I succeed at finding out what does not work.
Re:I fail to see parallelism in CSS flow by KingMotley · 2014-01-04 17:52 · Score: 1

I don't see how rendering a web page can be fully parallelized.
Parsing and reflow can be efficiently parallelized if enough parents have their heights determined by something other than their contents, for example if the main parts of the document have heights explicitly defined. Then they can be processed in parallel efficiently. Even without that, couldn't the children each be processed in parallel for a good portion of them, but possibly needing updating for properties that have dependencies outside of themselves? Yes, floats can cause some issues, but I rarely use them, and absolutely never without a direct parent that limits that problem anyhow (overflow:hidden). As such, a parent with overflow:hidden and a specific size wouldn't affect the following siblings no matter what floats the children did, and can safely be done in parallel.
Re:I fail to see parallelism in CSS flow by Mr+Z · 2014-01-04 18:16 · Score: 1

I thought searching a large collection of documents was disk-bound, and traversing an index was an inherently serial process. Or what parallel data structure for searching did I miss?

Two words: Map Reduce
Thank goodness Google doesn't linearly search the entire Internet every time I make a search. It'd get exponentially slower every year...

--
Program Intellivision!
Re: I fail to see parallelism in CSS flow by Anonymous Coward · 2014-01-04 20:16 · Score: 0

Cognitive error. You did not really see the rihgt word.
Just like in the previous sentence.
Nothing unreal exists, by definition.
Re:I fail to see parallelism in CSS flow by Anonymous Coward · 2014-01-05 03:00 · Score: 0

Thought exercise: Calculate SHA-256 of a long'ish document in sublinear time. Number of cores is not limited.
Once you have that worked out you can parallelize parsing. Only, this time it's realistic ;)
Re:I fail to see parallelism in CSS flow by Anonymous Coward · 2014-01-05 03:20 · Score: 0

traversing an index was an inherently serial process.
A single lookup in isolation will take O(log n) steps. This is pretty much as fast as any parallel process can achieve. In this case you can even do it with a single processor and you are complaining?
Something tells me you have a bigger problem (program) than just a lookup.
Re:I fail to see parallelism in CSS flow by Anonymous Coward · 2014-01-05 07:30 · Score: 0

In order to "drop the needle" halfway into the byte stream and start parsing the second half on the second core, the parser would first have to know what state the state machine was in as of halfway into the stream.
Split the grammar into inner and outer parts. Tokenization and brace matching are perfectly possible to parallelize due to small state space. You shouldn't have much of a problem parallelizing the rest after that unless the tree is somehow really unbalanced.
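
A rough Python sketch of the "small state space" trick, under loud assumptions: the two-state machine below (inside or outside a double-quoted string, counting `{` outside strings) is a toy of my own, not a real HTML/CSS/JS tokenizer, and it ignores escape sequences. Each chunk is run speculatively from every possible start state, then a cheap serial pass picks the right result per chunk.

```python
from concurrent.futures import ThreadPoolExecutor

STATES = (0, 1)  # 0 = outside a string literal, 1 = inside one


def run_chunk(chunk, start):
    """Run the toy state machine over chunk from `start`.

    Returns (end_state, number of '{' seen outside string literals).
    """
    state, braces = start, 0
    for ch in chunk:
        if ch == '"':
            state = 1 - state
        elif ch == "{" and state == 0:
            braces += 1
    return state, braces


def summarize(chunk):
    """Speculatively run the chunk from every possible start state."""
    return {s: run_chunk(chunk, s) for s in STATES}


def count_braces_parallel(text, pieces=4):
    size = max(1, -(-len(text) // pieces))  # ceiling division, at least 1
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(summarize, chunks))
    state, total = 0, 0
    for summary in summaries:  # serial stitch: one dict lookup per chunk
        state, braces = summary[state]
        total += braces
    return total
```

The price is doing the chunk work once per possible state, which is why this only pays off when the state space is small.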
Re:I fail to see parallelism in CSS flow by exomondo · 2014-01-05 12:57 · Score: 1

and traversing an index was an inherently serial process.
Traversing an index? What do you mean by "index"? Traversing data structures is certainly not inherently serial at all.

I don't see how rendering a web page can be fully parallelized.
Probably because he didn't say "fully parallelized", in fact "fully parallelized" would be that every element of something is parallel, which isn't ever the case, you are always going to have a serial element somewhere.

Parsing and reflow, no.
Why not? Just because there is a serial part to the problem doesn't mean you can't leverage parallel processing. If I have a set of elements that need to be processed serially that doesn't mean the task of processing of each element couldn't utilize multiple cores. Or even that I couldn't process each element to a degree that it provides enough information for another thread/process to start processing the next element.
Re:I fail to see parallelism in CSS flow by tepples · 2014-01-05 15:39 · Score: 1

Traversing an index? What do you mean by "index"? Traversing data structures is certainly not inherently serial at all.
So how would parallel speedup help with a linked list, a tree, or a hash table? Perhaps rotations of a balanced search tree could run in a separate thread, but that's about all I can think of. I admit to having fallen victim to the fallacy of lack of imagination.

Just because there is a serial part to the problem doesn't mean you can't leverage parallel processing
What I meant was that the serial part dominates long before 72 cores are used.

Or even that I couldn't process each element to a degree that it provides enough information for another thread/process to start processing the next element.
Which brings in synchronization delays when your parser thread has to pass such "enough information" about each element to another thread.
Re:I fail to see parallelism in CSS flow by exomondo · 2014-01-05 15:51 · Score: 1

So how would parallel speedup help with a linked list, a tree, or a hash table? Perhaps rotations of a balanced search tree could run in a separate thread, but that's about all I can think of. I admit to having fallen victim to the fallacy of lack of imagination.
Because - and perhaps I misspoke when I repeated the term 'traverse' - you are never just traversing a tree, you are always doing some sort of processing on the nodes (which can be passed to worker threads) and every branch in a tree is an opportunity to have other threads take up the traversal task.

What I meant was that the serial part dominates long before 72 cores are used.
In theory you could have a problem that is 99.99999% serial, but in practice you probably don't.

Which brings in synchronization delays when your parser thread has to pass such "enough information" about each element to another thread.
Of course you will have synchronization delays at some point but if that is the case then another possible way is that the parser thread goes through processing each element and when it has enough information to process the next element it pops the current one onto a worker queue for completion by another thread. There are many ways to go about this if you have a bit of imagination but it depends on your target hardware, your specific problem and the context of the problem (you may not want to thrash all your available resources on this one problem leaving nothing for other concurrent tasks).
Re:I fail to see parallelism in CSS flow by tepples · 2014-01-05 16:31 · Score: 1

I guess I'm so clueless that I don't know how much overhead there is in A. adding a single item to a worker queue and then B. waiting for a worker queue to have an item and retrieving that item. Having worked in a language with a Global Interpreter Lock, requiring the use of multiple processes rather than multiple threads for any parallelism beyond making I/O nonblocking, hasn't made the problem any better. But my intuition is that if one stage of processing spends all its time waiting to acquire the locks needed to add an item to the next stage's worker queue, that could negate the gains of fine-grained parallelism.
Re:I fail to see parallelism in CSS flow by exomondo · 2014-01-05 17:29 · Score: 1

I guess I'm so clueless that I don't know how much overhead there is in A. adding a single item to a worker queue and then B. waiting for a worker queue to have an item and retrieving that item.
Very little, it's just a threadsafe queue and a barrier for workers.

But my intuition is that if one stage of processing spends all its time waiting to acquire the locks needed to add an item to the next stage's worker queue, that could negate the gains of fine-grained parallelism.
Of course, but the idea that one stage would be spending "all its time waiting to acquire the locks needed to add an item to the next stage's worker queue" makes little sense in practice. It seems you're working more on theoretical situations that could arise from poor design, poor threading constructs and poor schedulers rather than actual issues or experience.

Intel's version of a IBM/Sony Cell CPU by Required+Snark · 2014-01-04 13:51 · Score: 2

This will have the same usability as the CELL CPU. From TFA:

Second, while Knights Landing can act as a bootable CPU, many applications will demand greater single threaded performance due to Amdahl’s Law. For these workloads, the optimal configuration is a Knights Landing (which provides high throughput) coupled to a mainstream Xeon server (which provides single threaded performance). In this scenario, latency is critical for communicating results between the Xeon and Knights Landing.

So there will be a useful mainstream CPU closely coupled with a bunch of vector oriented processors that will be hard to use effectively. (Also from TFA).

The rumors also state that the KNL core will replace each of the floating point pipelines in Silvermont with a full blown 512-bit AVX3 vector unit, doubling the FLOPs/clock to 32.

So unless there is a very high compute to memory access ratio this monster will spend most of its time waiting for memory and converting electrical energy to heat. Plus writing software that uses 72 cores is such a walk in the park...

--
Why is Snark Required?

Re:Intel's version of a IBM/Sony Cell CPU by dbIII · 2014-01-04 15:52 · Score: 1

Plus writing software that uses 72 cores is such a walk in the park
Some stuff actually is. It depends on how trivially parallel the problem is. With some stuff there is no interaction at all between the threads - feed it the right subset of the input - process the data - dump it out.
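The "feed it the right subset, process, dump it out" pattern is short enough to sketch. Everything here is a made-up stand-in (`process_chunk`, `split`, the sum of squares), and a thread pool is used only to keep the example self-contained; for CPU-bound work in CPython, `concurrent.futures.ProcessPoolExecutor` would be the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Stand-in for real work: no shared state, no talking to other chunks."""
    return sum(x * x for x in chunk)

def split(data, parts):
    """Cut data into `parts` contiguous slices of near-equal size."""
    size, extra = divmod(len(data), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        slices.append(data[start:end])
        start = end
    return slices

data = list(range(1000))
with ThreadPoolExecutor(max_workers=8) as pool:
    # Each worker gets its own subset of the input and returns one partial result.
    partials = list(pool.map(process_chunk, split(data, 8)))
total = sum(partials)
print(total)
```

The only coordination is the final `sum`, which is why this kind of problem scales to as many cores as there are chunks.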
Re:Intel's version of a IBM/Sony Cell CPU by Anonymous Coward · 2014-01-04 17:17 · Score: 0

I'm sure they will come up with a demo program that shows a fantastic speedup. I'm sure whoever had to write the demo program had to do a lot of analysis and tuning to get the fantastic speedup. And I'm sure the amount of work you need to do to get that speedup will be glossed over in all public marketing materials and research papers.
If they are able to sufficiently parallelize and pipeline the accesses to the DRAM, they will be able to feed the cores well enough to get a good speedup. Maybe lengthen the cache line so that you implicitly prefetch and get enough data to chomp on per cache miss.
Re:Intel's version of a IBM/Sony Cell CPU by Anonymous Coward · 2014-01-04 20:06 · Score: 0

>So unless there is a very high compute to memory access ratio this monster will spend most of its time waiting for memory and converting electrical energy to heat.
Nope. With 16 GB of memory on-package, that's 200 MB of memory per core. Contrast that with Cell's 256 kB per SPE on-die.
Re:Intel's version of a IBM/Sony Cell CPU by cnettel · 2014-01-04 23:24 · Score: 1

Plus writing software that uses 72 cores is such a walk in the park
Some stuff actually is. It depends on how trivially parallel the problem is. With some stuff there is no interaction at all between the threads - feed it the right subset of the input - process the data - dump it out.
More importantly, for some applications a limited amount of very low-latency/high-bandwidth communication is enough to give spectacular performance improvements. In those cases, the fully coherent x86 model, kept up by this kind of cache and memory architecture, will do wonders, compared to an MPI implementation with weaker individual nodes, but also possibly against (current) nVidia offerings. It's harder to say how it will stack up against Maxwell.
Re:Intel's version of a IBM/Sony Cell CPU by Anonymous Coward · 2014-01-06 03:32 · Score: 0

If your problem scales to 72 cores then it probably scales to 384 cores and you can use an existing GPU language. You only need KL when you need more flexible IPC and memory access patterns than a GPU.
Re:Intel's version of a IBM/Sony Cell CPU by dwater · 2014-01-07 00:42 · Score: 1

I sensed a high degree of speculation in TFA, so I wouldn't be so fast with those 'will's...not that I know either way.

--
Max.

QPI bad for NVIDIA by Anonymous Coward · 2014-01-04 13:56 · Score: 0

PCIe bandwidth is often the major performance bottleneck to GPGPU applications (i.e., NVIDIA's CUDA). If we suppose that Knights Landing's compute performance is always _worse_ than what NVIDIA can offer, Intel, with QPI, could still push NVIDIA out of HPC for bandwidth-constrained applications. That is, NVIDIA could have the greatest GPU, but if their GPU is stuck on pokey PCIe, Knights Landing + QPI could still offer better performance. It would be neat if Intel could let NVIDIA on the QPI bus, but that might not even be technically feasible even if Intel were willing to license the technology (as if that would ever happen). AMD also has HyperTransport/Fusion to leverage for their GPGPU solutions. NVIDIA needs to find something better than PCIe.

Then a dual core should be plenty by tepples · 2014-01-04 13:57 · Score: 1

Even if you are using a single process program, it can benefit from not having to share its core with the various system processes.

Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other. To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do), or run more than one user at once using dual monitors, dual keyboards, and dual mice (which most desktop PC operating systems tend not to support).

If the program uses async I/O, that counts as parallelism.

That counts as being I/O bound, and if all your processes are I/O bound, even a single core with simultaneous multithreading is enough.

Re:Then a dual core should be plenty by phantomfive · 2014-01-04 14:28 · Score: 1

Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other.
Wow, I just realized you are right, and got depressed.

--
"First they came for the slanderers and i said nothing."
Re:Then a dual core should be plenty by sjames · 2014-01-04 14:36 · Score: 2

Not necessarily. A process could be CPU bound and prefer not to make it worse by also waiting for I/O completion. Let another core drive the filesystem and talk to the block device (which might be a soft RAID).
My system frequently enough is busy compressing video or doing large compiles in the background while I work in the foreground.
If all you're doing is word processing, single thread speed isn't all that important either since it's mostly waiting for you to press a key.
Re:Then a dual core should be plenty by Anonymous Coward · 2014-01-04 14:47 · Score: 1

I've encountered instances where multiple cores helped the user experience because it distributed the CPU use by the malware on the machine.
Re:Then a dual core should be plenty by sjames · 2014-01-04 14:53 · Score: 1

Sad but true.
Re:Then a dual core should be plenty by tepples · 2014-01-04 15:03 · Score: 2

To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do)
Let another core drive the filesystem and talk to the block device (which might be a soft RAID).

My system frequently enough is busy compressing video or doing large compiles in the background while I work in the foreground.
Then you're not most users. I was under the impression that most users tend not to use soft RAID 5/6 or CPU-intensive file systems, compress large videos, or do large compiles. I too compress video and do compiles, but geeks such as you and myself are edge cases.
Re:Then a dual core should be plenty by dbIII · 2014-01-04 15:34 · Score: 1

Yes, we get it, it's not for everyone and there is still a lot of braindead software stuck in 1995 that should be multithreaded (due to the problem it is solving) but isn't, let alone the stuff that is going to be stuck on one thread forever. Meanwhile at least some stuff can use this thing.
For a lot of people bucketloads of memory is a better deal than large numbers of cores. For others there is no problem pegging all cores at 100% for days on end.

There's been this sort of discussion here ever since the two socket boards for the celeron 300A were cheap. For some people it was overkill, but for me I never wanted to go back to one core.
Re:Then a dual core should be plenty by dbIII · 2014-01-04 15:37 · Score: 1

Most users are not going to be able to justify the likely expense of the things the article is about. The edge cases are the ones that will think it's worth putting up the cash.
Re:Then a dual core should be plenty by sjames · 2014-01-04 15:45 · Score: 1

While most people probably don't do large compiles, the video compression is just for shows I record. In my case, it just happens to happen on a PC, others might use an appliance for that. My filesystem isn't particularly CPU intensive but no filesystem uses zero cycles.
The people not doing any of that probably wouldn't fully utilize the full speed of a single core either, so it's not much of an issue.
Re:Then a dual core should be plenty by tepples · 2014-01-04 16:09 · Score: 1

the video compression is just for shows I record.
For shows you record from OTA, cable, or satellite, it doesn't have to be significantly faster than real time. How many tuners does your PC have? You could put one video encode on each core, plus another core for the audio encodes. But then I confess ignorance as to how much CPU power it takes to encode video at, say, full 1080p/24.

My filesystem isn't particularly CPU intensive but no filesystem uses zero cycles.
True, which is why the file system would probably run on the second core of a dual core along with the rest of the "system processes".
Re:Then a dual core should be plenty by The+Grim+Reefer · 2014-01-04 17:06 · Score: 1

While most people probably don't do large compiles, the video compression is just for shows I record. In my case, it just happens to happen on a PC, others might use an appliance for that. My filesystem isn't particularly CPU intensive but no filesystem uses zero cycles.
The people not doing any of that probably wouldn't fully utilize the full speed of a single core either, so it's not much of an issue.
You seem to be forgetting gamers.
Re:Then a dual core should be plenty by Darinbob · 2014-01-04 17:43 · Score: 1

Memory access is also a shared resource here, so it can be treated as I/O in a way since it requires going through a shared bus. Some local calculation can be done with local instruction/data cache but there is going to be a lot of banging on that bus. Some modern popular languages are really terrible at making effective use of caches (heavily templated stuff for example). That many cores using a typical asynchronous threading model (ie, the stuff people run on PCs) will be a waste of the chip, better to use it for planned parallelized code, but in that case it's even better to use more explicitly parallel systems ala super computers or parallel computers.
Re:Then a dual core should be plenty by sjames · 2014-01-04 18:02 · Score: 1

Games are parallelizable and since game programming has a long history of going to extremes for performance, parallel code isn't much of an ask.
Re:Then a dual core should be plenty by sjames · 2014-01-04 18:08 · Score: 1

I haven't studied it very carefully, but I do know that 5 cores was significantly faster than real time (re-encoding MPEG2 to mp4) but 2 cores falls behind even if I don't do anything else. There's a lot of trade off there, if I accept less compression or lower quality video, it needs less CPU to accomplish it.
Re:Then a dual core should be plenty by chmod+a+x+mojo · 2014-01-04 19:14 · Score: 1

But then I confess ignorance as to how much CPU power it takes to encode video at, say, full 1080p/24
A lot. 1080p on a Core 2 Duo running at 3.17 GHz with H.264: you are looking at 3-5 FPS at medium quality using both cores. The i5s didn't get significantly faster (read: usably faster, though they are faster), and I doubt the i7s did either.
With H.264 the more cores the better, you get roughly 60-80% speedup per core added. This translates to higher quality encodes at realtime if you start throwing more cores at the encoder.
Does everyone need this? Hell no, but to those of us that could use more cores it would be awesome. As an added bonus think of a render farm of these used for movies, things that took a year to render on hundreds of single / dual core machines could be done in days or weeks on the same number of machines.... or you could just go with smaller networks of render boxes making it easier to manage.

--
To err is human; effective mayhem requires the root password!
Re:Then a dual core should be plenty by smash · 2014-01-04 19:36 · Score: 1

I'd say probably 80-90% of normal users are doing video transcoding these days.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Then a dual core should be plenty by sjames · 2014-01-04 22:32 · Score: 1

Many multi-core CPUs also have multiple memory channels. If the program has good data locality, that's a big win.
Re:Then a dual core should be plenty by Rockoon · 2014-01-05 02:19 · Score: 1

I'd say probably 80-90% of normal users are doing video transcoding these days.
The bubble you are living in seems very opaque.

--
"His name was James Damore."
Re:Then a dual core should be plenty by exomondo · 2014-01-05 11:39 · Score: 1

Then there's not really much of a benefit to adding more than a dual core, which will probably end up running the application with which the user is interacting on one core and the background applications and system processes on the other. To go beyond that, you have to either parallelize the application, run more than one CPU-bound application at once (which most desktop PC users tend not to do), or run more than one user at once using dual monitors, dual keyboards, and dual mice (which most desktop PC operating systems tend not to support).
No, anytime you are doing multiple things at once you are better off with more cores. If you are watching a video or playing music or encoding/decoding content like running a media server that is serving other devices while you are doing other things on your system then more cores is going to speed up the system. You are trying to generalize it for all use cases but not all use cases are the same, it all depends on what you are doing with your system, how many programs you are running at once and the ability of those programs to exploit parallel processing power.
Re:Then a dual core should be plenty by exomondo · 2014-01-05 11:48 · Score: 1

But plenty of people run streaming media servers (XBMC, PS3 media server, iTunes, VideoLAN) that do video/audio conversion or transcode media on the fly and certainly plenty of people have DVRs (multicore CPUs and applications is in no way limited to PCs) with the capability to encode/decode, record and playback multiple video streams at a time.
Re:Then a dual core should be plenty by exomondo · 2014-01-05 12:07 · Score: 1

For shows you record from OTA, cable, or satellite, it doesn't have to be significantly faster than real time. How many tuners does your PC have? You could put one video encode on each core, plus another core for the audio encodes. But then I confess ignorance as to how much CPU power it takes to encode video at, say, full 1080p/24.
Well obviously that recommendation is pretty baseless, unless the specific CPU's individual cores are sufficient to handle encoding video at that resolution for the specific chosen codec in realtime then you may need more cores, and that's before considering if the encoder is optimized for parallel processing and that I/O is not the bottleneck.
Re:Then a dual core should be plenty by smash · 2014-01-05 13:11 · Score: 1

Not sure what you mean by that, but every man and his dog has a smartphone and a lot of them shoot video with it to either upload elsewhere (transcoded on their phone) or play with on their PC. Heaps of people run home servers to serve video to PS3, etc. They may not know it as transcoding, but there's a lot of format/codec shifting out there.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

AT&T Network On A Chip by Anonymous Coward · 2014-01-04 14:08 · Score: 0

Why stop at a meager 72 cores!

Why not, A Core For Every Instruction!

Then the "nasty" moves to the "central message passing unit parallelized on vectors Ernestine" what a laugh-n!

This of course ignites the "Core Arms Race"!

Then the "USA" vs "World" launch into a "Core Per Syllable Arms Race"!

The butt-zillions of oil-dollars will be spent wildly at an Above Fast N Furious Pace to the ultimate CORE THOUGHT paradigm shift.

Yet, there is the "NSA Wild Card."

[Dealer] What's your bet?

[High Roller] Hold! [;-)]

[Dealer] "Pervert fucker!" ;-)

Imagine, 2 by Futurepower(R) · 2014-01-04 14:28 · Score: 2, Funny

Imagine having one of those in your smartphone. You could answer text messages 1 microsecond faster. The battery life wouldn't be good.

Re:Imagine, 2 by aliquis · 2014-01-04 16:47 · Score: 1

Even better, imagine them on Wall Street. The economy would grow like.. nothing?! +5% / year just because of new processors?

72 slow as shit cores by Anonymous Coward · 2014-01-04 14:31 · Score: 0

is still gonna be horrible except in very specific, and highly multithreaded scenarios... some shared parts between cores on a tile.. simply looks like intel is copying amd's current architecture (intel 'tile' = amd 'module'), mixing in an advance of their own (the on-die 'edram'), and throwing more cores at it than even amd has tried.

to keep thermals in check for a single socket package, you're looking at something that draws less than 2w per core under load.. that's a little less than what a current silvermont n2805 mobile celeron draws (4.3w for 2 cores, 1 thread per core), which delivers a whopping ..wait for it.. 228 passmarks per core. the original atom 230 scores a 301 passmarks, and we know how fast that piece of shit was. compare to a current low end desktop chip that runs win vista or newer with 'usable' performance when paired with 4gb ram that scores about 2000 passmarks on 2 cores

Re:72 slow as shit cores by Anonymous Coward · 2014-01-05 00:43 · Score: 0

Intel or AMD, whatever. They are both copying TilePro64 which did this six years ago.

compilation often not just one single program by Chirs · 2014-01-04 14:52 · Score: 1

In my experience, most cases where compilation takes a long time involve multiple compilation units. I have a fair bit of experience with compiling linux distros professionally...when you're building glibc and the kernel and five hundred other packages it'll use as many cores as you can throw at it.

-fwhole-program --combine by tepples · 2014-01-04 15:15 · Score: 1

True, each translation unit can be compiled in parallel if you choose not to enable whole-program optimization. But I don't see how whole-program optimization can be done in parallel.

In my experience, most cases where compilation takes a long time involve multiple compilation units.

That's what I said. But a lot of times nowadays, the compiler is set to perform whole-program optimization on release builds to try to save cycles even in calls from a function in one translation unit of a program to a function in another. Mozilla's Firefox web browser, for example, is so big that it can't be compiled with profile-guided whole-program optimization on 32-bit machines. But I'll grant that a multi-core CPU speeds up debug builds.

when you're building glibc and the kernel and five hundred other packages

Not many people are maintainers of an operating system distribution.

cool by Anonymous Coward · 2014-01-04 15:28 · Score: 0

when is this new CPU going to be available at Frys

Reflow in web browsers and word processors by tepples · 2014-01-04 16:00 · Score: 1

As I wrote elsewhere: laying out a web page that includes float-styled elements. That fits 1) and 2), and it fits 3) on a netbook or tablet with an ARM or Atom processor. Or repaginating a document in a word processor, which happens every time the user enters enough text to make the current paragraph one line longer, deletes enough to make it one line shorter, or changes the styling of any span of text. Repagination may affect figures, references to page numbers elsewhere in the document, etc. Repaginating text after the visible page can be deferred unless there's a "See page n" elsewhere in the document, which may even end up triggering repagination of text before the edit if the new page number has more or fewer digits than the old page number.

Also the PBKDF2 key stretching used to connect to a WPA2 access point, when run on a similarly slow machine.

Also compressing a large still image. I don't see how the DEFLATE codec used by, say, PNG can be parallelized.

Re:Reflow in web browsers and word processors by godrik · 2014-01-04 17:31 · Score: 1

There are parallel strategies to do some of these things. As far as I know text layout is mostly done with dynamic programming algorithms. These algorithms are usually very parallel.
Even if they are not, you can always use some kind of speculative algorithm to deal with that. You assume the 3 most likely scenarios for line 1 and, while line 1 is being processed, you lay out line 2 multiple times using different assumptions about line 1. This will not give you perfect parallelism but it will give you some improvement.
As for image decompression: I am not sure how PNG works, but parallel compression and decompression of text, images, video and sound streams has kept the parallel programming community active for a very long time. If PNG really cannot be parallelized (which I doubt, but it might not be efficient) then we will move over time to different formats that support parallel processing.
Many algorithms have been claimed to be impossible or difficult to parallelize. With time, they are falling one after the other. The ones that are inherently sequential are often replaced with "just as good" algorithms that can be performed in parallel.
Have faith in the parallel computing scientific community, we are here to help! :)
Re:Reflow in web browsers and word processors by KingMotley · 2014-01-04 17:37 · Score: 1

I don't see how the DEFLATE codec used by, say, PNG can be parallelized.

There are multiple ways to implement the deflate codec, and some compress better than others on different source materials. The best implementations would try multiple variants in parallel and discard all but the best result. For a current example, run PNGOUT, OptiPNG, and DeflOpt in parallel on each PNG and keep only the smallest output. Approaches that try even more variants are possible and likely to produce still smaller files, albeit with diminishing gains.
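A scaled-down version of "try several variants at once, keep the smallest" can be sketched with Python's standard `zlib` module. This is not what PNGOUT, OptiPNG, or DeflOpt do internally; it only varies zlib's own strategy setting, and the names `STRATEGIES`, `deflate_with`, and `best_deflate` are invented for the example. As far as I know CPython's zlib releases the GIL while compressing, so the threads can genuinely overlap here.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Three of zlib's documented compression strategies; each is one "variant".
STRATEGIES = {
    "default": zlib.Z_DEFAULT_STRATEGY,
    "filtered": zlib.Z_FILTERED,
    "huffman_only": zlib.Z_HUFFMAN_ONLY,
}

def deflate_with(strategy_name, data):
    """Compress data at level 9 with one strategy; return (name, compressed bytes)."""
    compressor = zlib.compressobj(level=9, strategy=STRATEGIES[strategy_name])
    return strategy_name, compressor.compress(data) + compressor.flush()

def best_deflate(data):
    """Run every strategy concurrently and keep only the smallest output."""
    with ThreadPoolExecutor(max_workers=len(STRATEGIES)) as pool:
        candidates = list(pool.map(lambda name: deflate_with(name, data), STRATEGIES))
    return min(candidates, key=lambda pair: len(pair[1]))

sample = bytes(range(256)) * 64   # synthetic input, not real image data
name, blob = best_deflate(sample)
print(name, len(blob), "bytes from", len(sample))
```

Every candidate is a valid zlib stream, so whichever one wins can be decoded by any ordinary inflate implementation. The parallelism is across variants, not inside a single DEFLATE stream.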
Re:Reflow in web browsers and word processors by UnknownSoldier · 2014-01-05 11:05 · Score: 1

> I don't see how the DEFLATE codec used by, say, PNG can be parallelized.
That's because deflate is a crappy compression codec.
Try a compression algorithm like lz4 which is a) lossless, b) multi-core, and c) fast
http://code.google.com/p/lz4/

QPI vs PCIe? by unixisc · 2014-01-04 16:27 · Score: 1

I just read up on QPI on wiki, and it's a point to point processor interconnect, which replaces the front side bus in Xeon and certain desktop platforms - presumably the Core i7. PCIe, OTOH, is a serial computer expansion bus standard, which can take in things like graphics cards, SSDs, network cards and other such peripheral controllers. I just don't see how QPI is any sort of a replacement for PCIe. That would almost be like arguing for PCIe being superseded by USB4 or something.

Essentially, QPI is Intel's equivalent of the HyperTransport that AMD uses. The PCIe part of it is completely separate - I doubt one will have QPI graphics cards or SSDs

Re:QPI vs PCIe? by godrik · 2014-01-04 16:54 · Score: 1

QPI is not meant as a replacement for PCI-e. That's just the technology that links multiple processors together (and memory controllers). KNL is essentially the next generation MIC processor. The current generation is KNC which is a separate PCI-e card. I think it is in that sense that QPI replaces PCI-e.
Re:QPI vs PCIe? by Anonymous Coward · 2014-01-05 16:16 · Score: 0

This isn't a stretch at all. QPI and PCIe are both data buses. On the Intel architecture, the PCIe bus is even snooped to maintain memory coherency! It's really quite incredible: A memory write from a PCIe device can trigger an L1 cache eviction in a CPU! This is possible because PCIe bus traffic crosses over QPI when it reads/writes host/system/main memory. Hooking a GPU up to QPI would merely remove the middleman (the root PCIe controller). The GPU just needs to know how to "speak QPI" instead of PCIe.
To my knowledge, only HPC really stresses the PCIe bus today. Gaming does not (can anyone point to a benchmark where PCIe 3.0 really made a difference?). With Knights Landing on QPI, Intel could slow down PCIe advancement in order to make PCIe-based HPC solutions (i.e., NVIDIA's) less attractive, without harming Intel's non-HPC customers. That is, Intel could leverage its control over the chipset to push NVIDIA out of HPC. I'd much rather see Intel and NVIDIA compete toe-to-toe on the virtues of their compute architectures, not through technology licensing tricks.

ipad by goombah99 · 2014-01-04 16:43 · Score: 4, Funny

They tested this for the next ipad. While apple felt the 5 second battery life was too short to be practical, the beta testers were more concerned about the apple shaped 3rd degree burns imprinted on their thighs and palms

--
Some drink at the fountain of knowledge. Others just gargle.

Re:ipad by kthreadd · 2014-01-05 00:39 · Score: 1

the beta testers were more concerned about the apple shaped 3rd degree burns imprinted on their thighs and palms
Some people would see this as a feature.
Re: ipad by Macthorpe · 2014-01-05 01:56 · Score: 5, Funny

To be fair, Apple are very committed to branding.

--
"It does not do to leave a live dragon out of your calculations, if you live near him." - Tolkien
Re: ipad by goombah99 · 2014-01-05 04:02 · Score: 1

LMAO

--
Some drink at the fountain of knowledge. Others just gargle.

Perfect application by Anonymous Coward · 2014-01-04 17:30 · Score: 0

This will be awesome for animating all those tiles on the windows 8 home screen. Might not even be enough cores for that.

Wow. by Ralph+Spoilsport · 2014-01-04 17:43 · Score: 1

My slow ass typing in MS Word will be FASTER than ever!

--
Shoes for Industry. Shoes for the Dead.

Apparently ... by Press2ToContinue · 2014-01-04 18:01 · Score: 1

you aren't doing much on your computer. Try doing special effects graphics, or stock market analysis. Or even just start up an Android emulator - it's excruciatingly slow.

--
Sent from my ENIAC

audio by tleaf100 · 2014-01-04 23:40 · Score: 1

how would one of these do for software synths and audio processing? from my past use, these appear to use up any and everything you can throw in a pc, some of them need fat gpu's as well on top of as much cpu/ram as possible.

Anonymous Coward FTW by gentryx · 2014-01-05 06:53 · Score: 1

Thanks.

--
Computer simulation made easy -- LibGeoDecomp

Now that's a name... by Anonymous Coward · 2014-01-05 09:05 · Score: 0

Now, that's a name I've not heard in a long time. A long time.

Cores wait for their turn to read RAM by tepples · 2014-01-05 15:25 · Score: 1

You make a good point about use of a distributed index. But implementing a distributed index on separate machines will probably lead to far less RAM contention than implementing it on several cores that share one memory.

Re:Cores wait for their turn to read RAM by Mr+Z · 2014-01-05 16:03 · Score: 1

Well, even on a shared memory, certain data structures are latency bound, not throughput bound.
For example, consider a linked list. If none of the 'next' pointers are in cache, then you spend a full round-trip to DDR to get the next 'next' pointer. Depending on the machine, that could be anywhere from 50 to 150 cycles of latency, but not a huge hit on throughput.
Generalizing only slightly: a single processor chasing pointers will have a hard time maxing out the DDR throughput, although it will definitely be memory bottlenecked due to latency. Multiple processors all doing the same thing on the same memory will not, as a result, compete for bandwidth. Instead, their requests will execute in turn in the DDR, and you will be able to get some decent scaling up until the point where you have enough parallel requestors to start actually taxing the bandwidth.
If you bring disk accesses in the picture, you have some additional opportunities for scaling, if only some of the threads go to disk, while others hit in DDR. But, I grant that the crux of my argument assumes that accesses to DDR from a single thread bottleneck on latency, not throughput.
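A back-of-the-envelope version of that argument: if each pointer chaser has exactly one cache-line request in flight, the bandwidth it can consume is the line size divided by the round-trip latency. The figures below (100 ns round trip, 64-byte lines, 25.6 GB/s peak) and the function names are assumptions picked for illustration, not numbers for any particular machine.

```python
def chase_bandwidth_gbs(latency_ns, line_bytes=64, chasers=1):
    """Bandwidth consumed when each chaser has one cache-line request in flight."""
    return chasers * line_bytes / latency_ns   # bytes per ns is the same as GB/s

def chasers_to_saturate(peak_gbs, latency_ns, line_bytes=64):
    """How many independent pointer chasers it takes to reach peak bandwidth."""
    return peak_gbs / chase_bandwidth_gbs(latency_ns, line_bytes)

# Assumed, illustrative figures only.
one = chase_bandwidth_gbs(latency_ns=100)
needed = chasers_to_saturate(peak_gbs=25.6, latency_ns=100)
print(f"one chaser uses about {one:.2f} GB/s; about {needed:.0f} chasers to reach peak")
```

Under those assumptions a single latency-bound thread leaves almost all of the memory bandwidth idle, which is the room for scaling described above.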

--
Program Intellivision!

Widths and fonts vary by tepples · 2014-01-05 15:31 · Score: 1

Parsing and reflow can be efficiently parallelized if sufficient parents have their heights determined by something other than their contents

Good luck determining the height of, say, a Slashdot comment (or, worse, a Slashdot page's entire comment section) other than by its contents. No, heights can't practically be fixed server-side because different machines have different viewport widths, different fonts installed, and different hinting algorithms that affect letter spacing. All of these affect how many lines a paragraph uses.

Even without that, couldn't the children each be processed in parallel for a good portion of them, but possibly needing updating for properties that have dependencies outside of themselves?

Only for documents that don't have floats and declare explicit heights for everything, which I don't think includes the majority of documents.

Long quoted strings by tepples · 2014-01-05 15:33 · Score: 1

Say your parser has been parsing several kilobytes of a document, and it hits a quotation mark character (U+0022 or U+0027). Is it an open quote, starting a string, or a close quote, which means throw out everything it has parsed so far and treat it as the end of a string?

Re:Long quoted strings by Anonymous Coward · 2014-01-05 22:58 · Score: 0

If we go that simple, the problem reduces to knowing whether there is an even or odd number of quotes before that point in the program. This can be computed with a mod-2 version of the parallel prefix sum algorithm, which takes 2*log2(n) steps and 2*n work.
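Here is the quote-parity idea as a sketch. Note that this uses the simpler Hillis-Steele style scan (about log2(n) rounds but n*log2(n) total work), not the work-efficient 2*n variant mentioned above, and the rounds are simulated in an ordinary loop rather than run on parallel hardware. The function name `quote_parity_scan` is made up for the example.

```python
def quote_parity_scan(text, quote='"'):
    """Inclusive prefix XOR: result[i] is 1 if text[:i+1] holds an odd number of quotes."""
    state = [1 if ch == quote else 0 for ch in text]
    step = 1
    rounds = 0
    while step < len(state):
        # Within one round every position is independent of the others, so a
        # parallel machine could update all of them in the same time step.
        state = [state[i] ^ state[i - step] if i >= step else state[i]
                 for i in range(len(state))]
        step *= 2
        rounds += 1
    return state, rounds

parity, rounds = quote_parity_scan('a "b" c "d')
print(parity, rounds)
```

A parity of 1 at some position means that position is inside a string, so any quote found there is a closing quote.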
With a real language with multiple states (With C we'd have 12, assuming no tri- and digraphs) this becomes parallel prefix product which has the same complexity but uses a slower primitive operation. With enough processors this is still much faster than going through the whole document sequentially.
Yes, this is about 12 times slower than if you knew the state beforehand, but there are a number of optimizations after that which make the amount of total work not so different from sequential algorithm but they are not particularly relevant for the parallelization itself. (eg. only consider chars ", ', /, *, \, LF, and the first one of streaks of "something else")
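
A minimal sketch of the idea, under loud assumptions: a toy three-state lexer (normal, inside "..." and inside '...') instead of the 12 states mentioned for C, no escape handling, and a chunked two-pass scan rather than a log-depth tree. Every character becomes a small state-to-state table; composing tables is associative, so each chunk can be summarized independently and a short scan over the summaries recovers the state at the start of each chunk. All names here are made up for illustration.

```python
from functools import reduce

# Toy lexer states (an assumption; a real C lexer needs more of them).
NORMAL, IN_DQ, IN_SQ = 0, 1, 2
IDENTITY = (NORMAL, IN_DQ, IN_SQ)

def transition(ch):
    """Map one character to a state -> state table, stored as a tuple."""
    if ch == '"':
        return (IN_DQ, NORMAL, IN_SQ)   # opens or closes a "..." string
    if ch == "'":
        return (IN_SQ, IN_DQ, NORMAL)   # opens or closes a '...' string
    return IDENTITY

def compose(f, g):
    """Table for 'apply f, then g'. Associative, which a scan requires."""
    return tuple(g[f[s]] for s in range(len(f)))

def summarize(chunk):
    """Collapse a whole chunk into one table; needs no outside context."""
    return reduce(compose, map(transition, chunk), IDENTITY)

def chunk_start_states(text, chunk_size):
    """Lexer state at the start of each chunk, assuming the text starts in NORMAL."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    summaries = [summarize(c) for c in chunks]   # the parallelizable pass
    states, prefix = [], IDENTITY
    for table in summaries:                      # short scan over the summaries
        states.append(prefix[NORMAL])
        prefix = compose(prefix, table)
    return states

def sequential_state_before(text, pos):
    """Reference answer: walk the text character by character up to pos."""
    state = NORMAL
    for ch in text[:pos]:
        state = transition(ch)[state]
    return state
```

Once each chunk knows its start state, it can be lexed for real in a second independent pass. The even/odd quote count is the special case with two states.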

It's unreal, it's an engine, and it exists by tepples · 2014-01-05 15:35 · Score: 1

Nothing unreal exists, by definition.

Except, of course, for Unreal and other games using Unreal Engine.

Economies of scale benefit common use cases by tepples · 2014-01-05 15:54 · Score: 1

No, anytime you are doing multiple things at once you are better off with more cores. If you are watching a video

Foreground application. I'll grant that multiple cores help with decoding really big (1080p or bigger) video, but so does a specialized H.264-specific core or moving half of the decoder to OpenCL.

or playing music

Background application. On an Intel Core CPU, decoding music uses so little CPU power nowadays that it stays within single digit percent utilization of a core. Even on the puny little Atom N450 CPU (1 core, 2x SMT) in my four-year-old Dell netbook, I just measured VLC playing an ogg file at 15% of one half-core.

or encoding/decoding content like running a media server

You said the S word. When a "server" enters the picture, I agree that larger core counts become easier to justify, as background processing begins to dominate. But I'd like to see statistics on how popular PC-based home media servers are in the first place.

You are trying to generalize it for all use cases but not all use cases are the same

I'm trying to find what use cases are most common because economies of scale benefit the most common use cases.

Re:Economies of scale benefit common use cases by exomondo · 2014-01-05 16:13 · Score: 1

Foreground application.
Nope, not necessarily.

Background application.
It doesn't matter that it's a background application, you can't have infinite background applications running on just one core. When I'm rendering a 3D scene it is the background application too. Encoding music and video for your portable devices from other formats or from physical media is also a background task, but one that is probably more intensive than a foreground web browsing task.

You said the S word. When a "server" enters the picture, I agree that larger core counts become easier to justify, as background processing begins to dominate. But I'd like to see statistics on how popular PC-based home media servers are in the first place.
Plenty of people use XBMC, iTunes, PS3 Media Server, Twonky, Windows Media Server, VideoLAN, Firefly, Quicktime, TVersity, etc... it isn't a niche market and such people wouldn't consider running these applications on their system to mean it's suddenly a "server".

I'm trying to find what use cases are most common because economies of scale benefit the most common use cases.
No you said 1 core for foreground and 1 core for background, but that is completely unjustified. Look at what you replied to:
Even if you are using a single process program, it can benefit from not having to share its core with the various system processes.
So your use case is if the foreground process is single-threaded and all your background processing is single-threaded and non-intensive and the combination of it adds up to less than the processing power of one core - which itself is not a unit of measure. Speaking in terms of number of cores makes no sense anyway. It depends not only on the use cases but also the problems to be solved and the implementation of the applications to solve those problems.
Re:Economies of scale benefit common use cases by tepples · 2014-01-05 16:50 · Score: 1

It doesn't matter that [playing audio is] a background application, you can't have infinite background applications running on just one core.
Most PCs have only one sound card, and most people listen to one musical recording at once. More cores do make sense for people producing music, which might involve dozens of tracks feeding into a tree of DSP effects and mixers.

Encoding music and video for your portable devices from other formats or from physical media is also a background task, but one that is probably more intensive than a foreground web browsing task.
Encoding an entire album's worth of music is embarrassingly parallel: one core for each track. Video is a bit harder. First, what kind of video does a home user have the legal right to transcode "from other formats or from physical media" without unlawfully circumventing a technical measure? Second, video typically isn't stored with each chapter as a separate file; instead, chapter stops tend to be treated as cue points. So how is video encoding typically split up across threads?
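
For the audio half of that, a minimal sketch of "one worker per track". It assumes zlib.compress as a hypothetical stand-in for a real audio encoder, and the function names are invented for illustration. No track needs anything from any other track, so the pool can hand them out in any order.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def encode_track(pcm_bytes):
    """Hypothetical stand-in for an audio encoder; compresses one track."""
    return zlib.compress(pcm_bytes, 6)

def encode_album(tracks, workers=4):
    """Encode every track independently; results come back in track order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_track, tracks))

if __name__ == "__main__":
    album = [bytes([n]) * 100_000 for n in range(12)]   # twelve fake tracks
    encoded = encode_album(album)
    print([len(e) for e in encoded])
```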

But I'd like to see statistics on how popular PC-based home media servers are in the first place.
Plenty of people use XBMC, iTunes, PS3 Media Server, Twonky, Windows Media Server, VideoLAN, Firefly, Quicktime, TVersity, etc
I'd like to see statistics on "plenty", especially among people using them for real-time transcoding rather than just viewing on the local machine or doing something disk- or net-bound like sending raw files or remuxed streams of already-encoded video over the network. And if one PC in the house is the media server, the other PCs in the house don't also need to be the media server.

It depends not only on the use cases but also the problems to be solved
Sure, the processor of the featured article is massively multicore because it fits the nature of "the problems to be solved" in supercomputing. But to what extent do "the problems to be solved" on most home computers admit a likewise massively multicore design?
Re:Economies of scale benefit common use cases by exomondo · 2014-01-05 17:48 · Score: 1

Most PCs have only one sound card, and most people listen to one musical recording at once. More cores do make sense for people producing music, which might involve dozens of tracks feeding into a tree of DSP effects and mixers.
Obviously, and playing music is only one of virtually infinite possibilities for a multitude of background tasks.

Encoding an entire album's worth of music is embarrassingly parallel: one core for each track. Video is a bit harder. First, what kind of video does a home user have the legal right to transcode "from other formats or from physical media" without unlawfully circumventing a technical measure?
I don't know, frankly I don't really care. If you don't use all the cores of your CPU I also don't care.

Second, video typically isn't stored with each chapter as a separate file; instead, chapter stops tend to be treated as cue points. So how is video encoding typically split up across threads?
Encoding video is a very parallel process and extensively documented and certainly beyond the scope of a reply here. Google is really easy to use so I suggest you educate yourself on this. But you appear to be looking at it without any knowledge of the actual process of encoding or the concept that data can be split up or combined and just because it is presented to you in terms of "tracks" or "chapters" or "files" doesn't mean you have to process it that way.

I'd like to see statistics on "plenty"
Then perhaps you should study it before dismissing CPUs with more than 2 cores.

Sure, the processor of the featured article is massively multicore because it fits the nature of "the problems to be solved" in supercomputing. But to what extent do "the problems to be solved" on most home computers admit a likewise massively multicore design?
Most problems are able to exploit massively parallel processors; it's just that it has typically been unnecessary to leverage such processors. If you don't need it then don't buy it. If it becomes cheaper than dual core someday, then buy it because it's cheaper and pretend the redundant cores aren't there if you really have to.
Perhaps you should research and provide a well-reasoned basis for your recommendations in future.

Adoption by tepples · 2014-01-05 16:09 · Score: 1

Good luck getting both browser makers and web site publishers to adopt a PNG variant using lz4. The biggest thing that led to PNG adoption in the first place was Unisys's LZW patent assertion. Besides, even after decompression, PNG decoding has a filtering phase where each line depends on the line above it. That can be parallelized by adding unfiltered lines at compression time at the cost of compression ratio.
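
A rough sketch of where those break points would fall. It relies on the PNG filter type numbering (0 None, 1 Sub, 2 Up, 3 Average, 4 Paeth), of which only Up, Average and Paeth read the scanline above; the function name is made up for illustration. A row filtered with None or Sub can start a run that is unfiltered without waiting for the rows before it.

```python
# PNG filter types that never read the previous scanline.
ROW_INDEPENDENT = {0, 1}   # 0 = None, 1 = Sub

def independent_segments(filter_types):
    """Group scanline indices into runs that can be unfiltered independently.

    `filter_types` holds the filter-type byte of each scanline, top to bottom.
    """
    segments = []
    for row, ftype in enumerate(filter_types):
        # Row 0 always starts a run: its "row above" is defined as all zeros.
        if row == 0 or ftype in ROW_INDEPENDENT:
            segments.append([row])
        else:
            segments[-1].append(row)
    return segments

if __name__ == "__main__":
    print(independent_segments([2, 2, 0, 2, 4, 1, 3]))
```

An encoder that picks the best filter per line will rarely leave such breaks, which is why forcing them costs compression ratio.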

How is DDR pipelined? by tepples · 2014-01-05 16:19 · Score: 1

Generalizing only slightly: a single processor chasing pointers will have a hard time maxing out the DDR throughput, although it will definitely be memory bottlenecked due to latency. Multiple processors all doing the same thing on the same memory will not, as a result, compete for bandwidth. Instead, their requests will execute in turn in the DDR

Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight? To take an analogy between DDR and that other DDR, are the requests like a column of arrows on the screen, where I see each arrow a measure before I have to hit it?

Besides, in a RAM latency-bound situation, there's little benefit of multiple full hardware cores over the virtual cores in a simultaneous multithreading architecture such as Intel's Hyper-Threading Technology or the "modules" that AMD introduced with Bulldozer. Furthermore, keeping all these requests in flight requires some sort of synchronization among threads, which when implemented wrong introduces plenty of locking overhead.

Re:How is DDR pipelined? by Mr+Z · 2014-01-05 16:53 · Score: 1

Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight?

Actually, that's pretty much exactly how it works. If you have a bunch of independent requests to DDR—and by independent, I mean that the processor(s) do not stall waiting for the information from one request in order to make the next—then you can get multiple requests in flight and they can pipeline. Streaming works this way, for example. The STREAM benchmark is a textbook example of a benchmark dominated by throughput, where all the accesses are independent. For example, a[i] = b[i] + c[i] does not depend on a[i - K] = b[i - K] + c[i - K] or a[i + K] = b[i + K] + c[i + K] for any value of K in STREAM's "Add" loop. All four loops of the benchmark have that character. So as long as the processor can get enough work in-flight, it can get multiple cache misses outstanding to DDR. And if one processor and its caches have limited ability to 'execute ahead' like this, multiple processors (or multiple independent threads on the same processor) acting independently can fill in those gaps.
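
As back-of-the-envelope arithmetic, here is a simplified model that assumes a fixed latency and a fixed issue interval, using the illustrative figures from the question (roughly 100 cycles of latency, one new request every 10 cycles). It is not a model of any particular memory controller, and the function names are invented.

```python
def dependent_cycles(requests, latency):
    """Each request must finish before the next can be issued."""
    return requests * latency

def pipelined_cycles(requests, latency, issue_interval):
    """Independent requests: a new one is issued every `issue_interval` cycles."""
    if requests == 0:
        return 0
    return latency + (requests - 1) * issue_interval

def requests_in_flight(latency, issue_interval):
    """How many requests overlap once the pipeline is full (ceiling division)."""
    return -(-latency // issue_interval)

if __name__ == "__main__":
    print(dependent_cycles(1000, 100))        # 100000
    print(pipelined_cycles(1000, 100, 10))    # 10090
    print(requests_in_flight(100, 10))        # 10
```

Under these assumptions the pipelined case is about ten times faster, and it needs about ten requests outstanding at once to get there.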
Linked list traversal results in a series of requests that are all dependent on each other. If all the requests miss the caches and must go out to DDR, then the CPU's performance is bounded by the round trip latency to DDR, not the DDR's throughput. Take a look at the linked list benchmarks in Ulrich Drepper's paper, "What Every Programmer Should Know About Memory." (Specifically, go down to section 3.3.2 on page 20.) Pay particular attention to Figure 3.15, Sequential vs. Random Read (for a single thread), and also compare to Figure 3.21 which shows multi-threaded random accesses for 1, 2, and 4 threads.
The paper might be a little old (it uses a Pentium 4 for its benchmarks, after all), but the principles remain true. I should know... part of my day job is as a memory system architect. :-)
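
To show the difference in dependency structure (not in speed, since Python will not reproduce the memory behaviour), here is a small sketch with invented helper names. The Add loop can run its iterations in any order, while the pointer chase cannot even compute the address of the next load until the previous load has returned.

```python
import random

def stream_add(b, c, order=None):
    """STREAM-style Add: each a[i] depends only on b[i] and c[i]."""
    a = [0.0] * len(b)
    for i in (order if order is not None else range(len(b))):
        a[i] = b[i] + c[i]      # no dependence on any other iteration
    return a

def make_chain(n, seed=0):
    """Build a random single-cycle 'linked list' as an array of next indices."""
    order = list(range(1, n))
    random.Random(seed).shuffle(order)
    order = [0] + order
    nxt = [0] * n
    for here, there in zip(order, order[1:] + order[:1]):
        nxt[here] = there
    return nxt

def chase(nxt, steps, start=0):
    """Pointer chase: the index of load k+1 is the value returned by load k."""
    i = start
    for _ in range(steps):
        i = nxt[i]              # must complete before the next load can issue
    return i
```

Running several chases at once over different chains is the multi-threaded random access case: each chain is still serialized, but the chains overlap with each other.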

--
Program Intellivision!

http://venturebeat.com/2014/01/05/nvidia-announces by Anonymous Coward · 2014-01-06 00:24 · Score: 0

http://venturebeat.com/2014/01/05/nvidia-announces-tegra-k1-a-super-mobile-chip-with-192-cores/

The mind boggles by ThatsNotPudding · 2014-01-06 00:48 · Score: 1

as to how many NSA backdoors this will feature.

SGI by dwater · 2014-01-07 00:20 · Score: 1

I can't wait to see what SGI do with this chip :)

--
Max.

Better or not than classical Cray? by Anonymous Coward · 2014-01-07 11:46 · Score: 0

Cray made a brand name of being THE Super-Computer. Now it is out of the news, at least... Are these processors equivalent, better or still behind? Things started getting confusing after Pentium and the sudden increase in clock speeds, but it was clear those were not supercomputers at all. In the eternal struggle to TRULY dedicate a computer to do a single thing...

208 comments