A Co-processor No More, Intel's Xeon Phi Will Be Its Own CPU As Well
An anonymous reader writes "The Xeon Phi co-processor requires a Xeon CPU to operate... for now. The next generation of Xeon Phi, codenamed Knights Landing and due in 2015, will be its own CPU and accelerator. This will free up a lot of space in the server but more important, it eliminates the buses between CPU memory and co-processor memory, which will translate to much faster performance even before we get to chip improvements. ITworld has a look."
Moore's law is not coming back from the grave, or is it ?
Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
I thought that we already had GPUs embedded in CPUs. How embedding CPU inside GPU makes it so much different and breakthrough?
Knights Landing will be available as both an accelerator card and a standalone CPU with some sort of large high-speed memory pool on the die.
No kidding!!! What do you say at this point?
Good. The current generation Phi cards are a pain to administer. With luck the new generation will be more fully baked.
- very hot card, no fans
- depends on software to down throttle the cards (mine have hit 104C)
- stripped down OS running on the cards, poor user facing directions for the usage
Anyway, enough from me.
20 characters max for the password? How will I use my favorite poems as passwords?
For a Phi, the selling point is about ease of programming. The memory model of the accelerator card is a pain in the ass, making development more difficult. This on top of the fact that the administration of those are pretty limited and annoying. MPSS is crap for everyone, and one of the critical differences here is that the standalone accelerator might not require Intel to be the linux distribution curator anymore (they frankly suck pretty hard at it).
Intel having a standalone variant pretty much obviates the utility of an accelerator card model for all but perhaps the tiniest usage and makes things far more simpler. Trying to get the same workload to work across Phi and main CPUs is, in practice, more about trying to make the best of an awkward heterogeneous compute situation. While you still can and will run jobs heterogeneous if you do have both Phi and normal Xeon nodes (e.g. a top500 run and... well not much else), it is done using more typical methods of MPI.
In short, this move pretty much let's intel focus on the pieces they *are* good at (making a decent processor) and get away from the stuff they aren't so good at (pcie hosted device, linux distribution design, etc).
The 80486 was the first Intel processor with integrated coprocessor, coming at about €1000 (only know the DM price). There was a considerably cheaper version, the 80486SX "without" coprocessor (actually, the coprocessor was usually just disabled, possibly because of yield problems, and still took current).
One could buy an 80487 coprocessor that provided the missing floating point performance. Customers puzzled how the processor/coprocessor combination could be competitive without the on-chip communication of the 80486. The answer was that it did not even try. The "coprocessor" contained a CPU as well and simply switched off the "main" processor completely. It was basically a full 80486 with different pinout, pricing, and marketing.
It was probably phased out once the yields became good enough.
These processors are like an Intel version of Sun Niagara, but with wider vector. Actually, from an architectural perspective Xeon Phi (Larrabee) is pretty basic. They’re an array of 4-way SMT in-order dual-issue x86 processors, with 512-bit vector units. I think one of the major reasons Xeon Phi doesn’t compete well with GPUs on performance is that legacy x86 ISA translation engine taking up so much die area. Anyhow, so if you have a highly parallel algorithm, then Xeon Phi will be a boon for performance.
But as we know, there are numerous very important algorithms that are not parallelizable.
I could see using this, whereas I couldn't see myself using the card version. If the cost premium is reasonable this could be awesome for image processing. I have an image algorithm I use CUDA for and moving the data around consumes almost as much time as processing the data. If I had this in my servers I would have flexibility and much greater performance with this solution. --Robert
wouldn't an embedded in the cpu xeon phi version, lack the necessary GDDR4/5 which exists in the PCI-express card version with its 200-300GB/s of throughput, and be forced to just access the main computer RAM at about 40-50GB/s?
I don't know about Niagara's, but according to docs about Warps and half-warps, that's how Nvidia GPU run CUDA.
They keep cycling through 2 or 4 threads, to hide memory latency.
(Except that each thread it self runs on a wide SIMD instead of a normal CPU. So the final size of parallel execution [=threads] is the amount of wraps in parallel x size of the SIMD).
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]