Slashdot Mirror


Intel Announces New Enterprise Xeons, More Powerful Xeon Phi Cards

MojoKid writes "Intel announced a set of new enterprise products today aimed at furthering its strengths in the TOP500 supercomputing market. As of today, the Chinese Tianhe-2 supercomputer (aka Milky Way 2) is the fastest supercomputer on the planet at roughly 54 PFLOP/s. Intel is putting its own major push behind heterogeneous computing with the Tianhe-2: each node contains two Ivy Bridge sockets and three Xeon Phi cards. Each node therefore has 422.4 GFLOP/s of Ivy Bridge performance, while the Xeon Phi cards bring the node total to roughly 3.43 TFLOP/s. In addition, we'll see new Xeons based on this technology later this year: the 22nm E5-2600 V2 family, built on Ivy Bridge and offering up to 12 cores / 24 threads. The new Xeons, however, aren't really the interesting part of the story. Today, Intel is adding cards to the current Xeon Phi lineup: the 7120P, 3120P, 3120A, and 5120D. The 3120P and 3120A are the same card; the 'P' is passively cooled, while the 'A' integrates a fan. Both have 57 cores and 6GB of RAM. Intel states that they offer ~1 TFLOP/s of performance, which puts them on par with the 5110P that launched last year, but with slightly less memory and presumably a lower price point. At the top of the line, Intel is introducing the 7120P and 7120X; the 7120P comes with an integrated heat spreader, the 7120X doesn't. Clock speeds are higher on this card, and it has 61 cores instead of 60, 16GB of GDDR5, and 352GBps of memory bandwidth. Customers who need lots of cores and not much RAM can opt for one of the cheaper 3100 cards, while the 7100 family allows for much larger data sets."
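A back-of-the-envelope check of the per-node figures, as a minimal C++ sketch. The clock speeds and per-cycle throughputs are assumptions not stated in the summary: 12-core Ivy Bridge Xeons at 2.2 GHz doing 8 double-precision FLOP per cycle with AVX, and 57-core Xeon Phi cards at 1.1 GHz doing 16 per cycle with 512-bit fused multiply-add. The 16,000-node count comes from the Wikipedia quote further down the thread.

```cpp
// Theoretical peak (GFLOP/s) = cores * clock in GHz * FLOP per cycle per core.
// ASSUMED inputs (not in the summary): Ivy Bridge 12 cores @ 2.2 GHz, 8 DP FLOP/cycle (AVX);
// Xeon Phi 57 cores @ 1.1 GHz, 16 DP FLOP/cycle (512-bit FMA).
constexpr double peak_gflops(int cores, double ghz, int flops_per_cycle) {
    return cores * ghz * flops_per_cycle;
}

constexpr double ivy_bridge_gflops = 2 * peak_gflops(12, 2.2, 8);   // two sockets per node
constexpr double xeon_phi_gflops   = 3 * peak_gflops(57, 1.1, 16);  // three cards per node
constexpr double node_gflops       = ivy_bridge_gflops + xeon_phi_gflops;

// 16,000 nodes, converted from GFLOP/s to PFLOP/s.
constexpr double system_pflops = node_gflops * 16000 / 1.0e6;
```

Under those assumptions a node comes to about 3.43 TFLOP/s, and 16,000 nodes to about 54.9 PFLOP/s of theoretical peak, in line with the ~54 PFLOP/s figure above; sustained benchmark numbers are lower than peak.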

57 comments

  1. Programmers will be happy. by SuricouRaven · · Score: 4, Interesting

    The x64 Phi cards are a lot easier to program than GPUs. No need to jump through hoops with memory mapping, keep things in sync for SIMD processing, or worry about running out of stack space when doing recursion.

    1. Re:Programmers will be happy. by Anonymous Coward · · Score: 4, Interesting

      If you are an assembly junkie, I guess you are right. But I much prefer the implicitly vectorized CUDA programming model to writing vector intrinsics by hand. If you want to avoid explicit data transfers, take a look at https://code.google.com/p/adsm/. Moreover, the performance of current Xeon Phi boards is not on par with Kepler GPUs. But, finally, NVIDIA is facing some competition.

    2. Re:Programmers will be happy. by mwvdlee · · Score: 1

      How does the performance measure up to GPUs for TFLOPS/$$$?

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    3. Re:Programmers will be happy. by CSMoran · · Score: 1

      Does Intel's MKL support the Phis out of the box? It would be very convenient if, instead of having to re-write code for them, we could just use phi-capable BLAS and LAPACK.

      --
      Every end has half a stick.
    4. Re:Programmers will be happy. by Anonymous Coward · · Score: 0

      I think this really depends on what you're currently using a GPU for.

      If you're in one of these awkward emerging fields where people have jumped on the stream-processing bandwagon because the GPU buzzword said so, sure, going back to a more conventional processing architecture certainly makes sense.

      But most people using GPUs are actually doing soft-real-time signal processing or some other sort of highly parallel, short-lived processing that's ideal for stream-processing architectures, which don't deal with massive amounts of data that exceed the memory limits by orders of magnitude (which is roughly the point where it becomes a mild pain), deep stacks, or even highly divergent code paths.

      Apples and Oranges.

      Now, when Intel comes out with a hybrid approach that exposes multiple architecture paradigms in one (like AMD is doing), with a unified ISA and dynamic resource allocation between sequential general-purpose processors, mid-ground parallel SIMD, and massively parallel SIMT threads all running that ISA, I'll get excited.

      Until then, meh: x86 on a card with high-bandwidth/high-latency RAM, yay...

    5. Re:Programmers will be happy. by dargaud · · Score: 3, Informative
      --
      Non-Linux Penguins ?
    6. Re:Programmers will be happy. by CSMoran · · Score: 1

      Excellent. Thank you.

      --
      Every end has half a stick.
    7. Re:Programmers will be happy. by _merlin · · Score: 1

      The "no-work" option is only useful if the bulk of the time in your code is in well-known algorithms that are implemented in Intel's library. Even going up to the "minimal work with Intel compiler" approach will require you to wrangle vector intrinsics manually to take advantage of these cores.

    8. Re:Programmers will be happy. by pla · · Score: 2

      How does the performance measure up to GPUs for TFLOPS/$$$?

      If you need double precision FP, you don't have a lot of alternatives.

      If you only need single or half precision, the Radeon 7990 rates at 7x the TFlops for about 15% of the price.

    9. Re:Programmers will be happy. by Anonymous Coward · · Score: 0

      If that's true: the Radeon 7990 can do double precision at 1/4 the speed of single precision, so that would be 1.75x the TFLOPS in double precision for 15% of the price.
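      Turning the two posts' claims into TFLOPS-per-dollar ratios, as a minimal sketch. The only inputs are the ratios claimed above (7x the TFLOPS, 15% of the price, double precision at 1/4 of single precision); they are the posters' figures, not measured prices or benchmarks.

```cpp
// Relative value = relative performance / relative price (Radeon 7990 vs. Xeon Phi card).
constexpr double perf_per_dollar_ratio(double perf_ratio, double price_ratio) {
    return perf_ratio / price_ratio;
}

constexpr double sp_perf_ratio = 7.0;                  // claimed: 7x the TFLOPS
constexpr double dp_perf_ratio = sp_perf_ratio / 4.0;  // 1.75x if DP runs at 1/4 the SP rate
constexpr double price_ratio   = 0.15;                 // claimed: 15% of the price

constexpr double sp_value = perf_per_dollar_ratio(sp_perf_ratio, price_ratio);  // roughly 46.7x
constexpr double dp_value = perf_per_dollar_ratio(dp_perf_ratio, price_ratio);  // roughly 11.7x
```

Taken at face value, the claims work out to roughly 47x the TFLOPS per dollar at the claimed precision and roughly 12x in double precision.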

    10. Re:Programmers will be happy. by Anonymous Coward · · Score: 0

      On par for what? These can do a heck of a lot more than Kepler cards.

    11. Re:Programmers will be happy. by JanneM · · Score: 2

      Here's a preliminary "best practice" guide: http://www.prace-project.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML?lang=en

      Seems OpenMP and Open MPI are both available, so typical hybrid MPI/OpenMP codes should at least run out of the box, though you'll of course need a fair bit of tuning to make full use of the thing. It should be less work than adapting a code to run on a GPU, though.
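      As a minimal illustration of the kind of ordinary OpenMP code that should carry over, here is a parallel dot product using a standard OpenMP reduction. Nothing in it is Phi-specific. My understanding is that native builds for the card were made with the Intel compiler's -mmic option; treat that as an assumption and check the guide above.

```cpp
#include <cassert>
#include <vector>

// Dot product parallelised with a standard OpenMP reduction. Built with -fopenmp (or your
// compiler's equivalent) the loop is split across threads; without it the pragma is ignored
// and the loop runs serially, giving the same result up to floating-point rounding order.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```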

      --
      Trust the Computer. The Computer is your friend.
    12. Re:Programmers will be happy. by Steve_Ussler · · Score: 1

      Whatever happened to AMD?

    13. Re:Programmers will be happy. by CSMoran · · Score: 1

      Yes, "well-known algorithms" is my use case: massive LAPACK generalized diagonalizations that take forever on a single CPU, almost forever when threaded with OpenMP-capable BLAS across, say, 8 cores, and that do not scale at all to distributed-memory clusters (ScaLAPACK with MPI) because communication becomes the bottleneck.

      Thus I'm hoping for a solution where the vendor itself wrangles those intrinsics in its BLAS or LAPACK implementation in MKL, leaving me oblivious to all that mess. Assuming the computation time scales as O(N^3) and the memory transfer over the bus scales as O(N^2), with the prefactors in my favour, I should be able to squeeze out a significant performance boost.
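      The scaling argument can be made concrete: if compute time grows as N^3 and bus-transfer time as N^2, their ratio grows linearly in N, so past some matrix size the transfer cost is amortized. A minimal sketch; the prefactors c_compute and c_transfer are hypothetical placeholders that would have to be measured.

```cpp
// Ratio of on-card compute time to bus-transfer time for an N x N dense problem, assuming
// compute ~ c_compute * N^3 and transfer ~ c_transfer * N^2 (HYPOTHETICAL prefactors).
constexpr double compute_to_transfer_ratio(double n, double c_compute, double c_transfer) {
    return (c_compute * n * n * n) / (c_transfer * n * n);  // = (c_compute / c_transfer) * n
}

// Size at which compute time equals transfer time: N* = c_transfer / c_compute.
// Above N*, computation dominates and offloading amortizes the bus cost.
constexpr double break_even_n(double c_compute, double c_transfer) {
    return c_transfer / c_compute;
}
```

For example, with made-up prefactors of 1e-9 and 1e-6 the break-even size is about N = 1000, and at N = 10,000 compute takes about ten times as long as the transfer.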

      --
      Every end has half a stick.
    14. Re:Programmers will be happy. by bill_mcgonigle · · Score: 1

      which don't deal with massive amounts of data that exceed the memory limits by orders of magnitude (which is roughly the point where it becomes a mild pain)

      I was talking to an HPC friend this weekend at the ice cream parlor, and he was telling me how their problem gained nothing from GPU processing because they were really memory-bound, not compute-bound.

      He has quad-data-rate InfiniBand going into each machine (40Gbps) and a couple of CPUs, and keeps them saturated (say 5Gbps per core).

      Looking at TFA's expansion card, with a memory bandwidth of 320Gbps and 60 cores, that's only 4Gbps per core and what's worse, you can't push that much over the PCIe 3.0 bus (only 16GBps).

      What they really need is a card with 8 cores and an InfiniBand controller right on the die, with DMA from one to the other. Then you could fill a housekeeping box chock-full of slots with these things and only worry about pushing setup code and managing jobs over the PCIe bus from the mainboard. There's a market niche that needs filling, hardware dudes.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    15. Re:Programmers will be happy. by Anonymous Coward · · Score: 0

      If you are an assembly junkie I guess you are right.

      Can't stand coding in assembly. But what else is there? If you're not coding assembly, you're not really a programmer, but some kind of quasi-markup operator.

    16. Re:Programmers will be happy. by Anonymous Coward · · Score: 0

      If you're not coding LISP you're not really a programmer, but some kind of bit twiddler.

    17. Re:Programmers will be happy. by Anonymous Coward · · Score: 0

      Measure the GFLOPS per dollar of a $5,000 Xeon Phi card against a $200 Radeon card and ask that question again.

    18. Re:Programmers will be happy. by Anonymous Coward · · Score: 0

      Then realize that the Phi runs general-purpose instructions and the Radeon doesn't. I thought the geek-card carriers on Slashdot understood that; stop comparing them as if they were the same thing.

    19. Re:Programmers will be happy. by hairyfeet · · Score: 2

      Correct me if I'm wrong but I thought Nvidia had joined in supporting OpenCL so things were gonna be heading in that direction?

      That said, it's probably smart for Intel and AMD (who I read will soon have their hybrid x86/ARM and ARM+Radeon chips out) to concentrate on the server space, as x86 chips have been so insanely powerful for the last several years that the consumer and SMB markets have more power than they know what to do with. The simple fact is that the software just hasn't kept up with the hardware, so you have all these multicores twiddling their thumbs, and why would you buy faster when you aren't even stressing the one you have now?

      At least in the server space those guys can always use more speed per watt, and their programs aren't as single-thread-heavy as those in the consumer and SMB space, so it's a smart move.

      --
      ACs don't waste your time replying, your posts are never seen by me.
    20. Re:Programmers will be happy. by _merlin · · Score: 1

      Yeah, sure. I'm glad it works for your use case, and I'm sure it's great for a lot of others, too. Unfortunately it doesn't work for me - there will never be an off-the-shelf library for vol models developed in-house.

    21. Re: Programmers will be happy. by Anonymous Coward · · Score: 0

      Can't we just put back the 80387 socket and move on with our lives?

      Wake me up when I can use these cards as compute resources in vSphere.

    22. Re:Programmers will be happy. by Anonymous Coward · · Score: 0

      The Xeon Phi is a number-crunching device, and not really a general purpose processor. The Radeon is a number-crunching device, and not just a dedicated graphics processor. Where GFLOPS matter (and especially where GFLOPS per dollar matter), there isn't a lot you can't do with today's GPUs and APIs.

      So it's definitely worth comparing them as if they were both number-crunching devices, because that's what they are. And Radeons from several years ago still outperform Xeon Phis from the near future. You don't even need a Xeon Phi to do that math.

      Oh - you can still turn in your geek card by mail, if you like.

  2. I do a lot of CGI Rendering by Silpher · · Score: 1

    Will this be interesting for me? Price/value-wise?

    1. Re:I do a lot of CGI Rendering by Anonymous Coward · · Score: 0

      Depending on what kind of software you use it might be. Look at luxrender.net which you can already use with GPU acceleration.

  3. Some SIMD required by Ottibus · · Score: 2

    You won't get full performance from a Xeon Phi without using the SIMD instructions, so it is not as easy to program as you might hope.

    1. Re:Some SIMD required by robthebloke · · Score: 2

      ispc, OpenCL, and LLVM on the way. Failing that, you could of course use C++ and AVX intrinsics (which would be a good choice if you already have a load of SSE4/AVX optimised code lying about).

    2. Re:Some SIMD required by Anonymous Coward · · Score: 0

      The Phis have 512-bit vector instructions using zmm registers.
      I don't think you can hit full-tilt boogie with SSE4/AVX.

    3. Re:Some SIMD required by Ottibus · · Score: 1

      ispc, OpenCL, and LLVM on the way. Failing that, you could of course use C++ and AVX intrinsics (which would be a good choice if you already have a load of SSE4/AVX optimised code lying about).

      Having to use specialist languages like ispc to get performance does not support the claim that Xeon Phi is "a lot easier to program than GPUs". OpenCL is no easier to write on x64 than on a GPU, and is arguably harder. And you certainly can't rely on LLVM (or any compiler) to turn your scalar code into high-performance optimised vector code without a significant amount of work.

      So the original claim that "x64 Phi cards are a lot easier to program than GPUs" needs a lot more evidence before it will stand up.

    4. Re:Some SIMD required by robthebloke · · Score: 2

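      #include <immintrin.h>  // __m128/__m256/__m512 and the _mm*_add_ps intrinsics

      struct vec3_FPU { float x, y, z; };
      struct vec3_SSE { __m128 x, y, z; };
      struct vec3_AVX { __m256 x, y, z; };
      struct vec3_PHI { __m512 x, y, z; };

      // Low-level overloads, one per register type: these are what you re-implement per target.
      inline float  add(float a, float b)   { return a + b; }
      inline __m128 add(__m128 a, __m128 b) { return _mm_add_ps(a, b); }
      inline __m256 add(__m256 a, __m256 b) { return _mm256_add_ps(a, b); }
      inline __m512 add(__m512 a, __m512 b) { return _mm512_add_ps(a, b); }

      // Generic component-wise add, written once and reused for every vec3 type.
      template<typename T>
      T add(const T& a, const T& b)
      {
          T r;
          r.x = add(a.x, b.x);
          r.y = add(a.y, b.y);
          r.z = add(a.z, b.z);
          return r;
      }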

      Porting existing SSE4/AVX code to Phi is usually just a case of changing a typedef (or template type param), and overloading a bunch of low level functions (e.g. add, sub, etc). If it's not that simple for you, I'd suspect you may be doing it wrong. Porting from one to the other should only take a day or two at most.
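      A portable sketch of the same "swap the lane type" idea, so it compiles without intrinsics or special hardware. lane4 is a hypothetical stand-in for a 4-wide SIMD register such as __m128; in real code the overloads would wrap the intrinsics instead of looping.

```cpp
#include <array>
#include <cstddef>

// HYPOTHETICAL portable stand-in for a 4-wide float SIMD register (plays the role of __m128).
struct lane4 { std::array<float, 4> v; };

// Low-level overloads: one per "register" type.
inline float add(float a, float b) { return a + b; }
inline lane4 add(const lane4& a, const lane4& b) {
    lane4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// One generic vec3, written once; changing the lane type is the whole "port".
template <typename Lane>
struct vec3 { Lane x, y, z; };

template <typename Lane>
vec3<Lane> add(const vec3<Lane>& a, const vec3<Lane>& b) {
    return {add(a.x, b.x), add(a.y, b.y), add(a.z, b.z)};
}

using vec3_scalar = vec3<float>;  // plays the role of vec3_FPU
using vec3_wide   = vec3<lane4>;  // plays the role of vec3_SSE / vec3_AVX / vec3_PHI
```

The generic add never changes; only the lane type and its handful of low-level overloads do.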

    5. Re:Some SIMD required by Ottibus · · Score: 1

      This supports the argument that porting SSE/AVX code to Xeon Phi is easier than porting SSE/AVX code to a GPU. It does little to support the original claim that "x64 Phi cards are a lot easier to program than GPUs", which is more general and seems to be about writing new code rather than porting.

  4. What about more cores for us mortals? by Anonymous Coward · · Score: 0

    After Intel tantalized us with an 80-core research CPU umpteen years ago, ordinary consumers have been stuck with single-digit core counts seemingly forever.

    I was expecting 32 cores minimum in desktop CPUs by the start of this decade. All this new supercomputer stuff is well and good, but what about lots of cores for us mere mortals too?

    How about it Intel? You stopped raising the clock speed and said we'd get lots of cores instead. It hasn't happened. Get with the programme please.

    1. Re:What about more cores for us mortals? by bill_mcgonigle · · Score: 1

      I was expecting 32 cores minimum in desktop CPUs by the start of this decade. All this new supercomputer stuff is well and good, but what about lots of cores for us mere mortals too?

      You wouldn't like the speed of typical software on a 32-core CPU using the same transistor count (i.e. at the same cost) as the machine you're running now.

      Cache sharing, NUMA access, etc. turn out to be tricky to get fast, right, and cheap. In the meantime, much of the existing software library can't even properly take advantage of a 6-core desktop chip, so all mere mortals would get today from a 32-core chip would be a slower machine.
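      Amdahl's law is one standard way to put numbers on this; the post doesn't cite it, and the "each of 32 cores runs at half speed for the same transistor budget" factor used below is a hypothetical illustration, not a measured figure.

```cpp
// Amdahl's law with a per-core speed factor: a program whose fraction p parallelises
// perfectly runs in time ((1 - p) + p / n) / s on n cores that are each s times as fast
// as a reference core. Results are relative to one reference core (1.0 = same speed).
constexpr double relative_runtime(double p, int n, double s) {
    return ((1.0 - p) + p / n) / s;
}
```

With p = 0.5, six full-speed cores give a relative runtime of about 0.58, while 32 half-speed cores give 1.03, i.e. slower. Only when p is close to 1 do the 32 slower cores win (0.0625 versus about 0.17 at p = 1).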

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    2. Re:What about more cores for us mortals? by Anonymous Coward · · Score: 0

      Uh... that's what these Xeon Phi cards are. Lots of cores. FYI, that 80-core research chip wasn't x86.

    3. Re:What about more cores for us mortals? by Fyzzler · · Score: 1

      Uh... that's what these Xeon Phi cards are. Lots of cores. FYI, that 80-core research chip wasn't x86.

      Actually, Larrabee was exactly 80 486DX cores on one die. They just couldn't figure out how to get them to do useful work (they were thinking graphics processing, of all things), so they rethought their approach and canceled that project.

      --
      I have one question. If the Japanese Ministry of Agriculture is not in charge of Gundam, then who is?
  5. Tianhe-2 by Anonymous Coward · · Score: 0

    http://en.wikipedia.org/wiki/Tianhe-2
    "There are 16,000 compute nodes, each comprising two Intel Ivy Bridge Xeon processors and three Xeon Phi chips for a total of 3,120,000 cores."

    > Intel Announces New Enterprise Xeons, More Powerful Xeon Phi Cards

    Those Chinese scientists are going to be so pissed. :P
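    The quoted 3,120,000-core total can be checked with a quick sketch, assuming 12 cores per Ivy Bridge Xeon and 57 cores per Xeon Phi card (neither per-chip figure is in the quote itself).

```cpp
// Total cores = nodes * (CPUs per node * cores per CPU + Phi cards per node * cores per card).
// ASSUMED per-chip counts: 12 cores per Ivy Bridge Xeon, 57 cores per Xeon Phi card.
constexpr long total_cores(long nodes, int cpus, int cores_per_cpu, int phis, int cores_per_phi) {
    return nodes * (cpus * cores_per_cpu + phis * cores_per_phi);
}

constexpr long tianhe2_cores = total_cores(16000, 2, 12, 3, 57);  // 16,000 nodes * 195 cores
```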

  6. Why bother? by pla · · Score: 1

    In addition, we'll see new Xeons based on this technology later this year, in the 22nm E5-2600 V2 family, with up to 12 cores.

    ...And yet, because of corporate policies on running the shittiest AV on the planet (Symantec) cranked to the max, my desktop PC will still have the responsiveness of a sloth on 'ludes.

    Seriously, I already have 8 cores worth of Xeon (2x4) and the load meter never even twitches, enough RAM to load my entire system drive into, and an SSD system drive. More cores won't help at this point.

    1. Re:Why bother? by Anonymous Coward · · Score: 0

      ...And yet, because of corporate policies on running the shittiest AV on the planet (Symantec) cranked to the max, my desktop PC will still have the responsiveness of a sloth on 'ludes.

      Here's a nickel, kid, go buy yourself a better OS.

    2. Re:Why bother? by Anonymous Coward · · Score: 0

      I'm unclear as to why you have an 8-core rig if you aren't using a heavily parallel workload. Perhaps you should fire your IT department for their purchasing decisions.

    3. Re:Why bother? by Muad'Dave · · Score: 2

      Here's a nickel, kid, go buy yourself a better OS.

      Here's the best part - after 'buying' that better OS, you'll still have the nickel!

      --
      Tiller's Rule: Never use a word in written form that you've only heard and never read. You will end up looking foolish.
    4. Re:Why bother? by NatasRevol · · Score: 1

      Because that's a mid range machine these days.

      --
      There are two types of people in the world: Those who crave closure
    5. Re:Why bother? by Anonymous Coward · · Score: 0

      You need to leave your nerd cave basement if you believe his specs represent anything remotely similar to a 'mid range machine'. That bad case of no-sex might cure right up as well.

  7. How many "Intel Inside" stickers on Tianhe-2? by elwinc · · Score: 4, Funny
    How many "Intel Inside" stickers will they be posting on Tianhe-2? I can see an argument for a mere 16,000 (one per node), 32,000 (one per Ivy Bridge chip), or 80,000 (one per Intel-core-carrying chip). But I think Intel's marketing dept should hold out for 3.12 million stickers: one per core!

    It's too bad Thinking Machines Incorporated never had a sticker policy, because the "Fat Tree" routing topology is straight out of TMI (the prior TMI topology, hypercube, didn't allow the customer as much choice to balance cores vs interconnect).

    --
    --- Often in error; never in doubt!
  8. It's a gas! by Impy+the+Impiuos+Imp · · Score: 4, Funny

    Xeon, Itanium. I think I've figured out the real genius at Intel.

    1. Pick a cool element.
    2. Remove a letter.
    3. ?????
    4. Profit!!!

    2015 Arbon
    2018 Heliu
    2023 Litium
    2024 Silion
    2026 Eon

    --
    (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
    1. Re:It's a gas! by Anonymous Coward · · Score: 0

      I'm betting the 2026 chip is a Neo.

    2. Re:It's a gas! by Anonymous Coward · · Score: 0

      Xeon, Itanium. I think I've figured out the real genius at Intel.

      1. Pick a cool element.
      2. Remove a letter.
      3. ?????
      4. Profit!!!

      They took four letters off of Ununpentium.

  9. Xeon Phi=AltiVec? by Anonymous Coward · · Score: 0

    Haven't run into the Phi moniker before. Is this essentially just a souped-up/shrunk version of the AltiVec/Velocity Engine SIMD processor Motorola unveiled in the PowerPC G5? How is it both SIMD and x86 at the same time?

    1. Re:Xeon Phi=AltiVec? by Anonymous Coward · · Score: 0

      http://lmgtfy.com/?q=Xeon+Phi

    2. Re:Xeon Phi=AltiVec? by elwinc · · Score: 2
      Nope.

      AltiVec was Motorola's 1999 SIMD instruction set and hardware, a response to the SIMD instructions and hardware released by AMD in 1998 (AMD called theirs 3DNow!). Intel also released SIMD instructions and hardware in 1999, called SSE. AltiVec and SSE were both 128-bit-wide pipes that could handle 4 single-precision floating-point operations simultaneously; 3DNow! worked on 64-bit registers, so it handled 2. Some of them may also have been able to do two double-precision operations at once (not AltiVec, though), and they all did various integer ops in parallel too.

      Xeon Phi is a chip that contains around 60 independent specialized Intel x86 cores, plus caches and ring buses for the cores to communicate with each other. The core count is inexact, probably because Intel is figuring out how many dead cores a chip can have and still ship as a complete part. Each of the 60 or so specialized cores has a 512-bit-wide pipe that can do 16 parallel single-precision floating-point operations or 8 parallel double-precision ones. Calling it a "pipe" means a new instruction and its data can be issued every clock cycle, with a number of instructions "in flight" streaming down the pipeline and results issuing out of the bottom every clock. The pipe is a fused multiply-add architecture (useful for vector dot products), so theoretically, every clock cycle, the core could issue 16 single-precision multiplies and 16 single-precision adds, a total of 32 flops per clock per core. Most high-performance computing uses double precision, so cut that 32 in half, and multiply 16 flops per clock by 60 cores by about 1.2GHz to get about 1.2 DP teraflops (theoretical) per Xeon Phi chip. Actual flops will be considerably lower if the problem doesn't fit well in cache.
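      The same reasoning as a minimal sketch: lanes = vector width / element width, and a fused multiply-add counts as 2 FLOPs. The 60 cores and 1.2 GHz are the post's round numbers, not the specs of a particular card.

```cpp
// FLOP per cycle per core from vector width: one FMA per lane counts as 2 FLOPs.
constexpr int flops_per_cycle(int vector_bits, int element_bits, bool fma) {
    return (vector_bits / element_bits) * (fma ? 2 : 1);
}

// Theoretical peak in TFLOP/s = cores * GHz * FLOP per cycle per core / 1000.
constexpr double peak_tflops(int cores, double ghz, int flops_per_cycle_per_core) {
    return cores * ghz * flops_per_cycle_per_core / 1000.0;
}
```

With 512-bit vectors this gives 32 FLOP/cycle in single precision and 16 in double; 60 cores at 1.2 GHz then come to about 1.15 DP TFLOP/s, which rounds to the post's "about 1.2".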

      The bottom half of this article has a nice overview of Xeon Phi specs.

      --
      --- Often in error; never in doubt!
  10. nah by nten · · Score: 1

    2015 Ron
    2018 Aluminum

    --
    refactor the law, its bloated, confusing and unmaintainable.
    1. Re:nah by ColdWetDog · · Score: 1

      2013 Old

      --
      Faster! Faster! Faster would be better!
  11. Export controls by Anonymous Coward · · Score: 0

    What are the export controls nowadays? Interconnects?

  12. no it's not by Chirs · · Score: 1

    An 8-core Xeon (not i7) is not a mid-range desktop. Nor is "enough RAM to load my entire system drive into", or an SSD system drive.

    Even now, those are all higher-end in the general scheme of things. More common on enthusiast machines, sure, but far from "mid-range" in a business system.

  13. Re:Some SIMD required by robthebloke · · Score: 1

    Well, if you've got an Nvidia card + Xeon (which happens to be what I have available at work), then any newly written code is going to be in OpenCL or LLVM IR (via C++ or a custom language). If you're going that route, any code you write will more or less work on Phi with little modification (although I have not got a Phi on which I can actually test my hypothesis, so I may be talking BS!). So in theory at least, it won't be any harder to write code for Phi than for Nvidia/AMD. The thing that appeals to me most about Phi is simply the slightly less restrictive way you can address memory, code, and the CPU cores. GPUs were originally designed for a more or less one-way process: you throw geometry data from the CPU to the GPU, and the GPU throws it on the screen. Whilst GPUs are much more general purpose these days, they do still display that heritage in the occasional moment where you realise "damn, I'm unable to access that memory here", or "damn, I have to split this process into two separate ones because the hardware says so".

  14. Re:Odium by Anonymous Coward · · Score: 0

    Rypton
    Adium

    and then the portable line

    Roton
    Neuton
    Eectron

    and of course the high-speed

    Phoon

    I likes it!

  15. 12 core Xeon by Plumpaquatsch · · Score: 1
    --
    Of course news about a fake are Fake News.