Slashdot Mirror


Next-Gen Intel Chip Brings Big Gains For Floating-Point Apps

An anonymous reader writes "Tom's Hardware has published a lengthy article and a set of benchmarks on the new "Haswell" CPUs from Intel. It's just a performance preview, but it isn't just more of the same. While it's got the expected 10-15% faster for the same clock speed for integer applications, floating point applications are almost twice as fast, which might be important for digital imaging applications and scientific computing." The serious performance increase has a few caveats: you have to use either AVX2 or FMA3, and then only in code that takes advantage of vectorization. Floating point operations using AVX or plain old SSE3 see more modest increases in performance (in line with integer performance increases).

27 of 176 comments

  1. Let's see... by bluegutang · · Score: 5, Funny

    " Next-Gen Intel Chip Brings Big Gains For Floating-Point Apps "

    How much of a gain? More or less than 0.00013572067699?

    1. Re:Let's see... by 0100010001010011 · · Score: 5, Informative

      It's a joke. The Intel P5 Pentium FPU had a bug where

      4195835/3145727 = 1.333739068902037589. The correct answer is 1.333820449136241002.
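
For anyone who wants to check the arithmetic on a modern machine, here is a quick sketch. The flawed quotient is simply the figure quoted above, hard-coded; nothing here reproduces the bug itself, and the variable names are made up for illustration.

```python
# Sanity check of the Pentium division example quoted above.
numerator = 4195835
denominator = 3145727

correct = numerator / denominator   # what a working FPU returns
flawed = 1.333739068902037589       # value quoted above, hard-coded (not computed)

# x - (x / y) * y should be ~0 when the quotient is right.
residual_correct = numerator - correct * denominator
residual_flawed = numerator - flawed * denominator
relative_error = (correct - flawed) / correct

print(f"correct quotient:              {correct:.15f}")
print(f"residual with correct quotient: {residual_correct:.6f}")
print(f"residual with flawed quotient:  {residual_flawed:.3f}")
print(f"relative error of flawed value: {relative_error:.2e}")
```

The residual is the handy part: multiplying the quotient back by the divisor makes an error in the fifth significant digit show up as a difference of a few hundred.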

    2. Re:Let's see... by kimvette · · Score: 2

      Oh right, that bug an Intel rep laughably claimed one would only encounter once every 2,500 years or so. I'd forgotten about that.

      --
      The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
  2. Hope it's going in the new Mac Pro by GlobalEcho · · Score: 3, Interesting

    I hope there's really a new Mac Pro coming and that it has these chips in it! I do a heck of a lot of PDE solving, statistics and simulations, and would love to have a screamin' machine again.

    1. Re:Hope it's going in the new Mac Pro by Anonymous Coward · · Score: 5, Insightful

      Do you really need a Mac for that? If not, it seems you're limiting your potential by having to wait for the holy artifacts to be released.

    2. Re:Hope it's going in the new Mac Pro by semi-extrinsic · · Score: 5, Interesting

      If you're doing numerics, what the fuck (if you'll pardon my French) are you doing buying Apple? I'm working on two-phase Navier-Stokes solvers myself, and I just bought a new rig consisting of 3 boxes each with an Intel Core i7 @ 3.7 GHz, 12 GB RAM, an SSD drive and a big-ass cooling system. In total that cost less than the Mac Pro with a single Core i7 @ 3.3 GHz listed in that article. You're paying 3x more than you should, and you get what extra? A shiny case? Puh-lease.

      --
      for i in `facebook friends "=bday" 2>/dev/null | cut -d " " -f 3-`; do facebook wallpost $i "Happy birthday!"; done
    3. Re:Hope it's going in the new Mac Pro by spire3661 · · Score: 3, Interesting

      Why not just do that on real workstation hardware and tap into it remotely?

      --
      Good-bye
    4. Re:Hope it's going in the new Mac Pro by mozumder · · Score: 2

      The Mac Pros use Xeon chips, which are usually updated about 1 year after the mainstream Core processors are out.

    5. Re:Hope it's going in the new Mac Pro by Aardpig · · Score: 3, Insightful

      Erm -- ECC memory is slower than non-ECC memory, I think.

      --
      Tubal-Cain smokes the white owl.
    6. Re:Hope it's going in the new Mac Pro by washu_k · · Score: 5, Informative

      The Core i7's are consumer-grade processors and are slower than the Xeon's the Mac Pros use

      This is completely incorrect. The current Mac Pros use Nehalem based Xeons which are two generations back from the current Ivy Bridge i7s. Xeons may have differences in core count, cache and/or ECC support but their execution units are the same as their desktop equivalents. The base Mac Pro CPU is equivalent to an i7-960 with ECC support. The current Ivy Bridge i7s are a fair bit faster.

    7. Re:Hope it's going in the new Mac Pro by KonoWatakushi · · Score: 5, Informative

      ECC memory is only marginally slower. Considering error rates and modern memory sizes, it is far past time that it became a standard feature. The extra cost would be totally insignificant if it were standard, and not used as an excuse to gouge people on Xeons.

    8. Re:Hope it's going in the new Mac Pro by LordLimecat · · Score: 2

      You're paying at least double for the same hardware on a Mac. The Mac cited in the article has 2x 6-core Xeons @ 2.4 GHz. Those (assuming E5645s) can be had for ~$575 each, with a motherboard at ~$275. Everything else is pocket change; a whole rig with SSDs etc. could be had for under $1700.

      But I'm sure someone somewhere will explain why the aluminum makes the extra $2000 for the Mac worth it.

    9. Re:Hope it's going in the new Mac Pro by epyT-R · · Score: 2

      It depends. Depending on the generation of xeon, you pay for the privilege of some combination of ECC RAM/cache, more cache, and multisocket capability. In many cases (like the pentium 4 era), you got a p4 with more cache that wasn't much faster than the desktop variant, even with 'enterprise' loads like databases! In the pentium 3 xeon days, you got marginal benefits with the extra cache, yet paid A LOT more for the hardware. With Xeon, the performance boost rarely justified the cost. Intel knew this, so that's why, these days, multisocket capability is a xeon exclusive: to make you pay dearly for that privilege.

      Obviously, if you truly need these features you'll have no choice but to pay up, but these chips' failure rates and performance are no different from those of the consumer models of the same design at a given clockspeed. They're built on the same manufacturing technology and it is unlikely that Intel bins either variant beyond the clockspeeds and TDP stamped on the box. While I don't deny that some critical systems need things like ECC, your post reads like a typical arrogant mac user perspective: someone desperate for social exclusivity trying to justify his overexpenditure.

  3. Might be important, but probably not... by MasseKid · · Score: 4, Interesting

    For problems that need floating point AND are not multithread friendly AND need large computing power AND are specially coded, this will be of great use. However, most massive computing problems like this are multithread friendly, and this will still be roughly an order of magnitude short of the speeds you can get by using a GPU.

    1. Re:Might be important, but probably not... by semi-extrinsic · · Score: 3, Insightful

      The good thing about manufacturers speeding up SSE/AVX/etc. is that the linear algebra libraries (specifically the ATLAS implementation of BLAS and LAPACK) usually release code that makes use of the new hawtness within about six months of release. Do you know how much software relies on BLAS and LAPACK for speed?

      --
      for i in `facebook friends "=bday" 2>/dev/null | cut -d " " -f 3-`; do facebook wallpost $i "Happy birthday!"; done
    2. Re:Might be important, but probably not... by Bengie · · Score: 2

      http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

      An important component of the Intel Xeon Phi coprocessor’s core is its vector processing unit (VPU), shown in Figure 5. The VPU features a novel 512-bit SIMD instruction set, officially known as Intel® Initial Many Core Instructions (Intel® IMCI). Thus, the VPU can execute 16 single-precision (SP) or 8 double-precision (DP) operations per cycle. The VPU also supports Fused Multiply-Add (FMA) instructions and hence can execute 32 SP or 16 DP floating point operations per cycle. It also provides support for integers.
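
The per-cycle figures in that quote follow directly from the register width. A minimal sketch of the arithmetic, with function names invented for illustration and an FMA counted as two floating point operations per lane:

```python
# Per-cycle throughput implied by a SIMD width, using the quoted Xeon Phi numbers.

def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits

def flops_per_cycle(vector_bits: int, element_bits: int, fma: bool) -> int:
    """One operation per lane, or two (multiply and add) when counted as an FMA."""
    per_lane = 2 if fma else 1
    return lanes(vector_bits, element_bits) * per_lane

for name, bits in (("SP", 32), ("DP", 64)):
    plain = flops_per_cycle(512, bits, fma=False)
    fused = flops_per_cycle(512, bits, fma=True)
    print(f"{name}: {plain} ops/cycle, {fused} flops/cycle with FMA")
```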

  4. Re:Would that improve hashing speeds in, say, Bitc by slashmydots · · Score: 4, Informative

    Slightly, but haven't you been keeping up on the latest hardware? My pair of Sapphire 5830 graphics cards would top off at about 435MH/s at a total system wattage of around 520W. The new Jalapeno chips from Butterfly Labs will do 4500 MH/s using 2 watts total system power. For comparison, my i5-2400 performed 14MH/s at 95W or so. So the Jalapeno is about 321x faster and draws about 47x less power, so combined, I believe that's roughly 15,268x more efficient.
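
The comparison can be redone as hashes per joule, which is what the combined figure measures. This is only a sketch using the hash rates and wattages quoted in the comment; the dictionary and function names are made up here.

```python
# Hash rate (MH/s) and power draw (W) as quoted in the comment above.
rigs = {
    "2x Sapphire 5830": (435.0, 520.0),
    "i5-2400": (14.0, 95.0),
    "Jalapeno": (4500.0, 2.0),
}

def mh_per_joule(rate_mhs: float, watts: float) -> float:
    """MH/s divided by W gives megahashes per joule."""
    return rate_mhs / watts

baseline = mh_per_joule(*rigs["i5-2400"])
for name, (rate, watts) in rigs.items():
    eff = mh_per_joule(rate, watts)
    print(f"{name}: {eff:.3f} MH/J, {eff / baseline:.1f}x the i5-2400")

speedup = rigs["Jalapeno"][0] / rigs["i5-2400"][0]      # raw hash-rate ratio
power_ratio = rigs["i5-2400"][1] / rigs["Jalapeno"][1]  # how much less power is drawn
print(f"speedup {speedup:.1f}x, power ratio {power_ratio:.1f}x")
```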

  5. Less rounding of floating point numbers by raymorris · · Score: 4, Informative

    While it's got the expected 10-15% faster for the same clock speed for integer applications, floating point applications are almost twice as fast HTH

    Integer and floating point are separately implemented in the hardware, so an improvement to one often doesn't apply to the other. You can add integers by counting on your fingers. To do that with floating point, you have to cut your fingers into fractions of fingers - a very different process.
    See: http://en.wikipedia.org/wiki/FMA3
    It's common to have an accumulator like this:

    X = X + (Y * Z)

    To compute that in floating points, the processor normally does:

    A = ROUND(Y*Z)
    X = ROUND(X+A)

    Each ROUND() is necessary because the processor only has 64 bits in which to store the endless digits after the decimal point. FMA can fuse the multiply and the add, getting rid of one rounding step, and the intermediate variable:

    X= ROUND( X + (Y*Z) )

    That makes it faster. Since integers don't get rounded to the available precision, the optimization doesn't apply to integers. The above processor would do Y*Z, then +X, then round, then X=. A CPU designer can make that faster by including either an "add and multiply" circuit or an "add and round" circuit or a "round and assign" circuit. Any set of operations can be done in two clock cycles, if the maker decides to include a hardware circuit for it.
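
The rounding difference described above can be seen without FMA hardware. The sketch below emulates the fused operation with exact rational arithmetic followed by a single rounding; `fma_emulated` is a stand-in written for this example, not a real instruction or library call, and the inputs are chosen so that the intermediate rounding visibly loses information. It illustrates the accuracy side only and says nothing about speed.

```python
from fractions import Fraction

def fma_emulated(y: float, z: float, x: float) -> float:
    """Compute x + y*z exactly with rationals, then round once to a double."""
    return float(Fraction(x) + Fraction(y) * Fraction(z))

y = 1.0 + 2.0**-30
z = 1.0 - 2.0**-30
x = -1.0

two_roundings = x + (y * z)           # ROUND(X + ROUND(Y*Z))
one_rounding = fma_emulated(y, z, x)  # ROUND(X + Y*Z)

print(two_roundings)  # the 2**-60 term is lost when Y*Z is rounded to 1.0
print(one_rounding)   # the single rounding keeps it
```

Here Y*Z is exactly 1 - 2**-60, which does not fit in a double, so the two-step version rounds it to 1.0 before the add ever happens.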

  6. Re:Would that improve hashing speeds in, say, Bitc by Anonymous Coward · · Score: 3, Insightful

    Would that improve hashing speeds in, say, Bitcoin?

    Bitcoin is based on SHA256 hashing, which has zero floating point operations. So no, this will not impact Bitcoin mining at all.
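
To make the "zero floating point" point concrete, here is a small sketch. It shows one of the SHA-256 message-schedule functions, which is built from rotates, a shift and XORs on 32-bit words, plus the double application of SHA-256 used for block hashing. The helper names and the input bytes are placeholders for illustration only.

```python
import hashlib

MASK32 = 0xFFFFFFFF  # SHA-256 works on 32-bit unsigned words

def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def small_sigma0(x: int) -> int:
    """A SHA-256 message-schedule function: two rotates, one shift, XORed together."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, using the standard library implementation."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

print(hex(small_sigma0(1)))
print(double_sha256(b"example header bytes").hex())  # placeholder input, not a real block header
```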

  7. Re:128 bit floats: when? by gnasher719 · · Score: 2

    While speed for single and double floats is all well and good, I wonder - when will there finally be hardware support for 128 bit (quadruple precision) floats?

    It was there on PowerPC for many years, and with Haswell it will be there for x86 as well. FMA is all you need for efficient 128 bit arithmetic.
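
Presumably this refers to double-double arithmetic, where a value is carried as an unevaluated sum of two doubles; that reading is an assumption, not something the comment states. The building block is an error-free product, which an FMA gives you in two operations. The sketch below emulates the FMA with exact rationals (`fma_emulated` is a stand-in written for this example), so it runs on any machine.

```python
from fractions import Fraction

def fma_emulated(a: float, b: float, c: float) -> float:
    """a*b + c with a single rounding, emulated with exact rationals."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def two_product(a: float, b: float) -> tuple[float, float]:
    """Return (hi, lo) with hi + lo equal to a*b exactly, barring overflow/underflow."""
    hi = a * b                    # rounded product
    lo = fma_emulated(a, b, -hi)  # rounding error of that product, recovered by the FMA
    return hi, lo

hi, lo = two_product(0.1, 0.3)
exact = Fraction(0.1) * Fraction(0.3)
print(hi, lo)
print(Fraction(hi) + Fraction(lo) == exact)
```

Note that a pair of doubles carries roughly 106 significand bits, a little short of the 113 in IEEE quadruple precision.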

  8. FMA4 by ssam · · Score: 3, Informative

    Pah. AMD has had FMA4 since 2011.

  9. Re:128 bit floats: when? by Twinbee · · Score: 2

    It would avoid the need for some extra math with extremely large numbers (not just those that end up large, but also where the intermediate calculation may be large, e.g. factorial math to find the probability of something, if I recall). Plus, 96 bits is more than enough for the fraction if you ask me; it's very greedy, in fact, to take that to 112 at the cost of 16 bits the exponent could well do with.
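
A sketch of the kind of "extra math" this seems to mean (that interpretation is an assumption): with 64-bit floats, factorials overflow quickly, so a probability built from them is usually computed in log space instead. The function name below is made up for illustration.

```python
import math

def binomial_pmf(n: int, k: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed in log space to avoid huge factorials."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + k * math.log(p) + (n - k) * math.log1p(-p))

# 170! still fits in a 64-bit float, 171! does not.
print(float(math.factorial(170)))
try:
    float(math.factorial(171))
except OverflowError:
    print("171! overflows a 64-bit float")

# The final probability is an ordinary number even though 1000! is astronomically large.
print(binomial_pmf(1000, 500, 0.5))
```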

    --
    Why OpalCalc is the best Windows calc
  10. GT3 by edxwelch · · Score: 3, Interesting

    AMD lost the CPU race a long time ago, but still beats Intel with integrated graphics. Now it looks like Haswell could win that battle too.
    The article shows GT2 to be 15% - 50% faster than the old HD4000. That's still a bit slower than Trinity, but GT3 has double the execution units of GT2, potentially blowing away anything that AMD could offer.

  11. Re:Poor AMD by dshk · · Score: 4, Insightful

    AMD already has FMA3. They also published great results. Of course nobody read them; at least I have not seen them mentioned in the usual generic benchmark articles people like to refer to (which do not use FMA3).

  12. You Obviously Never Used Sun Servers W/O ECC by raftpeople · · Score: 2

    In the early 2000s we had some; every week one of them would crash. All the other servers w/ECC: no crash. Hardly a marketing gimmick.

  13. Re:Would that improve hashing speeds in, say, Bitc by slashmydots · · Score: 2

    You can sell them on the exchange quickly and easily for USD (or 5 other major currencies)

  14. Re:128 bit floats: when? by Jeremy+Erwin · · Score: 2

    Here's an old paper describing octuple precision on the PowerPC G4:

    Many problems in number theory and the computational and physical sciences, especially in recent times, require more floating point precision than is commonly available in fundamental computer hardware. For example, the new science of “experimental mathematics,” whereby algebraic truths are foreshadowed, even discovered numerically, requires much more than single (32-bit) or double (64-bit) precision.

    That paper references Bailey's 2000 paper on Quad double algorithms, which alludes to "pure mathematics, study of mathematical constants, cryptography, and computational geometry