Slashdot Mirror


AMD Unveils SSE5 Instruction Set

mestlick writes "Today AMD unveiled its 128-Bit SSE5 Instruction Set. The big news is that it includes 3 operand instructions such as floating point and integer fused multiply add and permute. AMD posted a press release and a PDF describing the new instructions."

25 of 85 comments

  1. Well, I'm excited. I think. by Harik · · Score: 4, Insightful

    So, where's the analysis by people who write optimized media encoders/decoders? How useful are these new instructions, or are they just toys? How well did they handle context switching? What's the CX overhead? Is there a penalty for all processes, or only when you are switching to/from an SSE5 process? Will this be safely usable under all operating systems, or will they need a patch?

  2. APL by Citizen+of+Earth · · Score: 3, Funny

    instructions such as floating point and integer fused multiply add and permute

    So machine languages are APL-compatible these days.

    1. Re:APL by Ilyon · · Score: 2, Interesting

      I would say APL has always been compatible with the various vector/parallel machine languages. With the general but precise nature of APL expression, it should be easy to generically and efficiently parallelize/vectorize any APL interpreter for any machine architecture. Is there much activity in marketing of current APL products? It seems like IBM is doing nothing more than supporting existing customers. Jim Brown and company established SmartArrays, which caters a specific C APL library to specific customers. MicroAPL seems to be diversifying into other areas, although they still update APLX periodically. I haven't seen much action on the open source front, although I have seen an open source APL project on Sourceforge. Is there any chance that the emergence of parallel architectures will spur a resurgence of interest in APL?

  3. Re:...or are they just toys? by theGreater · · Score: 5, Funny

    It ROUNDSS! It ROUNDSS us! It FRCZSS! Nasty AMD added to it.

  4. Re:Well, I'm excited. I think. by PhrostyMcByte · · Score: 3, Interesting

    I don't write those fancy codecs, but I can immediately see where some of these instructions could come in handy - for instance, PCMOV and PTEST (packed cmov/test).

    The new instructions take up an extra opcode byte, but seeing how they will lower the number of instructions you would otherwise execute, I don't see that as a problem. The super instructions (like FMADDPS - Multiply and Add Packed Single-Precision Floating-Point) do more than just help the instruction decoder too - they mention "infinitely precise" intermediate voodoo for several of them, which makes it seem like doing an FMADDPS instead of a MULPS, ADDPS pair will give a more accurate result.

    There are new 16-bit floating point instructions too, which I can see as a boon for graphics wanting the ease of floating point and a little higher rounding precision than bytes with values between 0 and 255 would give, without the large memory requirements of 32-bit floating point.
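A rough illustration of the precision argument above, as a sketch only: it assumes the 16-bit format is the IEEE 754 half-precision layout (which Python's `struct` module exposes as format code `e`) and compares it with an 8-bit value scaled to the range 0..1. The helper name `to_half_and_back` is made up for this example.

```python
import struct

def to_half_and_back(value):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

target = 0.01                            # a dark pixel intensity in the range 0..1

as_half = to_half_and_back(target)       # stored in 2 bytes
as_byte = round(target * 255) / 255      # stored in 1 byte, 256 evenly spaced levels

print("half-precision error:", abs(as_half - target))
print("8-bit error:         ", abs(as_byte - target))
print("bytes per half value:", struct.calcsize("<e"))
```

The half-precision value costs two bytes instead of four, and for small values like this one it lands much closer to the target than the nearest of 256 evenly spaced byte levels.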

  5. It's a couple links deep... by SanityInAnarchy · · Score: 5, Informative

    Read this interview with Dr Dobbs:

    A floating-point matrix multiply using the new SSE5 extensions is 30 percent faster than a similar algorithm

    I believe this helps gaming and other simulations.

    Discrete Cosine Transformations (DCT), which are a basic building block for encoders, get a 20 percent performance improvement

    And then we have the "holy shit" moment:

    For example, the Advanced Encryption Standard (AES) algorithm gets a factor of 5 performance improvement by using the new SSE5 extension

    If I get one of these CPUs, I'll almost certainly be encrypting my hard drives. It was already fast enough, but now...

    As for existing OS support, it looks promising:

    We're also working closely with the tool community to enable developer adoption -- PGI is on board, updates to the GCC compiler will be available this week, and AMD Code Analyst Performance Analyzer, AMD Performance Library, AMD Core Math Library and AMD SimNow (system emulator) are all updated with SSE5 support.

    So, if you're really curious, you can download SimNow and emulate an SSE5 CPU, try to boot your favorite OS... even though they say they're not planning to ship the silicon for another two years. Given that they say the GCC patches will be out in a week, I imagine two years is plenty of time to get everything rock solid on the software end.

    --
    Don't thank God, thank a doctor!
    1. Re:It's a couple links deep... by funfail · · Score: 4, Funny

      Why? Recovery is 5 times faster now.

    2. Re:It's a couple links deep... by gnasher719 · · Score: 3, Informative

      >> And then we have the "holy shit" moment:

      For example, the Advanced Encryption Standard (AES) algorithm gets a factor of 5 performance improvement by using the new SSE5 extension
      If I get one of these CPUs, I'll almost certainly be encrypting my hard drives. It was already fast enough, but now...

      They copied two important features from the PowerPC instruction set: Fused multiply-add (calculate +/- x*y +/- z in one instruction), and the Altivec vector permute instruction, which can among other things rearrange 16 bytes in an arbitrary way. The latter should be really nice for AES, because it does a lot of rearranging 4x4 byte matrices (if I remember correctly).
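As a sketch of the kind of rearranging meant here: the AES ShiftRows step, with the 4x4 state stored column by column in 16 bytes, can be written as a single 16-byte permute. The function `permute16` below is a simplified software model of a byte permute (one source register only), not the actual SSE5 or Altivec instruction semantics.

```python
def permute16(src, selector):
    """Simplified model of a byte permute: out[i] = src[selector[i] & 0x0F]."""
    return bytes(src[s & 0x0F] for s in selector)

# AES state layout: byte index = row + 4 * column.
# ShiftRows rotates row r left by r columns, so the byte that ends up at
# (row r, column c) comes from (row r, column (c + r) % 4).
SHIFT_ROWS = bytes(r + 4 * ((c + r) % 4) for c in range(4) for r in range(4))

def shift_rows(state):
    """Apply AES ShiftRows to a 16-byte state with one permute."""
    return permute16(state, SHIFT_ROWS)

state = bytes(range(16))
print(list(shift_rows(state)))
```

Because the input state here is just 0..15, the printed result is the selector itself, which makes the byte movement easy to read off.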

  6. Foundations for the GPU+CPU assimulation... by WoTG · · Score: 4, Insightful

    I'm not really qualified to offer an opinion on this, but my guess is that these instructions will prove increasingly useful as AMD integrates the GPU and CPU. To me, it looks like they plan to make accessing what was traditionally part of the GPU a simple process (relative to accessing a GPU directly through their own pseudo CPU APIs).

    It'll take a couple years for "SSE5" to show up in AMD chips... which happens to coincide nicely with their Fusion (combined CPU+GPU) product line plans.

    Will Intel pick up on these instructions? Maybe not. Does that mean they die? No, the performance benefits for those areas where this will make the most difference will make it worthwhile. At the very least, AMD can sponsor patches to the most popular bits of OSS to earn a few PR points (and benchmark points).

    1. Re:Foundations for the GPU+CPU assimulation... by WoTG · · Score: 2, Interesting

      My thought was that the long term plan is to integrate the GPU anyway (for one product line at least). While the GPU is RIGHT THERE, they will find a way to use as much of it as they can when it's not busy with 3D work... which for the average office environment is 95% of the time.

      Gamers can still buy addon graphics cards, of course.

  7. Re:Can someone explain please by NeuralAbyss · · Score: 2, Informative

    The 64-bit designation refers to the width of the address bus*. For example, IA-32 processors have been able to handle 64 bit integers for ages.. so a 64-bit address-capable processor handling 128 bit numbers is nothing new.

    * Yes, PAE was a slight deviation from a 32 bit address space, but in userspace, it's 32 bit flat memory.

  8. Re:Can someone explain please by GroovBird · · Score: 3, Informative

    I believe the 64-bit designation refers to the width of the general purpose registers. This usually correlates to the address space used, but has nothing to do with the address bus. The 8086, for example, while being a 16-bit processor had a 20-bit address bus. The 8088 was a 16-bit processor, but only had an 8-bit data bus to save costs. Both were 16-bit processors, because the general purpose registers (AX, BX, CX, DX) were 16-bit.

    In the x64 world, the general purpose registers are 64-bit wide. This also used to influence the width of the 'int' datatype in the C compiler, although I'm not sure that 'int' is a 64-bit integer when compiling x64 code.

  9. Re:Well, I'm excited. I think. by aquaepulse · · Score: 2, Insightful

    And assuming that a floating-point number is represented by 128 bits, that still means there are only 2^128 (that is, 17,179,869,184) discrete values it can represent.

    Sadly that's wrong. 17,179,869,184 = 2^34. I mean, is it that difficult for people writing articles to check their math?
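The arithmetic is quick to check with Python's arbitrary-precision integers. The variable names below are only for illustration.

```python
claimed = 17_179_869_184          # the figure given in the article

# bit_length() - 1 is the exponent when the number is an exact power of two.
exponent = claimed.bit_length() - 1
is_power_of_two = claimed == 2 ** exponent

print("claimed value is 2 **", exponent, "exactly:", is_power_of_two)
print("2 ** 128 =", 2 ** 128)
```

The real 2^128 has 39 decimal digits, so the quoted figure is off by a factor of 2^94.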
  10. Re:Cryptographer's Take? by MrNaz · · Score: 2, Insightful

    Great! I'm glad those two organizations have such a long and distinguished history of self-restraint when it comes to the borders of their mandated spheres of operation.

    --
    I hate printers.
  11. Re:Can someone explain please by forkazoo · · Score: 5, Informative

    The 64-bit designation refers to the width of the address bus*. For example, IA-32 processors have been able to handle 64 bit integers for ages.. so a 64-bit address-capable processor handling 128 bit numbers is nothing new.


    Technically, the "bit designation" of a platform is defined as the largest number on the spec sheet which marketing is convinced customers will accept as truthful. Seriously, over the years different processors and systems have been "16 bit" or "32 bit" for any number of odd and wacky reasons. For example, the Atari Jaguar was widely touted as a 64 bit platform, and the control processor was a Motorola 68000. The Sega Genesis also had a 68k in it, and was a 16 bit platform. The thing is, Atari's marketing folks decided that since the graphics processor worked in 64 bit chunks, they could sell the system as a 64 bit platform. C'est la vie. It's an issue that doesn't just crop up in video game consoles -- I just find the Jaguar a particularly amusing example.

    But, yeah, having a CPU sold as one "bitness" and being able to work with a larger data size than the bitness is not unusual. The physical address bus width is indeed one common designator of bitness, just as you say. Another is the internal single address width, or the total segmented address width. Also, the size of a GPR is popular. On many platforms, some or all of those are the same number, which simplifies things.

    An Athlon64, for example, has 64 bit GPR's, and in theory a 64 bit address space, but it actually only cares about 48 bits of address space, and only 40 of those bits can actually be addressed by current implementations.

    A 32 bit Intel Xeon has 32 bit GPR's, but an 80 bit floating point unit, the ability to do 128 bit SSE computations, 32 bit individual addresses, and IIRC a 36 bit segmented physical address space. But Intel's marketing knew that customers wouldn't believe it if they called it anything but 32 bit, since it could only address 32 bits in a single chunk. (And they didn't want it to compete with IA64!)
  12. What about 256 bit? by renoX · · Score: 2, Insightful

    For 'serious' scientific computing, they use 64b FP numbers, and vectors of 4 elements seem about right, so SIMD computations of 4*64=256 bits seem the 'right size' for these users.

    Sure, multimedia & games use lower precision FP computations, so 16b or 32b FP numbers are enough, but it's strange that AMD doesn't try to improve things for the scientific computation niche.

    Maybe it's because the change would be expensive: to be efficient, the width of the memory bus would have to be expanded to 256b from the current 128b.

  13. Tom, Jerry, and IOP by tepples · · Score: 2, Informative

    for example, the Atari Jaguar was widely touted as a 64 bit platform, and the control processor was a Motorola 68000.

    The Jaguar had a 64-bit data bus, a 32-bit CPU "Tom" connected to the GPU, a 32-bit CPU "Jerry" connected to the sound chip, and a 32-bit MC68000 with a 16-bit connection to the data bus, used as an I/O processor (in much the same way that the PS2 uses the PS1 CPU). Some games ran their game logic on "Tom"; others (presumably those developed by programmers hired away from Genesis or Neo-Geo shops) ran it on the IOP. Pretty much only graphics operations ever used the full width of the data bus.
  14. Re:Well, I'm excited. I think. by CryoPenguin · · Score: 3, Informative

    Being thick (and out of coffee) how the hell can any thing be infinitely precise?

    The result will still eventually be stored back into a floating-point number. What it means for an intermediate computation to be infinitely precise is just that it doesn't discard any information that wouldn't inherently be discarded by rounding the end result.
    When you multiply two finite numbers, the result has only as many bits as the combined inputs. So it's quite possible for a computer to keep all of those bits, then perform the addition with that full precision, and then chop it back to 32 bits. As opposed to implementing the same operation with current instructions, which would be: multiply, (round), add, (round).
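The difference between "multiply, (round), add, (round)" and a single final rounding can be seen in a small sketch. It uses Python's 64-bit doubles rather than 32-bit floats, and it emulates the fused operation with exact rational arithmetic; `fused_multiply_add` is a hypothetical helper for this example, not an SSE5 instruction or a standard library function.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate a fused a*b + c: exact intermediate, one rounding at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)            # the only rounding step

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which a double cannot hold,
# so the separate multiply rounds it to 1.0 before the add.
unfused = a * b + c
fused = fused_multiply_add(a, b, c)

print("multiply, round, add, round:", unfused)
print("single rounding at the end: ", fused)
```

With separate operations the small term is lost in the first rounding and the result is zero; with one rounding at the end the result is the exact answer, -2^-60.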
  15. Re:Cryptographer's Take? by gnasher719 · · Score: 2, Interesting

    '' Can one of the cryptographers on slashdot comment on whether this is useful to them or not? ''

    One useful addition (copied from Altivec) is the vector permute instruction. What is clever about it in terms of cryptography is that you can translate a vector using a 256 byte translation table _without doing any memory access_ by using the vector permute instruction in a clever way. Now the execution time is completely data-independent, so one important attack vector is closed.
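One way the trick could work, as a sketch only: split the 256-byte table into sixteen 16-byte chunks held in registers, permute every chunk with the low nibble of each input byte, and keep the result from the chunk that matches the high nibble. `permute16` is a simplified model of a byte permute (one 16-byte source), and `table_lookup` is a made-up name. The Python code shows the data flow only and is not itself constant-time.

```python
def permute16(src, selector):
    """Simplified model of a byte permute: out[i] = src[selector[i] & 0x0F]."""
    return bytes(src[s & 0x0F] for s in selector)

def table_lookup(table, data):
    """Translate 16 bytes through a 256-byte table using only permutes and masks."""
    assert len(table) == 256 and len(data) == 16
    out = bytes(16)
    for k in range(16):                      # every chunk is always processed
        chunk = table[16 * k:16 * k + 16]    # would live in a vector register
        candidate = permute16(chunk, data)   # low nibble selects within the chunk
        # Byte-wise compare: 0xFF where the high nibble selects chunk k.
        mask = bytes(0xFF if (d >> 4) == k else 0x00 for d in data)
        out = bytes(o | (c & m) for o, c, m in zip(out, candidate, mask))
    return out

table = bytes((7 * i + 3) % 256 for i in range(256))   # arbitrary example table
data = bytes(range(0, 256, 17))                        # 16 sample input bytes

print(table_lookup(table, data) == bytes(table[d] for d in data))
```

The same sequence of permutes and masks runs whatever the input bytes are, which is the property the data-independence argument relies on.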

  16. AMD just forked x86 by RecessionCone · · Score: 2, Interesting

    If you read the fine print, AMD is actually not implementing all of SSE4 on the Bulldozer chip which will be the first to include SSE5. This is disastrous - the SSE "brand" has always implied backwards compatibility: SSE1 contains MMX, SSE2 contains SSE1 & MMX, etc. etc. Now AMD is breaking this, since SSE5 chips will not include all of SSE4. AMD shouldn't have named these new extensions SSE5. As it is, they are forking the x86 instruction set, which is a bad thing for all of us.

    Here's some more information: http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=3073

  17. Re:32-bit Genesis before 16-bit Super NES? by Jagetwo · · Score: 2, Informative

    Motorola 6800x, 68010 are 16-bit designs, that is, 16-bit processors with 32-bit register file. Whenever you used 32-bit operands on those CPUs, they were slower, because it was really executing them in 16-bit parts. Bus was also 16-bits wide, but with 24 address lines. It was just a forward-thinking design hiding 16-bitness.

  18. Re:...or are they just toys? by ben+there... · · Score: 2, Funny

    Nasty AMD added to it.

    The better question is how the fuck did AMD get to write the next iteration of an Intel technology. Shouldn't it be AMD 3DNow!^2? This is like Apple deciding their next HFS filesystem will be versioned NTFS 7.0.

    They can battle back and forth with version numbers and see who is first to get to 11, the version number where, for whatever reason, developers are forced to come up with a new versioning scheme. That will throw a wrench in the works. Take that Intel!
  19. Re:Well, I'm excited. I think. by gnasher719 · · Score: 2, Informative

    >> Being thick (and out of coffee) how the hell can any thing be infinitely precise? Or atleast while it can be infinitely precise how do you go about checking it... might take a while to prove it for all possible numbers (of which there is an infinite amount of, and for each one you would have to check it to an infinite number of decimal places).

    I'll give you an example. Let's say we are working with four decimal digits instead of 53 binary digits, which is what standard double precision uses. Any operation will behave as if it calculated the infinitely precise result and then rounded it. For example, any result x that is in the range 1233.5 <= x <= 1234.5 with infinite precision will be rounded to 1234.

    Now let's say we calculate x * y + z with infinite precision and round. We have x = 2469, y = 0.5, and z happens to be 0.00000000001. So x * y = 1234.5, x * y + z is just a tiny bit larger, so the result has to be rounded up to 1235. To do this right, you need x * y with infinite precision. Knowing twelve decimals wouldn't be enough. If I told you "x * y equals 1234.50000000 with twelve digit precision", you wouldn't know how to round x * y + z. x * y could be 1234.499999996, and adding z would still be less than 1234.5, so it needs to be rounded down. Or x * y could be 1234.500000004, and x * y + z needs to be rounded up.

    That is what is meant by "infinite precision": The processor guarantees to give the same result _as if_ it used infinite precision for the calculation. In practice, it doesn't use infinite precision. About 110 binary digits of precision is enough to get the same result.
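The four-digit example can be replayed with Python's `decimal` module, whose contexts round every operation to a fixed number of digits and also provide a fused multiply-add. This is a sketch assuming round-half-to-even, which matches the rounding range given in the example; the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Four significant decimal digits, rounding ties to the even neighbour.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

x = Decimal("2469")
y = Decimal("0.5")
z = Decimal("0.00000000001")

# Separate operations: the product 1234.5 is rounded before z is added.
product = ctx.multiply(x, y)
unfused = ctx.add(product, z)

# Fused: x * y + z is computed exactly and rounded once.
fused = ctx.fma(x, y, z)

print("multiply, round, add, round:", unfused)
print("fused multiply-add:         ", fused)
```

The separate operations lose the information that the sum lies just above 1234.5, so they round down to 1234, while the fused version rounds up to 1235 as the infinitely precise result requires.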

  20. Re:Well, I'm excited. I think. by arodland · · Score: 2, Informative

    The important word there is intermediate. You don't get a result of infinite precision, you get a 32-bit result (since the parent mentioned single-precision floating point). But it carries the right number of bits internally, and uses the right algorithms, so that the result is as if the processor did the multiply and add at infinite precision, and then rounded the result to the nearest 32-bit float. Which is better than the result you would get by multiplying two 32-bit floats into a 32-bit float, then adding that to another 32-bit float into a 32-bit float. You're limited to 32 bits at all times and therefore you have intermediate precision loss.

    Making sense now?

  21. AES - how is speedup achieved? by Paul+Crowley · · Score: 2, Interesting

    I've just paged through the spec PDF, and I can't work out for the life of me how these instructions help you implement AES. In normal implementations AES does sixteen byte-to-word table lookups per round and these lookups take nearly all the time; they also open up a host of vulnerabilities in side channel attacks. To avoid these lookups you have to have a way of doing the GF(2^8) arithmetic directly, and I can't see any way these instructions will help.

    Anyone got any guesses? Someone who understands Matsui's recent work on bitslice AES implementations better than I do? Will this implementation be resistant to lookup-based side channel attacks?