AMD Unveils SSE5 Instruction Set
mestlick writes "Today AMD unveiled its 128-Bit SSE5 Instruction Set. The big news is that it includes 3 operand instructions such as floating point and integer fused multiply add and permute.
AMD posted a press release and a PDF describing the new instructions."
In 2009 I'll be holding out for SSE8 anyway.
So, where's the analysis by people who write optimized media encoders/decoders? How useful are these new instructions, or are they just toys? How well did they handle context switching? What's the CX overhead? Is there a penalty for all processes, or only when you are switching to/from an SSE5 process? Will this be safely usable under all operating systems, or will they need a patch?
So machine languages are APL-compatible these days.
It ROUNDSS! It ROUNDSS us! It FRCZSS! Nasty AMD added to it.
I don't write those fancy codecs, but I can immediately see where some of these instructions could come in handy - for instance, PCMOV and PTEST (packed cmov/test).
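To make the packed conditional move idea a bit more concrete, here is a rough Python sketch. It assumes PCMOV behaves as a plain bitwise select driven by a third selector operand, which is my reading and not something spelled out in this thread. The 128-bit registers are modelled as Python integers, and the helper names (`pcmov`, `pack_bytes`, `unpack_bytes`) are made up for illustration.

```python
MASK128 = (1 << 128) - 1  # model a 128-bit register as a Python int


def pcmov(src1: int, src2: int, selector: int) -> int:
    """Bitwise select (assumed semantics): src1 bits where selector is 1, else src2 bits."""
    return ((src1 & selector) | (src2 & ~selector)) & MASK128


def pack_bytes(values) -> int:
    """Pack 16 byte values into one 128-bit integer, lane 0 in the low byte."""
    return int.from_bytes(bytes(values), "little")


def unpack_bytes(reg: int) -> list:
    """Split a 128-bit integer back into its 16 byte lanes."""
    return list(reg.to_bytes(16, "little"))


a = pack_bytes(range(16))   # lanes 0..15
b = pack_bytes([100] * 16)  # every lane holds 100

# All-ones in lanes where a > 7, which is the kind of mask a packed compare produces.
selector = pack_bytes([0xFF if v > 7 else 0x00 for v in range(16)])

# Take 100 where the mask is set, keep the original lane otherwise. No branches.
result = unpack_bytes(pcmov(b, a, selector))
print(result)
```

The point is that a per-lane "if" turns into one compare plus one select, with no branch for the CPU to mispredict.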
The new instructions take up an extra opcode byte, but seeing how they will lower the number of instructions you would otherwise execute, I don't see that as a problem. The super instructions (like FMADDPS - Multiply and Add Packed Single-Precision Floating-Point) do more than just help the instruction decoder too - they mention "infinitely precise" intermediate voodoo for several of them, which makes it seem like doing an FMADDPS instead of a MULPS followed by an ADDPS will give a more accurate result.
There are new 16-bit floating point instructions too, which I can see as a boon for graphics wanting the ease of floating point and a little higher rounding precision than bytes with values between 0 and 255 would give, without the large memory requirements of 32-bit floating point.
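A quick way to get a feel for the trade-off is to round-trip a small intensity value through each storage format. This is only a sketch using Python's `struct` module (the `e` format is IEEE half precision); it has nothing to do with how the SSE5 conversion instructions themselves work, and the helper names and the sample value are my own.

```python
import struct


def roundtrip(fmt: str, x: float) -> float:
    """Store x in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def via_byte(x: float) -> float:
    """Store x in [0, 1] as an 8-bit value 0..255 and read it back."""
    return round(x * 255) / 255


x = 0.001  # a very dark pixel intensity (arbitrary example value)

err_byte = abs(via_byte(x) - x)
err_half = abs(roundtrip("<e", x) - x)    # 16-bit float, 2 bytes
err_single = abs(roundtrip("<f", x) - x)  # 32-bit float, 4 bytes

print("bytes per value:", 1, struct.calcsize("<e"), struct.calcsize("<f"))
print("round-trip error:", err_byte, err_half, err_single)
```

For small values like this one the 16-bit float lands between the byte and the 32-bit float in accuracy, at half the memory of the latter.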
Can one of the cryptographers on Slashdot comment on whether this is useful to them or not?
(yes, I am paranoid... why do you ask? are you with the CIA?)
-jX
Don't you just love politics? It's like a comedy of errors.
Can someone explain how a 64bit processor can run 128 bit instructions, or what this actually means? Thanks
- Aetheral Research -
Read this interview with Dr Dobbs:
I believe this helps gaming and other simulations.
And then we have the "holy shit" moment:
If I get one of these CPUs, I'll almost certainly be encrypting my hard drives. It was already fast enough, but now...
As for existing OS support, it looks promising:
So, if you're really curious, you can download SimNow and emulate an SSE5 CPU, try to boot your favorite OS... even though they say they're not planning to ship the silicon for another two years. Given that they say the GCC patches will be out in a week, I imagine two years is plenty of time to get everything rock solid on the software end.
Don't thank God, thank a doctor!
I'm not really qualified to offer an opinion on this, but my guess is that these instructions will prove increasingly useful as AMD integrates the GPU and CPU. To me, it looks like they plan to make accessing what was traditionally part of the GPU a simple process (relative to accessing a GPU directly through their own pseudo CPU APIs).
It'll take a couple years for "SSE5" to show up in AMD chips... which happens to coincide nicely with their Fusion (combined CPU+GPU) product line plans.
Will Intel pick up on these instructions? Maybe not. Does that mean they die? No, the performance benefits for those areas where this will make the most difference will make it worthwhile. At the very least, AMD can sponsor patches to the most popular bits of OSS to earn a few PR points (and benchmark points).
Being thick (and out of coffee), how the hell can anything be infinitely precise? Or at least, while it can be infinitely precise, how do you go about checking it... might take a while to prove it for all possible numbers (of which there is an infinite amount, and for each one you would have to check it to an infinite number of decimal places).
:)
One of my pet peeves is statements like "infinitely precise".
--- Users are like bacteria -> Each one causing a thousand tiny crises until the host finally gives up and dies.
A very quick Google search for "infinite precise" yielded this.
What I think you meant was, "How can the infinitely precise number be stored and accessed by a computer?" Well, that's not the same thing.
Guess the filter didn't like my url.... http://www.bookrags.com/research/infinite-precision-arithmetic-wcs/ in plain text.
If they take anything close to the same attitude with their GPUs as they just did with their new CPU instruction set, that would mean we'd finally have a reasonably fast GPU with a completely open software stack.
As it is, ATI/AMD is maybe less proprietary than nVidia, but their Linux support sucks. Intel, however, typically has very good support, even though it's entirely open drivers, and apparently not sponsored much by Intel itself.
So since the people posting and reading here are likely to have some knowledge about the instruction set, if anyone can provide me with a link to the full instruction set (less these new instructions, I expect), I would be very grateful.
I'm an American. I love this country and the freedoms that we used to have.
Context switching doesn't apply. There's no such thing as an SSE5 process. All non-privileged instructions on the CPU are available to the processes that run on it. The OS swaps out the full state of the CPU when switching context, so it swaps those SSE registers out as well. Therefore, the OS must know what registers to swap out, but since these instructions appear to work on the same ol' SSE/SSE2 registers, a relatively recent OS should have no problem supporting applications that use them.
Basically, Matlab, Numpy, FORTRAN, and similar languages have the array processing features of APL with a more traditional syntax. So, interest in APL has never really disappeared.
For 'serious' scientific computing, they use 64-bit FP numbers, and vectors of 4 elements seem like the right size, so SIMD computations of 4*64=256 bits seem like the 'right size' for these users.
Sure, multimedia & games use lower-precision FP computations, so 16-bit or 32-bit FP numbers are enough, but it's strange that AMD doesn't try to improve things for the scientific computation niche.
Maybe it's because the change would be expensive: to be efficient, the width of the memory bus would have to be expanded to 256 bits from 128 bits now.
We're used to seeing Intel and AMD introduce new features quite regularly, but I don't really have a feel for where this is going. Are we witnessing the evolution of two entirely separate architectures here?
:-)
If this trend continues then the common set of original x86 instructions could end up as a historical relic, because if your code uses only those old instructions then it might run REALLY slowly on both manufacturers' CPUs, since the advanced manufacturer-specific instructions will be sitting around idle.
Or, is each manufacturer implementing the others' special instructions too?
A question for those who are keeping track of instruction sets.
The REX prefix for R8..R15 instructions is bloated code underperforming all.
All you can do is:
* buy a tri-core xbox360 3.2 GHz ppc64 (512 MiB of RAM)
* buy a G5 ppc64 (ppc970) (2 GiB of RAM)
* buy a XCluster blade (ppc970)
* buy a mono-core ps3 with 7 idle nurses (256 MiB of RAM).
* or don't buy it until 2 years later (still use the Full-System-Simulator from IBM)
I want a pure 64-bit x 8 Altivec/VMX and not 32-bit.
Oh, be fair. It's only 33 orders of magnitude. (base 10, anyway)
In a fair world, refrigerators would make electricity.
The result will still eventually be stored back into a floating-point number. What it means for an intermediate computation to be infinitely precise is just that it doesn't discard any information that wouldn't inherently be discarded by rounding the end result.
When you multiply two finite numbers, the result has only as many bits as the combined inputs. So it's quite possible for a computer to keep all of those bits, then perform the addition with that full precision, and then chop it back to 32 bits. As opposed to implementing the same operation with current instructions, which would be: multiply, (round), add, (round).
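The "multiply, (round), add, (round)" versus "keep all the bits, round once" difference can be shown with a small emulation. This is only a sketch: it fakes 32-bit floats by rounding Python doubles through `struct`, and it relies on the fact that the product of two 24-bit significands fits exactly in a double. It does not execute FMADDPS, and `to_f32` plus the input values are my own choices.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python double to the nearest 32-bit float and return it as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Inputs that are exactly representable as 32-bit floats.
a = to_f32(1 + 2**-12)
c = to_f32(-(1 + 2**-11))

# a * a = 1 + 2**-11 + 2**-24 needs 25 significand bits: too many for a
# 32-bit float (24 bits), but exact in a double, so nothing is lost here.
exact_product = a * a

# Current instructions: multiply, round to 32 bits, add, round to 32 bits.
separate = to_f32(to_f32(exact_product) + c)

# Fused: add using the full-width product, round to 32 bits once at the end.
fused = to_f32(exact_product + c)

print("separate:", separate)
print("fused:   ", fused)
```

In this case the separately rounded version loses the low bit of the product and cancels to zero, while the fused version keeps it and returns 2**-24.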
He paid $165 each for AMD X2 3800+ cpus?? Remind me never to buy from that NFP enterprises place he hawks in his writeup. Sounds like a ripoff joint.
You owe me a cup of coffee and a new keyboard.
Living With a Nerd
It also states here that a 16-bit architecture is one with a 16-bit data bus, address bus or register size. Wouldn't that make the Super NES an 8-bit system? Its 65C816 CPU had 16-bit registers and an 8-bit data bus. And was the Nintendo 64 an 8-bit system because it used 8-bit RDRAM at a comparatively high clock rate for the time? Perhaps the Motorola 68000 was never advertized as a 32-bit machine, because that sort of marketing ploy was not exercised at the time? Believe me, bit counts were the marketing ploy of the time.
...and 3DNow! was AMD's. Doesn't seem right for AMD to be introducing an SSE variant.
Share and Enjoy: 09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
If you read the fine print, AMD is actually not implementing all of SSE4 on the Bulldozer chip which will be the first to include SSE5. This is disastrous - the SSE "brand" has always implied backwards compatibility: SSE1 contains MMX, SSE2 contains SSE1 & MMX, etc. etc. Now AMD is breaking this, since SSE5 chips will not include all of SSE4. AMD shouldn't have named these new extensions SSE5. As it is, they are forking the x86 instruction set, which is a bad thing for all of us.
Here's some more information: http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=3073
He took the number of values representable by a 32-bit number and multiplied it by four, since 128 = 32 * 4. Makes perfect sense!
For those who actually understand real molecular nanotechnology, aka "Drexlerian" nanotechnology, you may understand that one of the real "breakthroughs" comes when you can computationally simulate the function of a 4 to 8 million atom molecular nanoassembler. Because if you can simulate one and prove that it does not violate any laws of physics then one of the classical oppositions to real molecular nanotechnology falls [1]. The argument transitions entirely from "it can't work" (common among people oriented towards "dissing" nanotech) to "you can't build one" . And as DRM, the iPhone restrictions, etc. have all shown "can't" is very swampy territory to wade into.
Now, I know if I've got 8 million cores, such a simulation is probably feasible (and presumably bandwidth limited by hypertransport data transfer rates) so the question transitions to how many atoms can one core handle and that in turn transitions to how effective the instruction set is at performing the math required for molecular dynamics simulations. So, is SSE5 any better than this or should I be lobbying AMD for SSE6 which is explicitly targeted at molecular dynamics simulations? It is not the market for business computing but it is the market that potentially millions of "nanoengineers" will fall into.
It also goes without saying that the chip manufacturers and ubergamers and SecondLife participants all have a high interest in achieving this because pushing below ~32nm using current technology is going to get very dicey at which point Moore's Law is going to have to shift from bulk atom assembly (current lithography methods) to precision atom assembly (real molecular nanoassembly).
1. There is a third argument against the simulation of a molecular nanoassembler. The argument that an atom specific design for a 4-8 million atom nanoassembler does not currently exist. The best one can point to is a few thousand atom Fine Motion Controller (http://www.imm.org/research/parts/controller/) designed by Drexler and Merkle. However the Nanoengineer software (http://www.nanoengineer-1.com/content/) from Nanorex allows one to design elements of an actual nanoassembler. If even a mere one thousand /. readers were to add 1 atom a day to the design in a distributed open source NanoAtHome.org (http://www.nanoathome.org/) type project -- the design would be complete within 1-2 years (there is a significant amount of redundancy and therefore human intellect amplification in the atom placement in a nanoassembler). You can't simulate it without designing it first -- but if one can design 400 million transistor microprocessors then designing an 8 million atom nanoassembler shouldn't be that difficult.
Good luck recovering any information when your hard drive dies entirely.
Wouldn't a theoretical quantum computer be more helpful, since you can evaluate many bit combinations simultaneously?
They can battle back and forth with version numbers and see who is first to get to 11, the version number where, for whatever reason, developers are forced to come up with a new versioning scheme. That will throw a wrench in the works. Take that Intel!
>> Being thick (and out of coffee) how the hell can any thing be infinitely precise? Or atleast while it can be infinitely precise how do you go about checking it... might take a while to prove it for all possible numbers (of which there is an infinite amount of, and for each one you would have to check it to an infinite number of decimal places).
I'll give you an example. Let's say we are working with four decimal digits instead of 53 binary digits, which is what standard double precision uses. Any operation will behave as if it calculated the infinitely precise result and then rounded it. For example, any result x that is in the range 1233.5 <= x <= 1234.5 with infinite precision will be rounded to 1234.
Now let's say we calculate x * y + z with infinite precision and round. We have x = 2469, y = 0.5, and z happens to be 0.00000000001. So x * y = 1234.5, x * y + z is just a tiny bit larger, so the result has to be rounded up to 1235. To do this right, you need x * y with infinite precision. Knowing twelve decimals wouldn't be enough. If I told you "x * y equals 1234.50000000 with twelve digit precision", you wouldn't know how to round x * y + z. x * y could be 1234.499999996, and adding z would still be less than 1234.5, so it needs to be rounded down. Or x * y could be 1234.500000004, and x * y + z needs to be rounded up.
That is what is meant by "infinite precision": The processor guarantees to give the same result _as if_ it had used infinite precision for the calculation. In practice, it doesn't use infinite precision. About 110 binary digits of precision is enough to get the same result.
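The four-digit example above can be replayed with Python's `decimal` module, which happens to offer both a rounding multiply/add and a fused `fma` that rounds only once. This is just a sketch of the arithmetic, assuming round-half-to-even (which is what makes 1234.5 round to 1234); it says nothing about how the hardware is built.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Four significant decimal digits, ties rounded to the even neighbour.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

x = Decimal(2469)
y = Decimal("0.5")
z = Decimal("1E-11")  # 0.00000000001

# Separate operations: the product 1234.5 is rounded to four digits
# before z is added, so z's contribution is already lost.
product = ctx.multiply(x, y)
separate = ctx.add(product, z)

# Fused: x * y + z is formed exactly, then rounded once.
fused = ctx.fma(x, y, z)

print("separate:", separate)  # 1234
print("fused:   ", fused)     # 1235
```

Only the fused version matches the "as if computed with infinite precision, then rounded" result.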
The important word there is intermediate. You don't get a result of infinite precision, you get a 32-bit result (since the parent mentioned single-precision floating point). But it carries the right number of bits internally, and uses the right algorithms, so that the result is as if the processor did the multiply and add at infinite precision, and then rounded the result to the nearest 32-bit float. Which is better than the result you would get by multiplying two 32-bit floats into a 32-bit float, then adding that to another 32-bit float into a 32-bit float. You're limited to 32 bits at all times and therefore you have intermediate precision loss.
Making sense now?
Perhaps. While molecular dynamics simulations are inherently "quantum", I have yet to see a paper which proposes how to solve the equations using a quantum computer. Perhaps it is a chicken and egg situation. Perhaps after multi-Qubit computers are common one will see attempts at having them perform molecular dynamics simulations. Until then, the equations for molecular simulations are reasonably well defined (electrostatic interactions between nuclei surrounded by electron clouds in motion). A non-trivial computational problem but one which we can understand from a theoretical perspective and model reasonably accurately. It is somewhat similar to simulations involving the formation of solar systems but at a much different scale.
This isn't adding new registers. It doesn't have the MMX defect. It's just more SSE stuff.
I think you may already have what you need for the simulation of such a device. Folding at home has been pumping out protein sequences for years-- but especially now that we have GPGPU I would imagine the simulation wouldn't be too difficult.
As for designing the system that you want to simulate; the thing with microprocessors is that they're very modular. You can create a register, use it 256 or however many times, and there's your cache. Then you build the part that interfaces the rest of the CPU with that group of registers, and deals with addressing, etc; and there you've got something that you can reuse again and again by simply making minor modifications to the gate schematic if, say, you wanted a 64 bit register instead of the 32 bit register you'd already designed. So the processors we're working with now are largely the result of sitting on shoulders of giants. The majority of the work has already been done, and then the engineers add a few things here and there like MMX/3dnow, SSE, etc; and then make various architecture changes, some minor, some major (Hyper Transport/On Die memory controller comes to mind), etc.
Now is your molecular nanoassembler modular like this? For sure it doesn't have the years of design and reusable hardware behind it that microprocessors have; so the comment about 400m transistor processors isn't exactly applicable as far as I can tell-- one is such a mature technology that design can be, if one were on a very very tight budget and simply had the resources, literally copy and paste; the other would require much initial R&D if I understand the idea correctly (you'd have to first design the mobility you need of the arms of the assembler, then design the best mechanical way to implement those arms, and make sure at such small scales they can withstand the torque you need, and then finally turn that into a molecular design)? And then write/tweak the software to interact with your design and start the simulation, etc.
Come to think of it, it seems to me processing power would be the least of the worries; but I really don't know.
That's what I was asking, thanks. I missed that it hadn't added any new SSE registers. Don't be so quick on the "No such thing as an SSE5 process" though - there IS such a thing as an FPU process, because of an ancient design decision from Intel that had the FPU as a coprocessor. That's stuck with us right to the point of 64bit processors - and they still have to emulate it in 32bit mode.
Could you clarify this? The only thing that I'm aware of is that as part of the MMX instruction set, if you use MM registers you need to clear them (EMMS instruction) before you can use the FPU.
It did actually make sense before my post.... it is just not infinitely precise in a pure sense of the word infinite. I was being somewhat hmmm... me - my (oh god, what's the word I'm looking for - kind of style) does not come across well sometimes in textual form.
I've just paged through the spec PDF, and I can't work out for the life of me how these instructions help you implement AES. In normal implementations AES does sixteen byte-to-word table lookups per round and these lookups take nearly all the time; they also open up a host of vulnerabilities in side channel attacks. To avoid these lookups you have to have a way of doing the GF(2^8) arithmetic directly, and I can't see any way these instructions will help.
Anyone got any guesses? Someone who understands Matsui's recent work on bitslice AES implementations better than I do? Will this implementation be resistant to lookup-based side channel attacks?
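For anyone wondering what "doing the GF(2^8) arithmetic directly" looks like compared with table lookups, here is a minimal Python sketch of shift-and-xor multiplication modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. It uses masks instead of data-dependent branches or tables, but Python itself makes no timing guarantees, so treat it purely as an illustration of the arithmetic. It is not a claim about how SSE5 would do it, and `gf256_mul` is a made-up name.

```python
AES_POLY_LOW = 0x1B  # low byte of the AES reduction polynomial 0x11B


def gf256_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, no lookup tables."""
    result = 0
    for _ in range(8):
        # -(b & 1) is all ones when the low bit of b is set, zero otherwise,
        # so this adds (xors) a into the result without branching on b.
        result ^= a & -(b & 1)
        # Multiply a by x: shift left, and reduce if the top bit fell off.
        high = (a >> 7) & 1
        a = ((a << 1) & 0xFF) ^ (AES_POLY_LOW & -high)
        b >>= 1
    return result


# Worked example from the AES specification (FIPS 197): {57} x {83} = {c1}.
print(hex(gf256_mul(0x57, 0x83)))
```

Eight rounds of shift, mask and xor per byte is a lot slower than one table lookup on a normal CPU, which is why the question of whether wide packed instructions can do many of these at once matters.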
Xenu loves you!