Nvidia Pascal GP100 GPU To Rock 4 TFLOPS Double Precision, 12 TFLOPS Single Precision Processing Power (techtimes.com)
New information has emerged regarding Nvidia's Pascal GPU, covering the total compute performance of the much-anticipated FinFET-based chip. Based on a number of slides from an independent researcher, the Nvidia Pascal GP100 features stacked DRAM (1 TB/s) and delivers as much as 12 TFLOPS of single-precision (FP32) compute performance. The flagship GPU is purportedly able to provide 4 TFLOPS of double-precision (FP64) compute performance as well.
Those numbers make it look like the previous generation used a 32x32 hardware multiplier-adder and the new one uses a 64x64. Multiplication is a great example of how a 2x increase in transistor density from Moore's law can result in something far greater than a 2x increase in real speed. To do a 64x64 multiply on an 8-bit CPU (like the 6809, which had an 8x8 multiply instruction) you would have to do 64 separate multiplies (one per pair of bytes) and then 16 sums before a number of other sums and shifts to get the exponent normalized. Each of those instructions would take 2 to 11 CPU cycles. A 16-bit hardware multiplier would reduce the 64 multiply operations to 16, and a 32-bit hardware multiplier would reduce them to 4. The barrel multiplier is often the largest structure in the ALU part of even a modern CPU. It shows up on photos of modern chips as the largest rectangular area that isn't cache or a memory controller.
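For anyone who wants to check the multiply counts, here is a rough Python sketch of a schoolbook multiply built only from w x w partial products. It counts over full 64-bit operands (a narrower significand would need fewer), and the function name limb_multiply and its parameters are made up for illustration. It ignores carries, exponent handling, and cycle timing, so it shows the operation count only, not real hardware behavior.

```python
def limb_multiply(a, b, limb_bits, total_bits=64):
    """Multiply two unsigned total_bits-wide integers using only
    limb_bits x limb_bits multiplies (schoolbook method).

    Returns (product, number_of_limb_multiplies).
    Hypothetical helper for illustration; not modeled on any real ALU.
    """
    n = total_bits // limb_bits          # limbs per operand
    mask = (1 << limb_bits) - 1

    # Split each operand into n limbs, least significant first.
    a_limbs = [(a >> (i * limb_bits)) & mask for i in range(n)]
    b_limbs = [(b >> (i * limb_bits)) & mask for i in range(n)]

    product = 0
    muls = 0
    for i, x in enumerate(a_limbs):
        for j, y in enumerate(b_limbs):
            # One narrow "hardware" multiply, shifted into position.
            product += (x * y) << ((i + j) * limb_bits)
            muls += 1
    return product, muls


if __name__ == "__main__":
    a = 0xFEDCBA9876543210
    b = 0x0123456789ABCDEF
    for w in (8, 16, 32, 64):
        p, m = limb_multiply(a, b, w)
        assert p == a * b                # same answer at every limb width
        print(f"{w:2d}-bit multiplier: {m} multiplies")
```

The count is (64 / w) squared, so each doubling of the multiplier width cuts the number of multiplies by a factor of four.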
It is amazing to recall that ASCI Red, the world's top supercomputer from 1997 to 2000, was only capable of just over 1 TFLOPS.