Slashdot Mirror


NVIDIA Unveils Tesla V100 AI Accelerator Powered By 5120 CUDA Core Volta GPU (hothardware.com)

MojoKid writes: NVIDIA CEO Jen-Hsun Huang just offered the first public unveiling of a product based on the company's next generation GPU architecture, codenamed Volta. NVIDIA just announced its new Tesla V100 accelerator that's designed for AI and machine learning applications, and at the heart of the Tesla V100 is NVIDIA's Volta GV100 GPU. The chip features 21.1 billion transistors on a die that measures 815mm2 (compared to 15.3 billion transistors and 610mm2 respectively for the previous gen Pascal GP100). The GV100 is built on a 12nm FinFET manufacturing process by TSMC. It comprises 5,120 CUDA cores with a boost clock of 1455MHz, compared to 3,584 CUDA cores for the GeForce GTX 1080 Ti and previous gen Tesla P100 AI accelerator, for example. The new Volta GPU delivers 15 TFLOPS FP32 compute performance and 7.5 TFLOPS of FP64 compute performance. Also on board is 16MB of cache and 16GB of second generation High Bandwidth (HBM2) memory with 900GB/sec of bandwidth via a 4096-bit interface. The GV100 also has dedicated Tensor cores (640 in total) accelerating AI workloads. NVIDIA notes the dedicated Tensor cores also allow for a 12x uplift in deep learning performance compared to Pascal, which relies solely on its CUDA cores. NVIDIA is targeting a Q3 2017 release for the Tesla V100 with Volta, but the timetable for a GeForce derivative family of consumer graphics cards has not been disclosed.
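As a rough sanity check, the headline numbers in the summary can be reproduced from the core count, boost clock and memory figures it quotes. The sketch below assumes each CUDA core retires one fused multiply-add (counted as 2 FLOPs) per clock and that FP64 runs at half the FP32 rate; both are assumptions, and the variable names are illustrative only.

```python
# Figures quoted in the summary.
CUDA_CORES = 5120
BOOST_CLOCK_HZ = 1455e6
MEM_BANDWIDTH_BYTES_PER_S = 900e9
MEM_BUS_WIDTH_BITS = 4096

# Assumptions, not stated in the summary.
FLOPS_PER_FMA = 2           # one multiply plus one add
FP64_RATE_VS_FP32 = 0.5     # half-rate double precision

fp32_tflops = CUDA_CORES * FLOPS_PER_FMA * BOOST_CLOCK_HZ / 1e12
fp64_tflops = fp32_tflops * FP64_RATE_VS_FP32

# Data rate each pin of the memory interface would need to sustain.
gbit_per_pin = MEM_BANDWIDTH_BYTES_PER_S * 8 / MEM_BUS_WIDTH_BITS / 1e9

print(f"FP32 peak: {fp32_tflops:.2f} TFLOPS")
print(f"FP64 peak: {fp64_tflops:.2f} TFLOPS")
print(f"Per-pin memory rate: {gbit_per_pin:.2f} Gbit/s")
```

Under those assumptions the results land just under the rounded 15 and 7.5 TFLOPS figures in the summary.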

37 comments

  1. LEAVE TESLA ALONE!! by Narcocide · · Score: 1

    He's dead. He died crazy, poor and lonely, because he was an unappreciated genius who'd been repeatedly robbed by corporate villains. I swear if the next thing some greedy corporate bastards name after Tesla isn't a cure for mental illness or a solid-state generator that provides unlimited free wireless power I'm going to blow a gasket.

    1. Re:LEAVE TESLA ALONE!! by Anonymous Coward · · Score: 0

      With this product, you get TWO dead geniuses for the price of one! You're going to run out of gaskets to blow!

    2. Re:LEAVE TESLA ALONE!! by Stoutlimb · · Score: 4, Interesting

      Where you see shame, I see honour and respect. It took generations for the public to learn the truth about his genius and tragedy. What better historical revenge than to slap HIS name on all the best and brightest things mankind creates with electricity? I can't think of a more just legacy. I think if science were to resurrect him, we would see tears of joy as the world lovingly respects his discoveries and hard work.

    3. Re:LEAVE TESLA ALONE!! by Picodon · · Score: 3, Insightful

      Having a unit of measurement named after you by the scientific community is quite enough honour and respect, and it sure beats having a corporation trying to make an extra buck by exploiting your name (and posthumous fame) for an ephemeral product line, without your consent.

      Besides, do you really think that the marketing bigwigs at Nvidia sat in a conference room asking themselves: “Guys, what deserving hero could we possibly honour with this product?”, rather than: “What name is likely to strike a fancy within our target demographics? Lightning? Magnetos? Hey, how about Tesla? Yeah, it worked for Musk!”

    4. Re:LEAVE TESLA ALONE!! by Anonymous Coward · · Score: 0

      He was a fucking nutter by the end of his life. Surely this fetishization of his legacy is just a phase and in twenty years Tesla will be just a car company.

    5. Re:LEAVE TESLA ALONE!! by K. S. Kyosuke · · Score: 1

      LEAVE TESLA ALONE!

      He died ... lonely

      Eh...mission accomplished?

      --
      Ezekiel 23:20
    6. Re:LEAVE TESLA ALONE!! by Anonymous Coward · · Score: 0

      you may be interested to learn that everything people attribute to tesla had been invented by someone else years earlier, and that he had a string of lawsuits from the real inventors alleging that he was nothing more than a serial plagiarist

      he wasn't robbed by corporate villains. he robbed *them*. wardenclyffe cost enormous money, never would have worked, and even if it would have, you still can't broadcast free electricity to the planet without a fuel source

    7. Re:LEAVE TESLA ALONE!! by Anonymous Coward · · Score: 0

      Agreed. All this fixation on Tesla is nothing more than a hipster circle-jerk. Tesla was brilliant. But he wasn't that much more brilliant than thousands of other brilliant minds. Where's all the parades for Babbage? None? Oh, that's right, it's because The Oatmeal can't write a self-righteous tear jerker about him.

    8. Re:LEAVE TESLA ALONE!! by Anonymous Coward · · Score: 0

      No way! He created the biggest doomsday machine in the world! We are just too ignorant to know about its existence but just look at what happened in Tunguska, 1908! Must be legit! /s

  2. Born crippled by Anonymous Coward · · Score: 0

    " but the timetable for a GeForce derivative family of consumer graphics cards has not been disclosed."

    Don't hold your breath. To Nvidia, consumer means "intentionally disabled." For all their "cute" shenanigans, Nvidia can Kiss My Royal Irish Ass.

    1. Re:Born crippled by Anonymous Coward · · Score: 0

      That must explain why their consumer products consistently outperform ATI cards.

    2. Re:Born crippled by Anonymous Coward · · Score: 0

      Nvidia intentionally cripples their consumer cards to limit their CUDA performance relative to their "pro" line.

    3. Re:Born crippled by devoid42 · · Score: 5, Informative

      You are actually applying a lot of ill intent here where they are just using a standard business practice among both GPU and CPU companies. The majority of the chips come from the same production line. Chips that fail QA on a certain % of their CUDA cores are "binned down" to consumer level chips. This allows them to recoup costs and provide an adequate supply of pro chips while keeping prices relatively low.

      There does come a time, though, later in their production cycle, where the production line begins to be well tuned and provides a high yield of pro level chips that surpasses the demand for those chips. In that case the vendor just sets the core count to what is required and ships to match demand.

      No real ill intent here just good business practice, you are paying for what is promised to you, and if you find a way to re-enable the extra hardware so be it. This was done in many quadro/geforce cards in the past.

      --
      I am a figment of my own imagination.

    4. Re:Born crippled by guruevi · · Score: 1

      They are servicing an entirely different market, if you want better CUDA performance, especially double precision you need to get the "Pro" line because it simply has better double precision performance but then you don't get as good graphics/gaming performance (which requires mostly single precision).

      On the other hand, the GeForce lines don't have any protections against data issues like ECC memory but ECC memory is also slower.

      You don't play Crysis on a Tesla (it doesn't even have an output port) or on a Quadro (most of them don't even have lower-quality ports like HDMI). You don't run CUDA calculations on a GeForce (unless you really don't care about the accuracy of your results).

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    5. Re:Born crippled by Kjella · · Score: 2

      You are actually applying a lot of ill intent here where they are just using a standard business practice among both GPU and CPU companies. The majority of the chips come from the same production line. Chips that fail QA on a certain % of their CUDA cores are "binned down" to consumer level chips. This allows them to recoup costs and provide an adequate supply of pro chips while keeping prices relatively low.

      Well there's certainly that from the supply side, but they're hardly that innocent. Every company tries to create products that make sure the people who can afford it pick that product and not a cheaper one. The classic quote on this is Dupuit (1849):

      It is not because of the few thousand francs which would have to be spent to put a roof over the third-class carriage or to upholster the third-class seats that some company or other has open carriages with wooden benches... What the company is trying to do is prevent the passengers who can pay the second-class fee from travelling third class; it hits the poor, not because it wants to hurt them, but to frighten the rich... And it is again for the same reason that the companies, having proved almost cruel to the third-class passengers and mean to the second-class ones, become lavish in dealing with first-class customers. Having refused the poor what is necessary, they give the rich what is superfluous.

      This is how you choose to not include some feature like Intel's missing consumer ECC support - which apparently AMD can afford to include, so clearly it's not that expensive - simply so the right people pick the "right" product. You can certainly claim some of this is for cost saving on the bill of materials or validation cost, but that's often just part of the reason or simply an excuse.

      A smart company also doesn't want to create their own Osborne effect even if their performance comes more in leaps and bounds. Money comes from having a constant supply of product that's always better than the last one, if your performance would be like 100% -> 130% -> 130% -> 130% you can probably gouge out more money doing 100% -> 120% -> 125% -> 130% even if you're artificially putting the handbrake on. Like for example how they release hardcover books first, or show movies exclusively in the cinema for the first months... it's a way to forcibly upsell the fans, even though you'd gladly read a paperback or watch it on your own TV.

      Or simply scale down the size of chips and deliver a 5% performance increase for a much cheaper cost, like Intel's been doing when AMD has been out of the high end market. It's not always you want to give the market more, just because technology improves. I know our Telco was really holding out on rolling out DSL because they made more money keeping people on pay-per-minute PSTN/ISDN lines. If you have a captive/loyal customer group they can make more money doing less. There's lots of tricks you can pull off in the border area between product design and economics to maximize profit. Capitalism isn't about serving the customer, that's an occasional side effect of making profit. Never forget that.

      --
      Live today, because you never know what tomorrow brings
  3. What's a Tensor Core? by Anonymous Coward · · Score: 0

    Somebody Google it for me

    1. Re:What's a Tensor Core? by Anonymous Coward · · Score: 0

      Tensor Cores

      Tesla P100 delivered considerably higher performance for training neural networks compared to the prior generation NVIDIA Maxwell and Kepler architectures, but the complexity and size of neural networks have continued to grow. New networks that have thousands of layers and millions of neurons demand even higher performance and faster training times.

      New Tensor Cores are the most important feature of the Volta GV100 architecture to help deliver the performance required to train large neural networks. Tesla V100’s Tensor Cores deliver up to 120 Tensor TFLOPS for training and inference applications. Tensor Cores provide up to 12x higher peak TFLOPS on Tesla V100 for deep learning training compared to P100 FP32 operations, and for deep learning inference, up to 6x higher peak TFLOPS compared to P100 FP16 operations. The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM.

      Matrix-Matrix multiplication (BLAS GEMM) operations are at the core of neural network training and inferencing, and are used to multiply large matrices of input data and weights in the connected layers of the network. As Figure 6 shows, Tensor Cores in the Tesla V100 GPU boost the performance of these operations by more than 9x compared to the Pascal-based GP100 GPU.
      Figure 6: Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations. (Measured on pre-production Tesla V100 using pre-release CUDA 9 software.)

      Tensor Cores and their associated data paths are custom-crafted to dramatically increase floating-point compute throughput at only modest area and power costs. Clock gating is used extensively to maximize power savings.

      Each Tensor Core provides a 4x4x4 matrix processing array which performs the operation D = A × B + C, where A, B, C, and D are 4×4 matrices as Figure 7 shows. The matrix multiply inputs A and B are FP16 matrices, while the accumulation matrices C and D may be FP16 or FP32 matrices.
      Figure 7: Tensor Core 4x4x4 matrix multiply and accumulate.
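To make the D = A × B + C operation concrete, here is a minimal software emulation using only the Python standard library. It is a sketch, not how the hardware is wired: the helper names `round_to` and `mma_4x4` are made up for illustration, and the order in which products are added (starting from C, then k = 0..3) is an assumption. Inputs are rounded to FP16 and the running sum is rounded after every add to either FP32 or FP16, which shows why the choice of accumulator precision matters.

```python
import struct

def round_to(x, fmt):
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, x))[0]

def mma_4x4(A, B, C, acc_fmt="f"):
    """Return D = A x B + C for 4x4 nested lists.

    A and B are rounded to FP16; the accumulator is rounded to acc_fmt
    ('f' for FP32, 'e' for FP16) after every addition.
    """
    D = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = round_to(C[i][j], acc_fmt)
            for k in range(4):
                a = round_to(A[i][k], "e")
                b = round_to(B[k][j], "e")
                # The product of two FP16 values is exact in a Python float.
                acc = round_to(acc + a * b, acc_fmt)
            D[i][j] = acc
    return D

ones = [[1.0] * 4 for _ in range(4)]
big = [[2048.0] * 4 for _ in range(4)]

d_fp32 = mma_4x4(ones, ones, big, acc_fmt="f")
d_fp16 = mma_4x4(ones, ones, big, acc_fmt="e")
print(d_fp32[0][0], d_fp16[0][0])  # prints 2052.0 2048.0
```

With an FP16 accumulator each `2048 + 1` rounds back to 2048, because FP16 values in that range are spaced 2 apart, so the four products vanish; the FP32 accumulator keeps them.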

      Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 multiply and FP32 accumulate) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply, as Figure 8 shows.
      Figure 8: Volta GV100 Tensor Core operation.
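The per-clock and peak figures quoted above can be checked with a few lines of arithmetic. This sketch takes the 640 Tensor Cores, 8 per SM, and 64 FMAs per core per clock from the text, and the 1455MHz boost clock from the story summary. Counting an FMA as 2 floating point operations and taking 64 FP32 cores per GP100 SM are assumptions not stated in the quoted passage.

```python
# Figures from the quoted text and the story summary.
TENSOR_CORES = 640
TENSOR_CORES_PER_SM = 8
FMA_PER_CORE_PER_CLOCK = 64
BOOST_CLOCK_HZ = 1455e6

# Assumptions.
FLOPS_PER_FMA = 2              # one multiply plus one add
GP100_FP32_CORES_PER_SM = 64   # Pascal GP100 SM width

sm_count = TENSOR_CORES // TENSOR_CORES_PER_SM
volta_flops_per_sm_clock = TENSOR_CORES_PER_SM * FMA_PER_CORE_PER_CLOCK * FLOPS_PER_FMA
pascal_flops_per_sm_clock = GP100_FP32_CORES_PER_SM * FLOPS_PER_FMA

per_sm_speedup = volta_flops_per_sm_clock / pascal_flops_per_sm_clock
peak_tensor_tflops = (TENSOR_CORES * FMA_PER_CORE_PER_CLOCK * FLOPS_PER_FMA
                      * BOOST_CLOCK_HZ / 1e12)

print(f"SMs: {sm_count}")
print(f"Tensor FLOPs per SM per clock: {volta_flops_per_sm_clock}")
print(f"Per-SM speedup over GP100 FP32: {per_sm_speedup:.0f}x")
print(f"Peak tensor throughput: {peak_tensor_tflops:.1f} TFLOPS")
```

The result comes out a little under the "up to 120 Tensor TFLOPS" headline, which is consistent with that number being a rounded peak at boost clock.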

      During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores. CUDA exposes these operations as Warp-Level Matrix Operations in the CUDA C++ API. These C++ interfaces provide specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently utilize Tensor Cores in CUDA C++ programs.

      In addition to CUDA C++ interfaces to program Tensor Cores directly, CUDA 9 cuBLAS and cuDNN libraries include new library interfaces to make use of Tensor Cores for deep learning applications and frameworks. NVIDIA has worked with many popular deep learning frameworks such as Caffe2 and MXNet to enable the use of Tensor Cores for deep learning research on Volta GPU based systems. NVIDIA continues to work with other framework developers to enable broad access to Tensor Cores for the entire deep learning ecosystem.

    2. Re:What's a Tensor Core? by Anonymous Coward · · Score: 0

      kthx

  4. Earth Simulator by darkain · · Score: 1

    Remember back when the Earth Simulator was new and exciting? That thing pushed apparently ~35 TFlops of compute performance. This card just announced can push 15 TFlops of compute performance. So, what you're saying, is that pretty much two of these new cards is about the same performance profile as the Earth Simulator? (of course, different architecture entirely, less ram, without storage, etc)

    https://en.wikipedia.org/wiki/...

    1. Re:Earth Simulator by Impy the Impiuos Imp · · Score: 2

      When I got my first CUDA card, my Seti@home totals over 7 years doubled in two weeks. My next upgrade redoubled all that in three months.

      This would redouble all that in days. I concluded there was little need to sweat working on it all along because doing nothing all those years, then buying one of these, say, would only put you a week or two behind where you'd otherwise be.

      --
      (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
    2. Re:Earth Simulator by Junta · · Score: 1

      Well, one, that was 2002.

      Two, the comparison would be 7.5 TFLOPS, since the V100 delivers 7.5 TFLOPS at FP64, and the TOP500 focuses exclusively on FP64 performance.

      Three, we are comparing Rpeak to Rmax (and Rpeak is increasingly not sensible).

      Of course, all that said it's still an impressive achievement, and their big headline about 120 'tensor tflops' is what they seem particularly focused on, though I have no sense of how impressed I should or shouldn't be, since I don't know tensor performance so much.

      --
      XML is like violence. If it doesn't solve the problem, use more.
    3. Re:Earth Simulator by Karmada · · Score: 1

      35TFlops was the benchmarked (double precision) performance, and 15TF is the theoretical peak single precision performance of the V100. You need double precision for these simulations. The V100 has about 7.5TFlops double precision peak performance (with Fused Multiply-Add). So in real world performance you would need something like 10 servers, each with 4 V100 GPUs, to match the performance (stacked on top of each other and connected with Infiniband). You can also put 256GB RAM into each server (and 10TB NVM). It would still have much less RAM, but should be able to beat the Earth Simulator handily due to the extra and very concentrated arithmetic performance.
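The gap between peak and benchmarked numbers in this comparison can be put into figures. The sketch below uses only values from the thread (about 35 TFLOPS benchmarked for the Earth Simulator, 7.5 TFLOPS FP64 peak per V100, and the suggested 10 servers with 4 GPUs each); the variable names are illustrative, and the "break-even efficiency" is simply the fraction of peak the cluster would have to sustain, not a measured value.

```python
import math

EARTH_SIMULATOR_TFLOPS = 35.0   # benchmarked FP64 figure cited in the thread
V100_FP64_PEAK_TFLOPS = 7.5     # theoretical peak per GPU
SERVERS = 10
GPUS_PER_SERVER = 4

# GPUs needed if every card ran at 100% of peak (unrealistic lower bound).
naive_gpu_count = math.ceil(EARTH_SIMULATOR_TFLOPS / V100_FP64_PEAK_TFLOPS)

# Aggregate peak of the cluster proposed in the comment.
cluster_peak_tflops = SERVERS * GPUS_PER_SERVER * V100_FP64_PEAK_TFLOPS

# Fraction of peak that cluster must sustain to match the benchmarked figure.
break_even_efficiency = EARTH_SIMULATOR_TFLOPS / cluster_peak_tflops

print(f"GPUs at 100% of peak: {naive_gpu_count}")
print(f"Cluster peak: {cluster_peak_tflops:.0f} TFLOPS")
print(f"Break-even efficiency: {break_even_efficiency:.1%}")
```

In other words, the 40-GPU estimate leaves a wide margin: the cluster would only need to sustain a small fraction of its theoretical peak.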

  5. Musk? by Anonymous Coward · · Score: 0

    Elon Musk was not available for comment.

  6. blurb porn by epine · · Score: 1

    Nvidia notes the dedicated Tensor cores also allow for a 12x uplift in deep learning performance compared to Pascal, which relies solely on its CUDA cores.

    Long ago, the television spent many years instructing me that "lifts and separates" is the real cigar. Accept no substitutes. That's the key.

  7. Imagine by Anonymous Coward · · Score: 0

    Imagine a Beowulf cluster of these things.

    1. Re:Imagine by Anonymous Coward · · Score: 0

      No need to imagine. These will go to the Summit and the other CORAL systems with the Power 9s.

  8. Does anybody actually use FP32? by Anonymous Coward · · Score: 0

    We tried using FP32 in a time based system with minute resolution and were getting off-by-one errors all the time due to floating point rounding - FP32s have a 24-bit significand and an 8-bit exponent.
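The off-by-one behaviour follows directly from the 24-bit significand: above 2^24 = 16,777,216, FP32 can no longer represent every integer. The sketch below round-trips values through single precision with the standard `struct` module. The minute count used is a hypothetical timestamp (roughly minutes since 1970 for mid-2017); the post does not say what epoch the system used.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

LIMIT = 2 ** 24  # largest range in which FP32 holds every integer exactly

# Just past the limit, odd integers fall between representable values.
print(to_fp32(float(LIMIT)) == LIMIT)          # True
print(to_fp32(float(LIMIT + 1)) == LIMIT + 1)  # False

# Hypothetical minute-resolution timestamp stored as FP32.
minute_stamp = 24_906_241
stored = to_fp32(float(minute_stamp))
drift = stored - minute_stamp
print(f"stored {stored:.0f}, off by {drift:+.0f} minute")
```

Between 2^24 and 2^25 the representable values are 2 apart, so every odd minute count lands on a neighbour. An integer type or FP64 avoids the problem.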

    1. Re:Does anybody actually use FP32? by Anonymous Coward · · Score: 0

      We tried using FP32 in a time based system with minute resolution and were getting off-by-one errors all the time due to floating point rounding - FP32s have a 24-bit significand and an 8-bit exponent.

      High performance computing apps like discrete time physics simulators need FP64, which of course continues to be supported on Volta at historical performance levels, but the new fad of Deep Learning based Convolutional Neural Networks tends to use FP32 for training, which is generally sufficient to make sure the weights converge (if they converge at all) with typical back-propagation algorithms. However, for inference (i.e. evaluating the trained network on novel input data), CNNs often only need FP16 or even INT8 with limited loss in recall accuracy.

      When using INT8 inference (Google's TPU chip, and also supported since Nvidia's Pascal GPUs), you often have to be very careful with scaling/normalizing every stage and it is often limited in the breadth of supportable networks, but of course with FP16 with FP32 accumulators, you can be much lazier about proper scaling/normalizing and support very large networks before accuracy degrades.
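The point about careful scaling for INT8 can be illustrated with a generic symmetric quantization scheme. This is a textbook-style sketch, not the method used by any particular chip or library; the function names, the sample activations, and both scale values are made up for illustration.

```python
def quantize_int8(values, scale):
    """Map floats to integers in [-127, 127] using one shared scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]

activations = [0.02, -0.5, 0.75, 1.0]  # hypothetical layer outputs

# Scale calibrated to the observed range: the largest magnitude maps to 127.
calibrated_scale = max(abs(v) for v in activations) / 127
restored = dequantize(quantize_int8(activations, calibrated_scale), calibrated_scale)
max_error = max(abs(a - r) for a, r in zip(activations, restored))

# Scale chosen without looking at the data: large values saturate at +/-127.
careless_scale = 0.001
clipped = dequantize(quantize_int8(activations, careless_scale), careless_scale)

print(f"calibrated: max error {max_error:.4f}")
print(f"careless:   1.0 comes back as {clipped[3]:.3f}")
```

With a calibrated scale the round-trip error stays within half a quantization step; with a careless one, anything above 0.127 is clipped, which is the kind of per-stage care the comment describes.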

  9. they have been doing it for nearly 20 years. by gl4ss · · Score: 0

    look, why believe nvidia now when they have been bullshitting about it in the past so long? they were selling for a long time exact same chips with an out of chip resistor deciding if it would accept the pro opengl drivers or not. just a flip switch. nothing more. meaning pretty much for a long time that if you bought a quadro you were a sucker. still are. so give him some slack.

    ati has been doing exact same though, so there's that.

    and you don't really need an output port on the card doing the accelerating now... nvidia knows all about the premium tax it can put on the pro stuff. it's just calculated shenanigans to get more money out of the same chips.

    --
    world was created 5 seconds before this post as it is.
    1. Re:they have been doing it for nearly 20 years. by guruevi · · Score: 1

      There are various differences, as I pointed out; the resistor hack is mostly an urban myth.

      Yes, you can make a GeForce appear to be a Quadro or even a Tesla (and trick some proprietary software) but the GeForce still won't have the same double precision performance, ECC memory or thermal management; a Quadro will still be twice as fast as your mod and, more importantly, won't crash. You pay $4k for the card because the performance, stability and memory increases over the GeForce are worth it, and yes, I have experience with them, because I thought the same thing initially, paid $1500 for one of the "best" GeForce cards (I only use it with open source software and did use the Tesla drivers/firmware so there was no artificial limit) and still ended up with 2 Teslas instead.

      The chipset may be the same but that doesn't mean all the features work the same way. In the same sense that Intel's chipsets are all the same from Celeron to Xeon, you can't say that they're just a resistor away from flooding the market with Xeons.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
  10. won't be affordable for non-corporate buyers by 0111 1110 · · Score: 1

    Unfortunately these cards will probably cost so much that only corporations, aka 'full citizens,' will be able to afford them. Too bad. I'd love to play around with neural net stuff if the cards were not more expensive than their regular graphics cards, but obviously that's not going to happen. How is it that these companies used to be able to make a profit at $299 for their high end cards? Did their costs rise so dramatically?

    --
    Quite an experience to live in fear, isn't it? That's what it is to be a slave.
    1. Re:won't be affordable for non-corporate buyers by Junta · · Score: 1

      Well, at this phase, it won't even physically go into anything apart from server designs built specifically around this specific card.

      Here there's an issue of volumes. While the enthusiast gaming market is small, the number of units to move of this sort of accelerator makes it look gigantic by comparison. It's interesting, since nVidia began coming to prominence when people started figuring out how to use off-the-shelf GPUs to accelerate HPC workloads, because the accelerator market couldn't deliver a viable product for an acceptable price.

      Now, it seems they've come full circle and are able to sell exorbitantly expensive accelerators, with very little in common with their GPUs anymore.

      --
      XML is like violence. If it doesn't solve the problem, use more.
    2. Re:won't be affordable for non-corporate buyers by Anonymous Coward · · Score: 0

      Usually it's double-precision floating point that gives them the ability to charge the equivalent of a top-end off-road 4x4. That's only really needed by the scientific visualization people for CFD simulation to guarantee high precision accuracy for detailed simulations with micro-vortices. For everyone else regular floating point values are good enough. Some cards might give you a few dozen double-precision GPU cores. The other overhead is that once you are doing simulations where the data is spread across thousands of GPU cores, you need to get them to send messages directly to each other with no latency. That requires a custom interconnect like hypercubes or fractal networks.

    3. Re:won't be affordable for non-corporate buyers by Anonymous Coward · · Score: 0

      Here there's an issue of volumes. While the enthusiast gaming market is small, the number of units to move of this sort of accelerator makes it look gigantic by comparison.

      The physical card is not the thing that gets a volume discount. We've seen from kickstarters and the overdiversity of stupid Android phones that the marginal cost on ASIC and electronics manufacturing flattens out pretty quickly. The volume discount is on the IP: what have all those primadonna engineers been doing for the last couple years? It's the same IP in all the cards.

      What's going on is the same scheme that's always gone on: price discrimination. Instead of normal supply and demand they're drilling down to the demand curve at multiple points to tap it for that sweet green.

  11. In Laymans terms by Torontoman · · Score: 1

    What % improvement does this represent over, say, 'supercomputers' of 1, 3, 5, or 10 years ago?

  12. Can we please stop calling these GPUs? by hackel · · Score: 2

    These are co-processors. Basically entire second computers added alongside the primary. GPU functions are only a minor part of their capabilities. It's like calling my mobile device a "phone" because it has one app called "Phone" which I use twice a year.

    Does no one remember when installing a math co-processor in your PC was the new hotness? This is the same thing!

  13. And now, here are the thousands of cores by Impy the Impiuos Imp · · Score: 1

    My one and only submission to make it to the main page. 9 years.

    Smoked by NVIDIA. Nobody wants Pentium cores anymore anyway.

    --
    (-1: Post disagrees with my already-settled worldview) is not a valid mod option.