NVIDIA Creates a 15B-Transistor Chip With 16GB Bandwidth Memory For Deep Learning (venturebeat.com)
An anonymous reader cites a report on VentureBeat: NVIDIA chief executive Jen-Hsun Huang announced that the company has created a new chip, the Tesla P100, with 15 billion transistors and 16GB of high-bandwidth memory for deep-learning computing. It's the biggest chip ever made, Huang said. "We decided to go all-in on A.I.," Huang said. "This is the largest FinFET chip that has ever been done." The chip has 15 billion transistors, or three times the count found in many processors or graphics chips on the market. It takes up 600 square millimeters. The chip can run at 21.2 teraflops. Huang said that several thousand engineers worked on it for years. Jim McGregor, writing for Forbes (the link is not accessible to ad-blocking tool users): It features NVIDIA's new Pascal GPU architecture, the latest memory and semiconductor process, and packaging technology -- all to create the densest compute platform to date. In addition, it combines 16GB of die-stacked second-generation High-Bandwidth Memory (HBM2). The memory and GPU are combined into a multichip module on a state-of-the-art silicon substrate. The P100 has NVIDIA's NVLink interface technology to connect to multiple Tesla P100 GPU modules.
From what my friends who work at nVidia tell me, most engineers work on all projects. They get sent problems from one GPU and, after fixing those, start working on issues from a CPU or some other project.
Yes, it can do 21.2 teraflops... at FP16, i.e. half precision. It can "only" do 10.6 teraflops at single precision and 5.3 teraflops at double (64-bit) precision. Also, it doesn't have the 1TB/sec HBM2 memory speed that was advertised for months, only 720GB/sec.
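To make the comparison above concrete, here is a minimal Python sketch of the arithmetic. It takes the half-precision figure as 21.2 teraflops (the number given in the summary) and uses the 10.6, 5.3, 1TB/sec, and 720GB/sec figures from the comment. The function and constant names are made up for illustration.

```python
# Peak throughput figures quoted in the thread for the Tesla P100, in teraflops.
# fp16 uses the 21.2 figure from the story summary (an assumption of this sketch).
PEAK_TFLOPS = {"fp16": 21.2, "fp32": 10.6, "fp64": 5.3}

# Memory bandwidth in GB/s: the figure the comment says was advertised
# versus the figure it says the part actually delivers.
ADVERTISED_GBPS = 1000.0
ACTUAL_GBPS = 720.0


def ratio_to_fp64(tflops):
    """Return each precision's peak rate as a multiple of the FP64 rate."""
    base = tflops["fp64"]
    return {name: value / base for name, value in tflops.items()}


def bandwidth_shortfall(advertised, actual):
    """Return the fraction by which actual bandwidth falls short of advertised."""
    return (advertised - actual) / advertised


if __name__ == "__main__":
    for name, ratio in ratio_to_fp64(PEAK_TFLOPS).items():
        print(f"{name}: {ratio:.1f}x the FP64 rate")
    shortfall = bandwidth_shortfall(ADVERTISED_GBPS, ACTUAL_GBPS)
    print(f"bandwidth shortfall vs. advertised: {shortfall:.0%}")
```

On these numbers each halving of precision doubles the peak rate, so the headline teraflop figure only applies to workloads that can tolerate FP16.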
There are several factors. First of all, what they are building is a HUGE engineered system which would have taken up a couple of buildings a decade or two ago. The fact that the end product is small doesn't change the complexity. The second part is the fact that it IS so small, which brings its own complications. In addition, semiconductor manufacturing is a very tricky business where even making the simplest thing (e.g., a transistor) takes an enormous amount of planning, characterization, and tool design.
Part of it is the R&D -- nothing like this has been done before, so certain things have to be figured out (heat dissipation, how the proximity of the components affects the other components, stuff neither of us will understand, etc.). Another huge part is tooling and process -- someone has to design, test and characterize the fabrication tools and processes (the "automation" you speak of has to be built by someone -- a device this complicated probably can't be built without the automation). The chip is divided into subsystems, each of which needs to be designed, simulated, and optimized. Someone has to integrate all the subsystems and simulate them together. The 1000 people probably include material scientists, process engineers, electrical engineers of various stripes, semiconductor physicists, mechanical engineers (heat dissipation, packaging, etc.), systems engineers, engineering project managers, etc.
One organizes many contributions using any number of industry-standard design methodologies. Designing airplanes and cars uses even more engineers.
I suspect NVIDIA is slightly exaggerating and is counting the contribution of many "overhead" engineers who provide value for the whole engineering organization, such as people who work on design tools, design kits, methodology and the like.
You're right, there are many repeated subunits, but each unit needs a team to optimize it.
For a chip this complex you need:
Logic Designers (who come up with high-level models for the chip and define the instruction set / hardware interface)
Front-end engineers that write Verilog and/or VHDL (I have no idea what NVIDIA uses)
Implementation engineers (who do place and route and parasitic extraction)
Verification engineers (who use various tools to see if everything is as it should be)
Packaging engineers (who work closely with vendors to develop a custom package for the chip/module)
Module engineers (since we have 3D stacked memories on this device the module engineering is far from trivial)
Thermal Engineers (3D modules typically have very complex thermal requirements)
Signal Integrity engineers (since we're going so fast just getting a signal from point A to point B is hard)
Analog/Mixed Signal engineers (for clocking, serial I/O development)
Integration Engineers (for modeling how to put all this together)
System Engineers (for figuring out if this is all going to work)
Software Engineers (for low-level software dev)
CAD Engineers (for developing and maintaining an appropriate computer-aided design flow)
Foundry Engineers (for working with the foundry on the physical production of the wafers... anything this big and complex will need process customization)
ESD engineers (for figuring out and implementing an ESD strategy)
Library Engineers (for customizing and optimizing the standard cell library used in the chip)
Product Engineers (for solving production problems as they arise)
Test Engineers (for developing and implementing tests to show the chip is working as expected)
Application Engineers (who work with early adopters to integrate this chip into their systems)
and on and on and on...
As you can see, an army of engineers is required for a chip this complex to see the light of day. On simpler chips, many of these roles can be played by the same people, but in a chip this big, they need to divide the work or it would never get done.
It's a FinFET device. You can represent more than 1 binary bit per transistor by using multi-gate transistors.
This is not a factually correct statement. Multi-gate transistors are used because they are more energy-efficient, perform better, and can be scaled to smaller dimensions than traditional planar CMOS devices. The extra gates give better electrostatic control over the MOSFET channel, but they do not allow the device to perform operations on more than one bit of data at once.
https://en.wikipedia.org/wiki/Multigate_device
Oh, for God's sake, I ignored this at first but now it's been modded up.
15 billion is the transistor count for the GPU logic. It's not the transistor count for the HBM2 memory installed alongside the GPU on the interposer. Adding an interposer does not suffice to make it all the same chip (hint from TFS: "multichip module").
FinFET is neither necessary nor sufficient for multi-level-cell-like bit representation. That's also a flash storage technology, not a logic or volatile memory technology (at least in mass-produced products).
It's 15 days to Weed Day. Put down whatever you're smoking and get back to work.
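For anyone wondering what "multi-level-cell-like bit representation" means in the flash sense mentioned above: a cell that can hold one of N distinguishable charge levels stores log2(N) bits. The Python sketch below only illustrates that relationship; the function name is made up, and none of it describes how the P100's logic transistors work, since those act as two-state switches.

```python
import math


def bits_per_cell(levels):
    """Bits stored by a cell that can hold one of `levels` distinguishable states."""
    if levels < 2:
        raise ValueError("a cell needs at least two states to store information")
    return math.log2(levels)


if __name__ == "__main__":
    # Common flash cell types and their number of charge levels.
    for name, levels in [("SLC", 2), ("MLC", 4), ("TLC", 8)]:
        print(f"{name}: {levels} levels -> {bits_per_cell(levels):g} bit(s) per cell")
```

The extra bits come from sensing several charge levels in a storage cell, not from the number of gates on a transistor.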