Pascal was one of the biggest leaps in gaming performance that NVIDIA had achieved in recent years after a spate of incremental updates. Maxwell was quite the game changer as well, but nowhere close to what Pascal achieved. However, Volta, NVIDIA's next GPU architecture and Pascal's successor, seems all set to clobber it. At the recent GPU Technology Conference in San Jose, NVIDIA CEO Jen-Hsun Huang unveiled the first card built on the Volta microarchitecture, the Tesla V100. Going by the Tesla moniker, you've probably guessed that this is no gaming graphics card, and you're right. NVIDIA, like most other technology companies out there, brings out its latest GPU technology in the enterprise space before unleashing consumer-oriented gaming graphics cards. So the GeForce-branded GTX 2080 and GTX 2070 will be coming out much later; we expect them in the first half of 2018. In the meantime, though, we can take a closer look at Volta.
Tesla V100: Built for Machine Learning and High-Performance Computing
Tesla cards are increasingly being deployed not only by universities that need HPC clusters but also by up-and-coming startups operating in the Artificial Intelligence / Machine Learning domain. It's this very domain that has posed some unique challenges with respect to the kind of compute tasks Tesla cards have to handle. And to meet the rapidly growing needs of High-Performance Computing, the Tesla V100 incorporates an entirely new component in the GPU die: Tensor Cores.
You've probably heard of CUDA cores, the processing units of NVIDIA's unified architecture called Compute Unified Device Architecture, which did away with specialised components in favour of a jack-of-all-trades processing unit. Hundreds and even thousands of these CUDA cores are packed into the GPU that lies at the heart of each graphics card. However, while traditional computing tasks have benefited significantly from the GPGPU compute methodology, the needs of AI and ML differ a little, enough to warrant an entirely new set of cores. These Tensor Cores sit alongside the existing CUDA cores and step in when needed to run GEMM (general matrix multiply) operations. With the introduction of these Tensor Cores, NVIDIA claims the Tesla V100 can train the ResNet-50 deep neural network about 2.4x faster than the previous-gen Tesla P100 based on the Pascal architecture. Moreover, the V100 can perform inference with ResNet-50 about 3.7x faster than the P100. As it stands, the Tesla V100 outdoes the Tesla P100 by an average factor of at least 1.5x across workloads like DGEMM, FFT, physics, seismic and STREAM, all GPGPU compute tasks generally undertaken by enterprise graphics cards.
Peering down the microscope – NVIDIA Volta GV100 GPU
As with all previous graphics cards since CUDA was introduced, the Volta-based GV100 follows a similar hierarchy: the entire GPU is organised into six GPCs (Graphics Processing Clusters); each GPC consists of seven TPCs (Texture Processing Clusters), for 42 TPCs in total; each TPC has two SMs (Streaming Multiprocessors), bringing the total to 84 SMs; and lastly, there are eight 512-bit memory controllers. Each SM is further divided into 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, four texture units and eight Tensor Cores. A fully functional GV100 GPU thus has a massive 5376 CUDA cores!
However, owing to yields, such a configuration will never be seen in a shipping product. The Tesla V100 uses GV100 GPUs with 80 active SMs, so it effectively has 5120 CUDA cores, a 42.9% increase over the Tesla P100's 3584. Now that's a massive improvement.
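The core-count arithmetic above can be sketched in a few lines. The per-unit counts come straight from the figures in this section; only the variable names are ours.

```python
# GV100 core-count arithmetic, using the unit counts quoted above.
GPCS = 6            # Graphics Processing Clusters per GPU
TPCS_PER_GPC = 7    # Texture Processing Clusters per GPC (42 total)
SMS_PER_TPC = 2     # Streaming Multiprocessors per TPC
FP32_PER_SM = 64    # FP32 (CUDA) cores per SM
TENSOR_PER_SM = 8   # Tensor Cores per SM

total_sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC       # 84 SMs on a full die
full_cuda_cores = total_sms * FP32_PER_SM           # 5376 CUDA cores

# The shipping Tesla V100 enables only 80 of the 84 SMs for yield reasons.
v100_sms = 80
v100_cuda_cores = v100_sms * FP32_PER_SM            # 5120 CUDA cores
v100_tensor_cores = v100_sms * TENSOR_PER_SM        # 640 Tensor Cores

# Increase over the Tesla P100's 3584 CUDA cores, in per cent
p100_cuda_cores = 3584
increase = (v100_cuda_cores / p100_cuda_cores - 1) * 100   # ~42.9
```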
Up until Pascal, GPUs had the same cores handling FP32 and INT32 operations, so one operation had to wait before the other could be performed. With Volta, there are now dedicated cores for FP32 and INT32, allowing simultaneous execution. The GV100 SM incorporates 64 FP32 cores, 32 FP64 cores and 64 INT32 cores divided across four processing blocks inside an SM. So you now have more blocks that can simultaneously work on twice the number of threads to get more done in the same amount of time.
Tensor Cores for AI and ML
In the benchmarks mentioned earlier, NVIDIA pointed out how the GV100 performs way better than the previous-gen GP100 thanks to the newly incorporated Tensor Cores. So what exactly do they do?
Well, neural networks take large sets of data and then trace patterns within them. This process involves categorising data into neat little sets, which are then given weights based on how important they are. It can take a really long time, and the underlying calculation is matrix multiplication. Matrices are arrangements of numbers in grids, and matrix multiplication involves multiplying such grids with each other: every cell of the result needs a whole row multiplied against a whole column. In the world of computing, multiplication is already one of the more time-consuming operations, so imagine how painful multiplying entire matrices millions of times over would be. This is the kind of math that brings even tough computers to their knees.
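To see why this gets expensive, here is a minimal textbook matrix multiply that also counts the scalar multiplications it performs. For two n-by-n matrices the count is n cubed, which is why large networks rack up billions of multiplications per layer. (This is a generic illustration, not NVIDIA's implementation.)

```python
def matmul(a, b):
    """Naive matrix multiplication of a (n x m) by b (m x p).

    Returns the product along with the number of scalar
    multiplications performed, to show how the cost grows.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    muls = 0
    for i in range(n):          # each output row...
        for j in range(p):      # ...times each output column...
            for k in range(m):  # ...needs a full row-by-column pass
                c[i][j] += a[i][k] * b[k][j]
                muls += 1
    return c, muls
```

Multiplying two 2x2 grids already takes 8 multiplications; two 4096x4096 matrices, a size common in deep learning, need roughly 68 billion.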
These Tensor Cores are built especially to tackle these multiplications. The Tensor Cores in the Tesla V100 can churn out nine times as many operations as the previous-gen Tesla P100, which is why the ResNet-50 training time improved by such a large factor.
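Per NVIDIA's description of Volta, the building block each Tensor Core executes is a 4x4 matrix multiply-accumulate, D = A x B + C, with FP16 inputs and FP32 accumulation. A rough functional sketch (plain Python floats standing in for the mixed-precision hardware):

```python
def tensor_core_fma(a, b, c):
    """Emulate the operation a Tensor Core performs: D = A x B + C,
    where A, B, C and D are 4x4 matrices.

    On the real hardware A and B are FP16 and the accumulation runs
    at FP32; here ordinary Python floats stand in for both.
    """
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) + c[i][j]
         for j in range(4)]
        for i in range(4)
    ]
```

Larger GEMMs are tiled into thousands of these small fused operations, which is what lets the hardware batch them so efficiently.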
Faster HBM2 Memory
HBM2 (High Bandwidth Memory) was present in the Pascal-based Tesla P100 as well, but this new generation of HBM2 from Samsung, coupled with an improved memory controller, reportedly delivers 1.5x the memory bandwidth of Pascal. The HBM2 spec calls for up to eight memory dies stacked one upon the other, with each package rated for 256 GB/s. The memory subsystem on the V100 is rated to deliver 900 GB/s, which means we should see four HBM2 packages on the Tesla V100. The shortfall from the theoretical peak is attributed to reduced clock speeds and a transfer efficiency of 95%.
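The numbers above line up as follows. Four packages at 256 GB/s give a theoretical 1024 GB/s; applying the quoted 95% transfer efficiency, the remaining gap down to the rated 900 GB/s implies the memory runs somewhat below spec clocks. The "implied clock derate" below is our back-of-the-envelope interpretation, not an NVIDIA figure.

```python
# Back-of-the-envelope HBM2 bandwidth figures from the section above.
PER_STACK_PEAK_GBS = 256   # HBM2 spec rating per package, in GB/s
STACKS = 4                 # four HBM2 packages on the Tesla V100
EFFICIENCY = 0.95          # quoted transfer efficiency
RATED_GBS = 900            # V100's rated memory bandwidth

theoretical_peak = PER_STACK_PEAK_GBS * STACKS     # 1024 GB/s

# What fraction of spec clocks would reconcile 900 GB/s with the
# 1024 GB/s peak at 95% efficiency (an assumption, not a spec value):
implied_clock_factor = RATED_GBS / (theoretical_peak * EFFICIENCY)  # ~0.93
```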
NVLink 2.0

Introduced back in 2014, NVLink is NVIDIA's interconnect for communication between processors over short distances. With the Volta cards, NVIDIA has introduced NVLink 2.0, which isn't that big an improvement over the previous generation but makes for a notable rise in performance nonetheless. The GV100 supports up to six NVLink connections, each transferring 25 GB/s in each direction (50 GB/s bidirectional), bringing the total bandwidth to 300 GB/s. Such high bandwidth is necessitated by the fact that AI and ML involve moving large data sets between different processing units, and the faster each V100 can talk to another, the higher the performance numbers will be.
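The NVLink total works out like so: six links, each moving 25 GB/s per direction, counted bidirectionally.

```python
# NVLink 2.0 aggregate bandwidth on the GV100.
LINKS = 6                  # NVLink connections supported per GPU
GBS_PER_LINK_PER_DIR = 25  # GB/s each way, per link
DIRECTIONS = 2             # bandwidth is quoted bidirectionally

total_bandwidth = LINKS * GBS_PER_LINK_PER_DIR * DIRECTIONS   # 300 GB/s
```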
So which of these will be there on the GTX 2080?
Consumer-grade graphics cards don't need such insane bandwidth, as graphical data isn't as demanding at popular resolutions. But it makes for good future-proofing as the minimum resolution games are played at goes up. As of now, the GTX 1080 Ti is a pretty decent graphics card for 4K gameplay and for current-gen VR as well. But with HTC having already announced that it won't bring out the next generation of the Vive unless it can deliver a significant improvement, it's up to the hardware manufacturers, namely the CPU and GPU makers, to step up their game. So should a Vive 2 be on the horizon with dual 4K displays, one for each eye, a Volta-based GTX 2080 might well carry HBM2 memory. Otherwise, we'll see GDDR6 for a generation before HBM2, or perhaps HBM3, gets picked for consumer GPUs, in which case the GTX 3080 might be the one to carry it. Only time will tell.