NVIDIA details how it doubled FP32 performance

A few days after announcing its new GeForce RTX cards, NVIDIA is starting to share their secrets. While we are still waiting for the document detailing the consumer implementation of the Ampere architecture, the first details are emerging, in particular on computing power.

As we saw when the new GeForce RTX 3070, 3080 and 3090 were announced, many predictions turned out to be wrong despite two years of continuous rumors and leaks. From the traversal coprocessor to the 5,376 CUDA Cores (the product of 21 by 256) to the 7nm process, almost everything was wrong.

Ampere hid many mysteries

But the manufacturer kept one secret better than the others, and it explains why so many were surprised by Ampere's performance/price ratio and energy efficiency: its “bi-GPU by design” aspect. In other words, the manufacturer has almost doubled the number of CUDA Cores compared to the previous generation.

GeForce RTX 30 Series Ampere

Admittedly, moving from a 12nm process to Samsung's 8nm process (preferred to TSMC's 7nm because of its cost and availability) helps. But that is not enough to explain this result. Some have even questioned the veracity of the announced figures, which are the stuff of dreams: 20, 30 and 36 TFLOPS of computing power.

As we recalled in our previous article on Ampere's 3D rendering performance, the GeForce RTX 2080 was rated at 10.1 TFLOPS, against 13.6 TFLOPS for the 2080 Ti and 16.3 TFLOPS for the Quadro RTX 8000.
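
As a rough sanity check on the announced figures, theoretical peak FP32 throughput is usually estimated as the number of FP32 units, times two operations per cycle (one fused multiply-add), times the clock frequency. The sketch below applies that rule of thumb. The core counts and boost clocks are not quoted in this article: they are assumptions taken from NVIDIA's public spec sheets, and the function name is purely illustrative.

```python
def peak_fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    """Theoretical peak, assuming every FP32 unit retires one FMA (2 FLOPs) per cycle."""
    return cuda_cores * 2 * boost_clock_ghz / 1000.0


# (CUDA Cores, boost clock in GHz): assumed values, not taken from this article
AMPERE_CARDS = {
    "GeForce RTX 3070": (5888, 1.725),
    "GeForce RTX 3080": (8704, 1.710),
    "GeForce RTX 3090": (10496, 1.695),
}

if __name__ == "__main__":
    for name, (cores, clock) in AMPERE_CARDS.items():
        print(f"{name}: {peak_fp32_tflops(cores, clock):.1f} TFLOPS")
```

Under these assumptions the three cards land at roughly 20.3, 29.8 and 35.6 TFLOPS, in line with the rounded 20, 30 and 36 TFLOPS announced. This is a peak figure only: it says nothing about how often all FP32 units can actually be kept busy.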


Double the computing power, but not all units

However, NVIDIA's message was clear to those who know how to listen: the number of FP32 units has been doubled. These are the units in charge of calculations on 32-bit floats, essential in video games. The others, in charge of calculations on integers (INT) or special mathematical functions (SFU), matter less there.

The explanation therefore had to lie there: a change in the balance within the Streaming Multiprocessors (SM). Our bet was that NVIDIA had found a way to double the number of FP32 units, the famous CUDA Cores, at a lower cost, without necessarily doing the same with the rest of the architecture. It was the most logical explanation.

Turing had introduced concurrent execution between FP32 and INT32 units; Ampere doubles the FP32 units

The NVIDIA team has just confirmed it in an AMA (Ask Me Anything) held on Reddit:

« Ampere's SMs incorporate a new datapath design that can handle FP32 and INT32 operations. In each partition, one datapath consists of 16 FP32 CUDA Cores capable of handling 16 FP32 instructions per clock cycle. The other consists of 16 FP32 CUDA Cores and 16 INT32 Cores.

Thus, each Ampere SM partition can process 32 FP32 instructions per cycle, or 16 FP32 and 16 INT32 per cycle. The four partitions of an SM combined can process 128 FP32 instructions per cycle, double the Turing figure, or 64 FP32 and 64 INT32. »
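
The arithmetic in this quote can be checked with a small model. The sketch below is only an illustration: the class and field names are hypothetical, the Turing layout (16 FP32 lanes and 16 INT32 lanes per partition) is inferred from the "double" in the quote, and the count of four partitions per SM comes from the 128 figure.

```python
from dataclasses import dataclass

PARTITIONS_PER_SM = 4  # 4 x 32 = 128 FP32 per cycle, as in the quote


@dataclass(frozen=True)
class Partition:
    fp32_only_lanes: int    # datapath that executes FP32 only
    shared_fp32_lanes: int  # FP32 lanes on the datapath shared with INT32
    int32_lanes: int        # INT32 lanes on that second datapath

    def peak_fp32(self) -> int:
        """FP32 instructions per cycle when both datapaths run FP32."""
        return self.fp32_only_lanes + self.shared_fp32_lanes

    def mixed(self) -> tuple[int, int]:
        """(FP32, INT32) instructions per cycle when the second datapath runs INT32."""
        return (self.fp32_only_lanes, self.int32_lanes)


# Assumed layouts: Ampere from the quote, Turing inferred from "double"
AMPERE = Partition(fp32_only_lanes=16, shared_fp32_lanes=16, int32_lanes=16)
TURING = Partition(fp32_only_lanes=16, shared_fp32_lanes=0, int32_lanes=16)


def per_sm(per_partition: int) -> int:
    return per_partition * PARTITIONS_PER_SM


if __name__ == "__main__":
    print("Ampere SM, FP32 only:", per_sm(AMPERE.peak_fp32()))
    print("Turing SM, FP32 only:", per_sm(TURING.peak_fp32()))
    fp32, int32 = AMPERE.mixed()
    print("Ampere SM, mixed:", per_sm(fp32), "FP32 +", per_sm(int32), "INT32")
```

The model reproduces the quoted figures: 128 FP32 per cycle per Ampere SM against 64 for Turing, or 64 FP32 plus 64 INT32 in mixed mode.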

Thus, it is indeed the number of FP32 units that has been doubled, and only that: the INT32 units did not get the same treatment. We lack details on the granularity of the mechanism. To what extent will it be possible to mix FP32 and INT32 instructions in the datapath that handles both, alongside the one that is exclusively FP32?
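
To give a feel for why this granularity matters, here is a deliberately idealized sketch. It assumes the shared datapath can switch between FP32 and INT32 on every cycle at no cost, which is exactly the detail that remains unknown. The function names and the model itself are illustrative assumptions, not NVIDIA's description of the scheduler.

```python
def ampere_ops_per_cycle(int_fraction: float) -> float:
    """Idealized per-partition throughput for a given share of INT32 instructions.

    Assumed model: 32 lanes in total, but INT32 work can only use the
    16 lanes of the shared datapath.
    """
    if not 0.0 <= int_fraction <= 1.0:
        raise ValueError("int_fraction must be between 0 and 1")
    if int_fraction == 0.0:
        return 32.0
    return min(32.0, 16.0 / int_fraction)


def turing_ops_per_cycle(int_fraction: float) -> float:
    """Same idealization for Turing: 16 FP32-only lanes and 16 INT32-only lanes."""
    if not 0.0 <= int_fraction <= 1.0:
        raise ValueError("int_fraction must be between 0 and 1")
    return 16.0 / max(int_fraction, 1.0 - int_fraction)


if __name__ == "__main__":
    for share in (0.0, 0.25, 0.5, 1.0):
        a = ampere_ops_per_cycle(share)
        t = turing_ops_per_cycle(share)
        print(f"INT32 share {share:.2f}: Ampere {a:.1f}, Turing {t:.1f} ops/cycle")
```

In this toy model the gain is twofold for pure FP32 code and shrinks as the share of integer work grows, disappearing entirely at a 50/50 mix. Real behavior depends on scheduling details that NVIDIA has not yet published.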


It will also be interesting to see how the rest of the architecture has been adapted. NVIDIA mentions doubled L1 cache throughput (from 116 GB/s to 219 GB/s). Each GPC also gets two partitions of 8 ROPs, twice as many as on Turing, even though the memory interface is unchanged.

The manufacturer should reveal all the details of the architecture before September 14. We will know more then.

Amount of memory, DLSS 2.1, NVENC / NVDEC

This exchange was not only an opportunity to talk about the computing power of the new GeForce RTX cards. We also learn that 10 GB of GDDR6X was deemed more than sufficient for the RTX 3080 and 4K gaming:

« If you look at Shadow of the Tomb Raider, Assassin's Creed Odyssey, Metro Exodus, Wolfenstein Youngblood, Gears of War 5, Borderlands 3 and Red Dead Redemption 2 at this resolution with this card, at maximum settings (including high-definition texture packs) and with RTX enabled when the game supports it, you are in the 60 to 100 fps range with 4 to 6 GB of memory used ».

In fact, models with 16 GB or more of memory are generally designed to draw in shoppers with a marketing argument that is easy to print in large type on the box, and which is justified by professional applications rather than by games, even the latest ones. But it is often quite effective.

We also expect AMD to offer its Big Navi (2X, RDNA2) with 16 GB of memory. Will NVIDIA then respond with an RTX 3080 Super 20 GB? We will see in due course.


We also learn that DLSS is moving to version 2.1, with support for 8K, a 9x scaling option, and the ability to handle VR. The SDK has been updated and is available to developers. The team also confirms the information given in its spec sheets on the multimedia side: the NVENC video encoding engine (7th generation) is unchanged from Turing, while decoding (NVDEC) moves to the 5th generation to support AV1.

