NVIDIA Fermi Architecture

Fig. 1. NVIDIA Fermi architecture
Convention in figures: orange - scheduling and dispatch; green - execution; light blue - registers and caches.

Fermi (microarchitecture)

From Wikipedia, the free encyclopedia

Fermi is the codename for a GPU microarchitecture developed by Nvidia as the successor to the Tesla microarchitecture. It was the primary microarchitecture used in the GeForce 400 Series and GeForce 500 Series. It was followed by Kepler, and was used alongside Kepler in the GeForce 600 Series, GeForce 700 Series, and GeForce 800 Series, in the latter two only in mobile GPUs. All desktop Fermi GPUs were manufactured on a 40 nm process; mobile Fermi GPUs were manufactured on 40 nm and 28 nm processes.

Overview

Fermi Graphics Processing Units (GPUs) feature 3.0 billion transistors; a schematic is shown in Fig. 1.

  • Streaming Multiprocessor (SM): composed of 32 CUDA cores (see Streaming Multiprocessor and CUDA core sections).
  • GigaThread global scheduler: distributes thread blocks to SM thread schedulers and manages the context switches between threads during execution (see Warp Scheduling section).
  • Host interface: connects the GPU to the CPU via a PCI-Express v2 bus (peak transfer rate of 8 GB/s).
  • DRAM: supports up to 6 GB of GDDR5 DRAM memory thanks to the 64-bit addressing capability (see Memory section).
  • Clock frequency: 1.5 GHz (not released by NVIDIA, but estimated by Insight 64).
  • Peak performance: 1.5 TFlops.
  • Global memory clock: 2 GHz.
  • DRAM bandwidth: 192GB/s.
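The headline figures in the list above are mutually consistent, which a quick calculation can confirm. The sketch below assumes that each CUDA core retires one single-precision FMA per clock (counted as two floating-point operations) and uses the 16-SM count given in the L2 Cache description. The 384-bit memory interface and the two transfers per memory clock are assumptions that are not stated in this article; they are chosen because they reproduce the quoted bandwidth.

```python
# Back-of-the-envelope check of the Fermi headline figures.
SMS = 16                 # SM count, as given in the L2 Cache description
CORES_PER_SM = 32        # CUDA cores per SM
CORE_CLOCK_GHZ = 1.5     # estimated core clock from the list above
FLOPS_PER_FMA = 2        # one multiply plus one add

# Peak single-precision rate: every core retires one FMA per clock.
peak_gflops = SMS * CORES_PER_SM * FLOPS_PER_FMA * CORE_CLOCK_GHZ

MEM_CLOCK_GHZ = 2.0      # global memory clock from the list above
TRANSFERS_PER_CLOCK = 2  # ASSUMPTION: double data rate signalling
BUS_WIDTH_BITS = 384     # ASSUMPTION: memory interface width, not in the article

# Bandwidth in GB/s: transfers per second times bytes per transfer.
bandwidth_gb_s = MEM_CLOCK_GHZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BITS / 8

print(f"peak single precision: {peak_gflops / 1000:.2f} TFLOPS")
print(f"DRAM bandwidth: {bandwidth_gb_s:.0f} GB/s")
```

Under these assumptions the results round to the 1.5 TFlops and 192 GB/s quoted above.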

Streaming Multiprocessor

Each SM features 32 single-precision CUDA cores, 16 load/store units, four Special Function Units (SFUs), a 64KB block of high speed on-chip memory (see L1+Shared Memory subsection) and an interface to the L2 cache (see L2 Cache subsection).

Load/Store Units: Allow source and destination addresses to be calculated for 16 threads per clock, and load and store data from/to cache or DRAM.

Special Function Units (SFUs): Execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock, so with four SFUs a 32-thread warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
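The eight-clock figure follows from dividing the warp size by the number of units. A minimal sketch of that arithmetic, assuming each unit handles one thread per clock; the function name is illustrative, and the two-clock result for the load/store units is an inference from the "16 threads per clock" figure rather than a number given in the text:

```python
import math

WARP_SIZE = 32  # threads per warp


def clocks_per_warp(units: int) -> int:
    """Clocks needed to pass one warp through `units` parallel units,
    assuming each unit handles one thread per clock."""
    return math.ceil(WARP_SIZE / units)


for name, units in [("SFU", 4), ("load/store", 16)]:
    print(f"{name}: {clocks_per_warp(units)} clocks per warp")
```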

CUDA core

Integer Arithmetic Logic Unit (ALU): Supports full 32-bit precision for all instructions, consistent with standard programming language requirements. It is also optimized to efficiently support 64-bit and extended precision operations.

Floating Point Unit (FPU): Implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction (see Fused Multiply-Add subsection) for both single and double precision arithmetic. Up to 16 double precision fused multiply-add operations can be performed per SM, per clock.

Fused Multiply-Add

Fused Multiply-Add (FMA) performs multiplication and addition (i.e., A*B+C) with a single final rounding step, so the intermediate product is not rounded before the addition. FMA is therefore more accurate than performing the two operations separately.
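The effect of the single rounding step can be shown on a CPU. The sketch below does not run on GPU hardware: it emulates a fused operation in double precision by computing A*B+C exactly with Python's `fractions.Fraction` and rounding once, and compares it with the ordinary two-step expression. The helper names and the input values are chosen purely for illustration.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA: evaluate a*b+c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


# a*b is exactly 1 - 2**-60, which is too close to 1.0 to survive
# rounding to a double, so the unfused result loses it completely.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print("separate:", multiply_then_add(a, b, c))
print("fused:   ", fused_multiply_add(a, b, c))
```

With these inputs the separate version returns zero, while the fused version keeps the small negative remainder.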

Warp Scheduling

The Fermi architecture uses a two-level, distributed thread scheduler.

Each SM can issue instructions consuming any two of the four green execution columns shown in the schematic in Fig. 1. For example, the SM can mix 16 operations from the 16 first-column cores with 16 operations from the 16 second-column cores, or 16 operations from the load/store units with four from the SFUs, or any other combination the program specifies.

Note that 64-bit floating point operations consume both of the first two execution columns. This implies that an SM can issue up to 32 single-precision (32-bit) floating point operations or 16 double-precision (64-bit) floating point operations at a time.
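The column arithmetic above can be written out as a small model. This is only a sketch: the column names are labels invented here for the four green columns of Fig. 1, and the widths are the ones quoted in the text.

```python
# Width (operations per issue) of each execution column in one SM.
# The names are hypothetical labels for the four green columns in Fig. 1.
COLUMN_WIDTH = {"core_a": 16, "core_b": 16, "load_store": 16, "sfu": 4}


def ops_per_issue(first: str, second: str) -> int:
    """Operations issued when two different columns are used together."""
    if first == second:
        raise ValueError("the two columns must be different")
    return COLUMN_WIDTH[first] + COLUMN_WIDTH[second]


def fp64_ops_per_issue() -> int:
    """A 64-bit operation occupies both core columns, so the pair
    delivers half as many results as it does for 32-bit operations."""
    return (COLUMN_WIDTH["core_a"] + COLUMN_WIDTH["core_b"]) // 2


print(ops_per_issue("core_a", "core_b"))   # two core columns
print(ops_per_issue("load_store", "sfu"))  # load/store plus SFUs
print(fp64_ops_per_issue())                # double precision
```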

GigaThread Engine: The GigaThread engine schedules thread blocks to the various SMs.

Dual Warp Scheduler: Threads are scheduled in groups of 32 called warps. At the SM level, each warp scheduler distributes warps to its execution units. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. The dual warp scheduler selects two warps and issues one instruction from each warp to a group of 16 cores, 16 load/store units, or 4 SFUs. Most instructions can be dual-issued: two integer instructions, two floating-point instructions, or a mix of integer, floating-point, load, store, and SFU instructions can be issued concurrently. Double-precision instructions do not support dual dispatch with any other operation.
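The dual-issue rules can be summarised as a predicate over a pair of instruction kinds. The sketch below is a simplification, not NVIDIA's scheduling logic: the instruction labels are hypothetical, and the rule that two load/store or two SFU instructions cannot pair is an assumption inferred from there being only one group of each unit per SM.

```python
from collections import Counter

# Execution groups per SM: two groups of 16 cores, one group of 16
# load/store units, and one group of 4 SFUs.
GROUPS_AVAILABLE = {"core": 2, "load_store": 1, "sfu": 1}

# Hypothetical instruction labels mapped to the group that executes them.
GROUP_FOR = {
    "int": "core",
    "fp32": "core",
    "load": "load_store",
    "store": "load_store",
    "sfu": "sfu",
}


def can_dual_issue(first: str, second: str) -> bool:
    """Return True if the two instructions can be issued in the same cycle."""
    # Double precision never dual-dispatches with any other operation.
    if "fp64" in (first, second):
        return False
    # ASSUMPTION: a pair is rejected if it needs more groups of one kind
    # than the SM provides.
    needed = Counter(GROUP_FOR[op] for op in (first, second))
    return all(count <= GROUPS_AVAILABLE[group] for group, count in needed.items())


print(can_dual_issue("int", "fp32"))   # mix of integer and floating point
print(can_dual_issue("fp64", "load"))  # double precision blocks dual issue
```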

Memory

Fermi provides an L1 cache per SM and a unified L2 cache that services all operations (load, store, and texture).

Registers: Each SM has 32K (32,768) 32-bit registers, i.e. 128 KB of register storage. Each thread has access to its own registers and not those of other threads. A CUDA kernel can use at most 63 registers per thread. The number of registers available per thread degrades gracefully from 63 to 21 as the number of resident threads (and hence the demand on the register file) increases. Registers have a very high bandwidth: about 8,000 GB/s.
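The range from 63 down to 21 registers follows from sharing one register file among the resident threads. A minimal sketch, assuming the register file holds 32,768 32-bit registers per SM and that an SM keeps at most 1,536 threads resident (the limit NVIDIA documents for this generation, not a figure given in this article). Allocation granularity is ignored, so real compilers may report slightly different numbers.

```python
REGISTERS_PER_SM = 32 * 1024   # 32-bit registers in one SM's register file
MAX_REGISTERS_PER_THREAD = 63  # per-thread cap quoted in the text
MAX_RESIDENT_THREADS = 1536    # ASSUMPTION: per-SM thread limit, not in the article


def registers_per_thread(resident_threads: int) -> int:
    """Registers each thread can have when `resident_threads` share one SM."""
    if not 1 <= resident_threads <= MAX_RESIDENT_THREADS:
        raise ValueError("resident thread count out of range")
    return min(MAX_REGISTERS_PER_THREAD, REGISTERS_PER_SM // resident_threads)


for threads in (512, 1024, 1536):
    print(threads, "threads ->", registers_per_thread(threads), "registers each")
```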

L1+Shared Memory: On-chip memory that can be used to cache data for individual threads (register spilling/L1 cache), to share data among several threads (shared memory), or both. This 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. Shared memory is accessible only by the threads in the same thread block; it enables those threads to cooperate, facilitates extensive reuse of on-chip data, and greatly reduces off-chip traffic. It provides low-latency access (10-20 cycles) and very high bandwidth (1,600 GB/s) to moderate amounts of data (such as intermediate results in a series of calculations, one row or column of data for matrix operations, a line of video, etc.). David Patterson says that this shared memory uses the idea of a local scratchpad.[1]
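The choice between the two splits affects how many thread blocks can be resident at once when a kernel uses a lot of shared memory. The sketch below considers shared memory as the only limit (registers and other per-SM limits are ignored), and the 12 KB per-block figure is a made-up example rather than a value from the article.

```python
ON_CHIP_KB = 64
# The two configurations described in the text: (shared KB, L1 KB).
VALID_SPLITS = {(48, 16), (16, 48)}


def blocks_fitting(shared_kb: int, shared_kb_per_block: int) -> int:
    """Thread blocks that fit in one SM, limited by shared memory only."""
    if (shared_kb, ON_CHIP_KB - shared_kb) not in VALID_SPLITS:
        raise ValueError("shared memory must be configured as 48 KB or 16 KB")
    if shared_kb_per_block <= 0:
        raise ValueError("per-block shared memory must be positive")
    return shared_kb // shared_kb_per_block


# Hypothetical kernel that needs 12 KB of shared memory per block.
print(blocks_fitting(48, 12))  # 48 KB shared / 16 KB L1
print(blocks_fitting(16, 12))  # 16 KB shared / 48 KB L1
```

A kernel that barely touches shared memory would instead tend to benefit from the larger L1 configuration.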

Local Memory: Local memory is a memory location used to hold "spilled" registers. Register spilling occurs when a thread block requires more register storage than is available on an SM. Local memory is used only for some automatic variables (those declared in device code without any of the __device__, __shared__, or __constant__ qualifiers). Generally, an automatic variable resides in a register, except for the following: (1) arrays that the compiler cannot determine are indexed with constant quantities; (2) large structures or arrays that would consume too much register space; (3) any variable the compiler decides to spill to local memory when a kernel uses more registers than are available on the SM.

L2 Cache: A 768 KB unified L2 cache, shared among the 16 SMs, that services all loads and stores from/to global memory, including copies to/from the CPU host, as well as texture requests. The L2 cache subsystem also implements atomic operations, used for managing access to data that must be shared across thread blocks or even kernels.

Global Memory: Accessible by all threads as well as by the host (CPU). It has high latency (400-800 cycles).
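Putting the figures quoted in this article side by side shows why keeping data on chip matters. The short calculation below only restates those numbers as ratios; the dictionary keys are labels chosen here.

```python
# Bandwidth figures quoted in this article, in GB/s.
BANDWIDTH_GB_S = {"registers": 8000, "shared memory": 1600, "DRAM": 192}

# Latency ranges quoted in this article, in cycles (low, high).
LATENCY_CYCLES = {"shared memory": (10, 20), "global memory": (400, 800)}

for level, bandwidth in BANDWIDTH_GB_S.items():
    ratio = bandwidth / BANDWIDTH_GB_S["DRAM"]
    print(f"{level}: {ratio:.1f}x the DRAM bandwidth")

shared_low, shared_high = LATENCY_CYCLES["shared memory"]
global_low, global_high = LATENCY_CYCLES["global memory"]
# Smallest and largest latency gap implied by the two ranges.
min_gap = global_low / shared_high
max_gap = global_high / shared_low
print(f"global memory latency is {min_gap:.0f}x to {max_gap:.0f}x that of shared memory")
```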

References

  1. Patterson, David (September 30, 2009). "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges". Parallel Computing Research Laboratory & NVIDIA. Retrieved 3 October 2013.