Real-Time Rendering, Chapter 18: Pipeline Optimization

“We should forget about small efficiencies, say about 97% of the time: Premature optimization is the root of all evil.”
—Donald Knuth


Throughout this volume, algorithms have been presented within a context of quality, memory, and performance trade-offs. In this chapter we will discuss performance problems and opportunities that are not associated with particular algorithms. Bottleneck detection and optimization are the focus, starting with making small, localized changes, and ending with techniques for structuring an application as a whole to take advantage of multiprocessing capabilities.


As we saw in Chapter 2, the process of rendering an image is based on a pipelined architecture with four conceptual stages: application, geometry processing, rasterization, and pixel processing. There is always one stage that is the bottleneck, that is, the slowest process in the pipeline. This bottleneck stage sets the limit for the throughput, i.e., the total rendering performance, and so is a prime candidate for optimization.


Optimizing the performance of the rendering pipeline resembles the procedure of optimizing a pipelined processor (CPU) [715] in that it consists mainly of two steps. First, the bottleneck of the pipeline is located. Second, that stage is optimized in some way, and after that, step one is repeated if the performance goals have not been met. Note that the bottleneck may or may not be located at the same place after the optimization step. It is a good idea to put only enough effort into optimizing the bottleneck stage so that the bottleneck moves to another stage. Several other stages may have to be optimized before this stage becomes the bottleneck again. For this reason, effort should not be wasted on over-optimizing a stage.
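
The two-step loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-stage timings, the target of 30 ms, and the fixed 5 ms saved per optimization pass are all hypothetical, and `find_bottleneck` and `optimize_until` are names invented for this sketch. Real stage timings would come from a profiler.

```python
def find_bottleneck(stage_ms):
    """Step one: return the name of the slowest stage."""
    return max(stage_ms, key=stage_ms.get)


def optimize_until(stage_ms, target_ms, step_ms=5, max_passes=100):
    """Repeat 'locate, then optimize' until the slowest stage meets the target.

    Each pass shaves a hypothetical step_ms off the current bottleneck and
    then re-measures, so effort stops as soon as the bottleneck moves.
    """
    times = dict(stage_ms)
    history = []  # which stage was worked on in each pass
    for _ in range(max_passes):
        worst = find_bottleneck(times)
        if times[worst] <= target_ms:
            break  # performance goal met
        times[worst] -= step_ms  # step two: optimize that stage a little
        history.append(worst)
    return times, history


# Hypothetical per-frame timings in milliseconds.
stages = {"application": 50, "geometry": 25, "rasterization": 20, "pixel": 40}
final_times, history = optimize_until(stages, target_ms=30)
print(final_times)
print(history)
```

In this run the work alternates between the application and pixel stages once they become equally slow, while the geometry and rasterization stages are never touched, since they were never the bottleneck.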


The location of the bottleneck may change within a frame, or even within a draw call. At one moment the geometry stage may be the bottleneck because many tiny triangles are rendered. Later in the frame, pixel processing could be the bottleneck because a heavyweight procedural shader is evaluated at each pixel. Within a pixel shader, execution may stall because the texture queue is full, or take more time as a particular loop or branch is reached. So, when we talk about, say, the application stage being the bottleneck, we mean it is the bottleneck most of the time during that frame. There is rarely only one bottleneck.


Another way to capitalize on the pipelined construction is to recognize that when the slowest stage cannot be optimized further, the other stages can be made to work just as much as the slowest stage. This will not change performance, since the speed of the slowest stage will not be altered, but the extra processing can be used to improve image quality [1824]. For example, say that the bottleneck is in the application stage, which takes 50 milliseconds (ms) to produce a frame, while the others each take 25 ms. This means that without changing the speed of the rendering pipeline (50 ms equals 20 frames per second), the geometry and the rasterizer stages could also do their work in 50 ms. For example, we could use a more sophisticated lighting model or increase the level of realism with shadows and reflections, assuming that this does not increase the workload on the application stage.
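
The arithmetic in this example can be checked with a short sketch. It assumes an idealized pipeline in which the stages overlap fully, so the frame time equals the time of the slowest stage. The function names `frame_rate` and `headroom` are invented for illustration, and the stage timings are the ones from the paragraph above.

```python
def frame_rate(stage_ms):
    """Frames per second when the slowest stage sets the frame time."""
    bottleneck_ms = max(stage_ms.values())
    return 1000.0 / bottleneck_ms


def headroom(stage_ms):
    """Extra milliseconds each stage could use without slowing the frame."""
    bottleneck_ms = max(stage_ms.values())
    return {name: bottleneck_ms - ms for name, ms in stage_ms.items()}


# Application stage takes 50 ms; the other stages take 25 ms each.
stages = {"application": 50, "geometry": 25, "rasterization": 25, "pixel": 25}
print(frame_rate(stages))  # 50 ms per frame
print(headroom(stages))    # spare time per stage, in ms
```

Under these assumptions each non-bottleneck stage has 25 ms of spare time per frame, which is the budget that could go to better lighting, shadows, or reflections.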


Compute shaders also change the way we think about bottlenecks and unused resources. For example, if a shadow map is being rendered, vertex and pixel shaders are simple and the GPU computational resources might be underutilized if fixed-function stages such as the rasterizer or the pixel merger become the bottleneck. Overlapping such draws with asynchronous compute shaders can keep the shader units busy when these conditions arise [1884]. Task-based multiprocessing is discussed in the final section of this chapter.


Pipeline optimization is a process in which we first maximize the rendering speed, then allow the stages that are not bottlenecks to consume as much time as the bottleneck. That said, it is not always a straightforward process, as GPUs and drivers can have their own peculiarities and fast paths. When reading this chapter, the dictum

KNOW YOUR ARCHITECTURE



should always be in the back of your mind, since optimization techniques vary greatly for different architectures. That said, be wary of optimizing based on a specific GPU’s implementation of a feature, as hardware can and will change over time [530]. A related dictum is, simply,

MEASURE, MEASURE, MEASURE.



18.1 Profiling and Debugging Tools


Profiling and debugging tools can be invaluable in finding performance problems in your code. Capabilities vary and can include:


• Frame capture and visualization. Usually step-by-step frame replay is available, with the state and resources in use displayed.
• Profiling of time spent across the CPU and GPU, including time spent calling the graphics API.

• Shader debugging, and possibly hot editing to see the effects of changing code.
• Use of debug markers set in the application, to help identify areas of code.
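
To make the idea of application-set markers concrete, here is a minimal CPU-side sketch. It is not any real tool's API: `MarkerLog` and `region` are hypothetical names, and the timings are wall-clock CPU times rather than GPU times. Actual graphics debuggers pick up markers that the application emits through the graphics API, but the nesting pattern is similar.

```python
import time
from contextlib import contextmanager


class MarkerLog:
    """Collects named, nested regions with their elapsed CPU time."""

    def __init__(self):
        self.events = []  # (nesting depth, label, elapsed milliseconds)
        self._depth = 0

    @contextmanager
    def region(self, label):
        depth = self._depth
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self._depth -= 1
            # A region is recorded when it closes, so children come first.
            self.events.append((depth, label, elapsed_ms))


log = MarkerLog()
with log.region("frame"):
    with log.region("shadow pass"):
        sum(range(10000))  # stand-in for real work
    with log.region("main pass"):
        sum(range(10000))

for depth, label, elapsed_ms in log.events:
    print("  " * depth + f"{label}: {elapsed_ms:.3f} ms")
```

Labeling passes this way makes it easier to match a slow region in a capture or trace back to the code that issued it.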





Profiling and debugging tools vary with the operating system, the graphics API, and often the GPU vendor. There are tools for most combinations, and that’s why the gods created Google. That said, we will mention a few package names specifically for interactive graphics to get you started on your quest:


• RenderDoc is a high-quality Windows debugger for DirectX, OpenGL, and Vulkan, originally developed by Crytek and now open source.
• GPU PerfStudio is AMD’s suite of tools for their graphics hardware offerings, working on Windows and Linux. One notable tool provided is a static shader analyzer that gives performance estimates without needing to run the application. AMD’s Radeon GPU Profiler is a separate, related tool.
• NVIDIA Nsight is a performance and debugging system with a wide range of features. It integrates with Visual Studio on Windows and Eclipse on Mac OS and Linux.
• Microsoft’s PIX has long been used by Xbox developers and has been brought back for DirectX 12 on Windows. Visual Studio’s Graphics Diagnostics can be used with earlier versions of DirectX.
• GPUView from Microsoft uses Event Tracing for Windows (ETW), an efficient event logging system. GPUView is one of several programs that are consumers of ETW sessions. It focuses on the interaction between CPU and GPU, showing which is the bottleneck [783].
• Graphics Performance Analyzers (GPA) is a suite from Intel, not specific to their graphics chips, that focuses on performance and frame analysis.
• Xcode on OSX provides Instruments, which has several tools for timing, performance, networking, memory leaks, and more. Worth mentioning are OpenGL ES Analysis, which detects performance and correctness problems and proposes solutions, and Metal System Trace, which provides tracing information from the application, driver, and GPU.



These are the major tools that have existed for a few years. That said, sometimes no tool will do the job. Timer query calls are built into most APIs to help profile a GPU’s performance. Some vendors provide libraries to access GPU counters and thread traces as well.


