Real-Time Rendering: 18.3 Performance Measurements

椰子糖莫莫

Published 2022-10-24 09:25:08

Article link: https://blog.csdn.net/m0_37609239/article/details/127357751

To optimize we need to measure. Here we discuss different measures of GPU speed. Graphics hardware manufacturers used to present peak rates such as vertices per second and pixels per second, which were at best hard to reach. Also, since we are dealing with a pipelined system, true performance is not as simple as listing these kinds of numbers: the location of the bottleneck may move over time, and the different pipeline stages interact in different ways during execution. Because of this complexity, GPUs are marketed in part on their physical properties, such as the number and clock rate of cores, memory size, speed, and bandwidth.


All that said, GPU counters and thread traces, if available, are important diagnostic tools when used well. If the peak performance of some given part is known and the measured count is lower, then this area is unlikely to be the bottleneck. Some vendors present counter data as a utilization percentage for each stage. These values are gathered over a given time period during which the bottleneck can move, and so are not perfect, but they help considerably in finding the bottleneck.

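The counter reasoning above can be sketched in a few lines of Python. This is only an illustration: the stage names, counter readings, and peak rates are hypothetical, and real tools report such data in vendor-specific ways. The sketch divides each measured count by its known peak and treats the stage closest to its peak as the most likely bottleneck.

```python
def stage_utilization(counts, peaks):
    """Return utilization (0..1) per stage as measured count / known peak rate."""
    return {stage: counts[stage] / peaks[stage] for stage in counts}


def likely_bottleneck(utilization):
    """Return the stage with the highest utilization over the sampled period."""
    return max(utilization, key=utilization.get)


# Hypothetical counter readings and peak rates, in the same units per stage.
counts = {"vertex": 1.2e9, "raster": 3.0e9, "pixel": 9.5e9}
peaks = {"vertex": 6.0e9, "raster": 8.0e9, "pixel": 10.0e9}

util = stage_utilization(counts, peaks)
suspect = likely_bottleneck(util)  # the stage running closest to its peak
```

Because the readings cover a period during which the bottleneck can move, the result is a hint about where to look first, not a verdict.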

For physical properties such as clock rate and memory bandwidth, more is better, but even seemingly simple physical measurements can be difficult to compare precisely. For example, the clock rate for the same GPU can vary among IHV partners, as each has its own cooling solution and so overclocks its GPUs to what it considers safe. Even FPS benchmark comparisons on a single system are not always as simple as they sound. NVIDIA’s GPU Boost [1666] and AMD’s PowerTune [31] technology are good examples of our dictum “know your architecture.” NVIDIA’s GPU Boost arose in part because some synthetic benchmarks worked many parts of the GPU’s pipeline simultaneously and so pushed power usage to the limit, meaning that NVIDIA had to lower its base clock rate to keep the chip from overheating. Many applications do not exercise all parts of the pipeline to such an extent, so they can safely be run at a higher clock rate. The GPU Boost technology tracks GPU power and temperature characteristics and adjusts the clock rate accordingly. AMD and Intel have similar power/performance optimizations with their GPUs. This variability can cause the same benchmark to run at different speeds, depending on the initial temperature of the GPU. To avoid this problem, Microsoft provides a method in DirectX 12 to lock the GPU core clock frequency in order to get stable timings [121]. Examining power states is possible for other APIs, but is more complex [354].


When it comes to measuring performance for CPUs, the trend has been to avoid IPS (instructions per second), FLOPS (floating point operations per second), gigahertz, and simple short benchmarks. Instead, the preferred method is to measure wall clock times for a range of different, real programs [715], and then compare the running times for these. Following this trend, most independent graphics benchmarks measure the actual frame rate in FPS for several given scenes, and for a variety of different screen resolutions, along with antialiasing and quality settings. Many graphics-heavy games include a benchmarking mode or have one created by a third party, and these benchmarks are commonly used in comparing GPUs.

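As a rough illustration of the wall-clock approach, the sketch below times a set of workloads over several runs and reports the best and mean time for each. The `busy_work` function and the two workload sizes are hypothetical stand-ins; a real comparison would time actual programs or benchmark scenes.

```python
import time


def time_workload(workload, runs=5):
    """Return (best, mean) wall-clock time in milliseconds over several runs."""
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return min(times_ms), sum(times_ms) / len(times_ms)


def busy_work(n):
    """Hypothetical stand-in for a real program: sum the first n squares."""
    total = 0
    for i in range(n):
        total += i * i
    return total


workloads = {
    "small": lambda: busy_work(10_000),
    "large": lambda: busy_work(200_000),
}
results = {name: time_workload(fn) for name, fn in workloads.items()}
for name, (best_ms, mean_ms) in results.items():
    print(f"{name}: best {best_ms:.3f} ms, mean {mean_ms:.3f} ms")
```

Reporting both the best and the mean time gives a sense of how noisy the measurements are on a given machine.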

While FPS is a useful shorthand for comparing GPUs running benchmarks, it should be avoided when analyzing a series of frame rates. The problem with FPS is that it is a reciprocal measure, not a linear one, and so can lead to analysis errors. For example, imagine you find the frame rate of your application at different times is 50, 50, and 20 FPS. If you average these values you get 40 FPS. That value is misleading at best. These frame rates translate to frame times of 20, 20, and 50 milliseconds, so the average frame time is 30 ms, which is 33.3 FPS. Similarly, milliseconds are pretty much required when measuring the performance of individual algorithms. For a specific benchmarking situation with a given test and a given machine, it is possible to say that some particular shadow algorithm or post-process effect “costs” 7 FPS, and that the benchmark ran this much slower. However, it is meaningless to generalize this statement, since this value also depends on how much time it takes to process everything else in the frame and because you cannot add together the FPS of different techniques (but you can add times) [1378].

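The arithmetic above is easy to check in code. The following Python sketch reproduces the 50, 50, 20 FPS example and then shows why a cost quoted in FPS does not generalize. The helper names and the 60 FPS and 30 FPS baselines are illustrative assumptions, not values from the text.

```python
def fps_to_ms(fps):
    """Convert a frame rate (frames per second) to a frame time in milliseconds."""
    return 1000.0 / fps


def average_fps(fps_samples):
    """Average frame rates correctly: average the frame times, then invert.

    Assumes each sample describes one frame, as in the example in the text.
    """
    frame_times_ms = [fps_to_ms(fps) for fps in fps_samples]
    mean_ms = sum(frame_times_ms) / len(frame_times_ms)
    return 1000.0 / mean_ms


def effect_cost_ms(fps_without, fps_with):
    """Frame-time cost in milliseconds of an effect that lowers the frame rate."""
    return fps_to_ms(fps_with) - fps_to_ms(fps_without)


samples = [50, 50, 20]
naive_mean = sum(samples) / len(samples)  # arithmetic mean of FPS: 40.0
true_mean = average_fps(samples)          # mean frame time 30 ms, about 33.3 FPS

# Hypothetical baselines: the same 7 FPS drop is a very different time cost.
cost_at_60 = effect_cost_ms(60, 53)  # roughly 2.2 ms
cost_at_30 = effect_cost_ms(30, 23)  # roughly 10.1 ms
```

Times in milliseconds can be added across techniques and compared across machines with different baselines, which is why they are the safer unit for analysis.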

To be able to see the potential effects of pipeline optimization, it is important to measure the total rendering time per frame with double buffering disabled, i.e., in single-buffer mode by turning vertical synchronization off. This is because with double buffering turned on, swapping of the buffers occurs only in synchronization with the frequency of the monitor, as explained in the example in Section 2.1. De Smedt [331] discusses analyzing frame times to find and fix frame stutter problems from spikes in the CPU workload, as well as other useful tips for optimizing performance. Using statistical analysis is usually necessary. It is also possible to use GPU timestamps to learn what is happening within a frame [1167, 1422].

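One simple form of the statistical analysis mentioned above is to summarize recorded frame times and flag outliers. The sketch below is a minimal example under stated assumptions: the frame times are made-up values, and the rule "a spike is any frame longer than twice the median" is an arbitrary threshold chosen for illustration, not one taken from the cited sources.

```python
import math
import statistics


def frame_time_stats(frame_times_ms, spike_factor=2.0):
    """Summarize frame times and list the indices of frames that look like spikes."""
    ordered = sorted(frame_times_ms)
    median_ms = statistics.median(ordered)
    # Nearest-rank 99th percentile: smallest value with at least 99% of frames at or below it.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    p99_ms = ordered[rank - 1]
    spikes = [i for i, t in enumerate(frame_times_ms) if t > spike_factor * median_ms]
    return {
        "mean_ms": statistics.fmean(frame_times_ms),
        "median_ms": median_ms,
        "p99_ms": p99_ms,
        "spike_frames": spikes,
    }


# Hypothetical capture: nine smooth frames followed by one stutter.
times = [16.7] * 9 + [50.0]
stats = frame_time_stats(times)
```

The mean hides a single stutter among many smooth frames, while the high percentile and the spike list expose it.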

Raw speed is important, but for mobile devices another goal is optimizing power consumption. Purposely lowering the frame rate but keeping the application interactive can significantly extend battery life and have little effect on the user’s experience [1200]. Akenine-Möller and Johnsson [25, 840] note that performance per watt is like frames per second, with the same drawbacks as FPS. They argue a more useful measure is joules per task, e.g., joules per pixel.

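Since energy is power multiplied by time, joules per task can be estimated from an average power reading and a frame time. The sketch below assumes the average power over the frame is already known from some external measurement. The 3 W, 20 ms, and 1000 by 600 pixel figures are hypothetical values chosen to make the arithmetic easy to follow.

```python
def joules_per_frame(average_power_watts, frame_time_ms):
    """Energy used for one frame: power (W) times time (s)."""
    return average_power_watts * (frame_time_ms / 1000.0)


def joules_per_pixel(average_power_watts, frame_time_ms, width, height):
    """Energy used per rendered pixel for one frame at the given resolution."""
    return joules_per_frame(average_power_watts, frame_time_ms) / (width * height)


# Hypothetical measurement: 3 W average draw, 20 ms per frame, 1000 x 600 pixels.
energy_frame = joules_per_frame(3.0, 20.0)              # 0.06 J per frame
energy_pixel = joules_per_pixel(3.0, 20.0, 1000, 600)   # 1e-7 J per pixel
```

Like frame times, energy per task is additive, so the costs of separate techniques can be summed and compared directly.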