Measuring Performance MS and FPS

https://catlikecoding.com/unity/tutorials/basics/measuring-performance/

Use game window statistics, frame debugger, and profiler.
Compare dynamic batching, GPU instancing, and the SRP batcher.
Display a frame rate counter.
Cycle through functions automatically.
Smoothly transition between functions.

This is the fourth tutorial in a series about learning the basics of working with Unity (https://catlikecoding.com/unity/tutorials/basics/). It is an introduction to measuring performance. We will also add the ability to morph from one function to another to our function library.

This tutorial is made with Unity 2019.4.12f1.

1 Profiling Unity
Unity continuously renders new frames. To make anything that moves appear fluid it has to do this fast enough that we perceive the sequence of images as continuous motion. Typically 30 frames per second (FPS for short) is the minimum to aim for and 60 FPS is ideal.
These numbers appear often because many devices have a display refresh rate of 60 hertz. You cannot draw frames faster than that without turning VSync off, which will cause image tearing. If a consistent 60 FPS cannot be achieved then the next best rate is 30 FPS, which is one frame per two display refreshes. One step lower would be 15 FPS, which is insufficient for fluid motion.

Whether a target frame rate can be achieved depends on how long it takes to process individual frames. To reach 60 FPS we must update and render each frame in less than 16.67 milliseconds. The time budget for 30 FPS is 33.33 ms per frame.
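The budgets above are simply 1000 ms divided by the target frame rate. A minimal sketch of that arithmetic, using a hypothetical helper name (frame_budget_ms) that is not part of Unity or the tutorial:

```python
def frame_budget_ms(target_fps: float) -> float:
    """Return the time available per frame, in milliseconds."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return 1000.0 / target_fps


# Budgets for the frame rates mentioned in the text.
for fps in (60, 30, 15):
    print(f"{fps} FPS -> {frame_budget_ms(fps):.2f} ms per frame")
# 60 FPS -> 16.67 ms per frame
# 30 FPS -> 33.33 ms per frame
# 15 FPS -> 66.67 ms per frame
```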

When our graph is running we can get a sense of how smooth its motion is by simply observing it, but this is a very imprecise way to measure its performance. If motion appears smooth then it probably exceeds 30 FPS, and if it appears to stutter it’s probably less than that. It might also be smooth one moment and stutter the next, due to inconsistent performance. This can be caused by variation in our app, but also by other apps running on the same device. If we barely reached 60 FPS then we could end up going back and forth between 30 FPS and 60 FPS rapidly, which would feel jittery despite a high average FPS. So to get a good idea of what’s going on we have to measure performance more precisely. Unity has a few tools to help us with this.
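The back-and-forth between 30 FPS and 60 FPS can be illustrated with a deliberately simplified model of VSync. It assumes a finished frame is always held until the next display refresh, which is an assumption made for this sketch and not a description of how Unity or any driver schedules frames. The function name vsync_fps and the sample frame times are made up for illustration.

```python
import math


def vsync_fps(frame_ms: float, refresh_hz: float = 60.0) -> float:
    """Effective frame rate when a frame waits for the next display refresh.

    Simplified model (an assumption): a frame that takes longer than one
    refresh interval occupies a whole number of refresh intervals.
    """
    refresh_ms = 1000.0 / refresh_hz
    refreshes_needed = max(1, math.ceil(frame_ms / refresh_ms))
    return refresh_hz / refreshes_needed


# Hypothetical frame times hovering around the 16.67 ms budget.
frame_times_ms = [16.0, 17.0, 16.2, 17.5]
print([vsync_fps(t) for t in frame_times_ms])
# [60.0, 30.0, 60.0, 30.0]
```

The average of these frame times is close to the 60 FPS budget, yet in this model every other frame is displayed at the 30 FPS rate, which is the jitter described above.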

1.1 Game Window Statistics
The game window has a Statistics overlay panel that can be activated via its Stats toolbar button. It displays measurements taken for the last rendered frame. It doesn’t tell us much, but it’s the simplest tool that we can use to get an indication of what’s going on. While in edit mode the game window usually updates only sporadically, after something changed. In play mode it refreshes all the time.

The following statistics are for our graph with the torus function and resolution at 100, using the default render pipeline, which I’ll refer to as DRP from now on. I have VSync turned on for the game window, so refreshes are synchronized with my 60 Hz display.

The statistics show a frame during which the CPU main thread took 23.6 ms and the render thread took 27.8 ms. You’ll likely get different results, depending on your hardware. In my case it suggests that the entire frame took 51.4 ms to render, but the statistics panel reported 36 FPS, matching the render thread time. The FPS indicator seems to take the worst of both and assumes that matches the frame rate. This is an oversimplification that only takes the CPU side into account, ignoring the GPU and display. The real frame rate is likely lower.
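The relation between the thread times and the reported FPS can be checked with a few lines of arithmetic. This is only a sketch of the guess made above (that the panel uses the slower of the two threads), not a description of how Unity actually computes the indicator. The helper name fps_from_ms is hypothetical.

```python
def fps_from_ms(frame_ms: float) -> float:
    """Convert a frame duration in milliseconds to frames per second."""
    return 1000.0 / frame_ms


main_thread_ms = 23.6    # CPU main thread, from the statistics panel
render_thread_ms = 27.8  # render thread, from the statistics panel

# Guess: the panel reports the rate implied by the slower thread.
reported = fps_from_ms(max(main_thread_ms, render_thread_ms))
# If the two durations were simply added, the rate would be much lower.
summed = fps_from_ms(main_thread_ms + render_thread_ms)

print(round(reported))   # 36
print(round(summed, 1))  # 19.5
```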

What’s a thread?
A thread is a subprocess, in this case of the Unity app. There can be multiple threads running in parallel at the same time. The statistics show how long Unity’s main and render threads were running during the last frame.

Besides the durations and FPS indication the statistics panel also displays various details about what was rendered. There were 30,003 batches, and apparently zero saved by batching. These are draw commands sent to the GPU. Our graph contains 10,000 points, so it appears that each point got rendered three times. That’s once for a depth pass, once for shadow casters (listed separately as well), and once to render the final cube, per point. The other three batches are for additional work like the sky box and shadow processing that is independent of our graph. There were also six set-pass calls, which can be thought of as the GPU getting reconfigured to render in a different way, like with a different material.

If we switch to URP the statistics are different. It renders faster and in this case the main CPU thread is slower than the render thread. It’s easy to guess why: there are only 20,001 batches, 10,000 fewer than for DRP. That’s because URP doesn’t use a separate depth pass for directional shadows. The statistics report zero shadow casters, but that’s because it can only show this number for DRP.
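The batch counts can be reproduced with a simple model: points times passes per point, plus a few batches that do not belong to the graph. The sketch below uses a hypothetical function name, and the single extra batch for URP is an inference from the reported total of 20,001, not something the statistics panel states.

```python
def expected_batches(points: int, passes_per_point: int, other_batches: int) -> int:
    """Batches when every point is drawn individually in every pass."""
    return points * passes_per_point + other_batches


# DRP: depth pass, shadow casters, and the final cube, plus three other batches.
drp = expected_batches(10_000, passes_per_point=3, other_batches=3)
# URP: no separate depth pass; one other batch is assumed from the total.
urp = expected_batches(10_000, passes_per_point=2, other_batches=1)

print(drp, urp, drp - urp)  # 30003 20001 10002
```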

Another strange thing is that a negative number might be shown for Saved by batching. This happens because URP uses the SRP batcher by default and the statistics panel doesn’t understand it. The SRP batcher doesn’t eliminate individual draw commands but can make them much more efficient. To illustrate this select our URP asset and disable SRP Batcher under the Advanced section at the bottom of its inspector.
With the SRP batcher disabled URP performance is much worse.

1.2 Dynamic Batching
Besides the SRP batcher, URP has another toggle for dynamic batching. This is an old technique that dynamically combines small meshes into a single larger one, which then gets rendered instead. Enabling it for URP reduces batches to 10,023 and the statistics panel indicates that 9,978 draws were eliminated.

Statistics for URP with dynamic batching.

In my case the SRP batcher and dynamic batching have comparable performance, because the cube meshes of our graph’s points are ideal candidates for dynamic batching.

The SRP batcher is not available for DRP, but we can enable dynamic batching for it. In this case we can find the toggle in the Other Settings section of the Player project settings, a bit below where we set the color space to linear. It is only visible when no scriptable render pipeline settings are used.

Statistics for DRP (the default render pipeline) with dynamic batching.

Dynamic batching is much more efficient for DRP, eliminating 29,964 batches and reducing them to only 39.
It is an improvement, but still not as fast as URP.
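As a sanity check, the remaining batches plus the batches saved should add up to the unbatched totals reported earlier (20,001 for URP and 30,003 for DRP). A minimal sketch with a hypothetical helper name:

```python
def batches_without_batching(remaining: int, saved: int) -> int:
    """Reconstruct the unbatched count from what the statistics panel shows."""
    return remaining + saved


urp_total = batches_without_batching(remaining=10_023, saved=9_978)
drp_total = batches_without_batching(remaining=39, saved=29_964)

print(urp_total, drp_total)  # 20001 30003

# Fraction of batches eliminated by dynamic batching in each pipeline.
print(f"URP: {9_978 / urp_total:.1%}, DRP: {29_964 / drp_total:.1%}")
# URP: 49.9%, DRP: 99.9%
```

The percentages make the difference concrete: dynamic batching removes about half of the batches under URP and nearly all of them under DRP.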

1.3 GPU Instancing
Another way to improve rendering performance is by enabling GPU instancing. This makes it possible to use a single draw command to tell the GPU to draw many instances of one mesh with the same material, providing an array of transformation matrices and optionally other instance data. In this case we have to enable it per material. Ours has an Enable GPU Instancing toggle for it.

Material with GPU instancing enabled.

URP prefers the SRP batcher over GPU instancing, so to make it work for our points the SRP batcher has to be disabled. We can then see that the number of batches is reduced to just 45, much better than dynamic batching. We will discover the reason for this difference later.

Statistics for URP with GPU instancing.

We could conclude from this data that for URP GPU instancing is best, followed by dynamic batching, and then the SRP batcher. But the difference is small and the indicated FPS is higher than my display refresh rate in all cases, so they seem effectively equivalent for our graph. The only clear conclusion is that using none of them is not a good idea.

For DRP GPU instancing appears to perform a bit better than dynamic batching, and both approaches are a lot better than using neither.

Statistics for DRP with GPU instancing.

1.4 Frame Debugger
The statistics panel can tell us that using dynamic batching is different from using GPU instancing, but it does not tell us why.
To get a better understanding of what’s going on we can use the frame debugger, opened via Window / Analysis / Frame Debugger. When enabled via its toolbar button it shows a list of all draw commands sent to the GPU for the last frame of the game window, grouped under profiling samples. This list is shown on its left side. On its right side details are shown of a specific selected draw command.
Also, the game window shows the progressive draw state until directly after the selected command.

In our case we must be in play mode, because that is when our graph gets drawn. Enabling the frame debugger will pause play mode, which allows us to inspect the draw command hierarchy. Let us first do this for DRP, using neither dynamic batching nor GPU instancing.

Frame debugger for DRP.

We see a total of 30,007 draw calls, more than the statistics panel reported because there are also commands that are not counted as batches, such as clearing a target buffer. The 30,000 draws for our points are individually listed as Draw Mesh Point(Clone), under UpdateDepthTexture, Shadows.RenderShadowMap, and RenderForward.RenderLoopJob.

If we try again with dynamic batching enabled the command structure remains the same, except that each group of 10,000 draws is reduced to twelve Draw Dynamic calls. This is a significant improvement.

DRP with dynamic batching.

And if we use GPU instancing then each group gets reduced to 20 Draw Mesh (instanced) Point(Clone) calls instead. Again a big improvement, but a different approach.

DRP with GPU instancing.

We can see the same happen for URP, but with a different command hierarchy. In this case the points are drawn twice, first under Render Main Shadowmap and again under Render Opaques. A significant difference is that dynamic batching does not appear to work for the shadow map, which explains why it is less effective for URP. We also end up with 22 batches instead of only twelve, indicating that the URP material relies on more mesh vertex data than the standard DRP one, so fewer points fit in a single batch. Unlike dynamic batching, GPU instancing does work for shadows, so it is superior in this case.

URP with nothing, dynamic batching, and GPU instancing.
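The draw counts seen in the frame debugger give a lower bound on how many points the largest batch must contain: if 10,000 points are covered by k draws, at least one draw holds ceil(10,000 / k) points. The sketch below only does that arithmetic. It does not know Unity's actual per-batch limits, and the function name is hypothetical.

```python
import math


def min_largest_batch(points: int, draws: int) -> int:
    """Smallest possible size of the largest batch (pigeonhole bound)."""
    return math.ceil(points / draws)


points = 10_000
print(min_largest_batch(points, 12))  # DRP dynamic batching: 834
print(min_largest_batch(points, 22))  # URP dynamic batching: 455
print(min_largest_batch(points, 20))  # DRP GPU instancing:   500
```

The lower figure for URP dynamic batching is consistent with the observation that fewer points fit in a single batch there.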

Finally, with the SRP batcher enabled, drawing 10,000 points gets listed as 11 SRP Batch commands, but keep in mind that these are still individual draw calls, just very efficient ones.

URP with SRP batcher.

1.5 An Extra Light
The results that we got so far are for our graph, with a single directional light, and the other project settings that we use. Let us see what happens when we add a second light to the scene, specifically a point light via GameObject / Light / Point Light. Set its position to zero and make sure that it does not cast shadows, which is its default behavior. DRP supports shadows for point lights, but URP still does not.

Point light without shadows at origin.

With the extra light DRP now draws all points an additional time. The frame debugger shows us that RenderForward.RenderLoopJob renders twice as much as before. Even worse, dynamic batching now only works for the depth and shadow passes, not the forward passes.

This happens because DRP draws each object once per light. It has a main pass that works with a single directional light, followed by additional passes that are rendered on top of it. This is because it is an old-fashioned forward-additive render pipeline. Dynamic batching cannot handle these different passes, so it does not get used.

The same is true for GPU instancing, except that it still works for the main pass. Only the additional light passes do not benefit from it.

DRP with GPU instancing.

The second light appears to make no difference for URP, because it is a modern forward renderer that applies all lighting in a single pass. So the command list remains the same, even though the GPU needs to perform more lighting calculations per draw.
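The difference between the two pipelines can be sketched as a rough count of per-point draws. The model below is an illustration only: it ignores batching and instancing, assumes every light affects every point, and assumes only the directional light casts shadows. The function names are hypothetical.

```python
def drp_point_draws(points: int, lights: int) -> int:
    """Forward-additive model: one forward draw per point per light."""
    depth = points             # depth pass
    shadows = points           # shadow casters for the directional light
    forward = points * lights  # main pass plus one additive pass per extra light
    return depth + shadows + forward


def urp_point_draws(points: int, lights: int) -> int:
    """Single-pass forward model: the light count does not add draws."""
    shadows = points  # main shadow map
    opaque = points   # all lighting applied in one pass
    return shadows + opaque


for lights in (1, 2):
    print(lights, drp_point_draws(10_000, lights), urp_point_draws(10_000, lights))
# 1 30000 20000
# 2 40000 20000
```

In this model the URP count stays flat because the extra lighting work moves into each draw instead of adding more draws.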

These conclusions are for a single extra light that affects all points. If you add more lights and move them such that different points are affected by different lights, things get more complicated and batches can get split up when GPU instancing is used. What is true for a simple scene might not be true for a complex one.

1.6 Profiler
To get a better idea of what’s happening on the CPU side we can open the profiler window. Turn off the point light and open the window via Window / Analysis / Profiler. It will record performance data while in play mode and store it for later inspection.

The profiler is split into two sections. Its top portion contains a list of modules that show various performance graphs.
The top one is CPU Usage, which is what we will focus on. With that module selected the bottom part of the window shows a detailed breakdown of a frame that we can select in the graph.

The default bottom view for CPU usage is the timeline. It visualizes how much time was spent on what during a frame. It shows that each frame begins with PlayerLoop, which spends most of its time invoking
