Performance Tuning for Tile-Based Architectures

https://docs.unity3d.com/ScriptReference/Rendering.RenderBufferLoadAction.html
OpenGL Insights.pdf
https://docs.imgtec.com/PowerVR_Architecture/topics/powervr_architecture_tile_based_deferred_rendering__tbdr.html
https://gameinstitute.qq.com/community/detail/123220
SRAM and DRAM: https://blog.csdn.net/stpeace/article/details/78307149

23.1 Introduction
The OpenGL and OpenGL ES specifications describe a virtual pipeline in which triangles are processed in order:
the vertices of a triangle are transformed, the triangle is set up and rasterized to produce fragments,
and the fragments are shaded and then written to the framebuffer.
Once this has been done, the next triangle is processed, and so on.

However, this is not the most efficient way for a GPU to work;
GPUs will usually reorder and parallelize things under the hood for better performance.

In this chapter, we will examine tile-based rendering, a particular way to arrange a graphics pipeline that is used in several
popular mobile GPUs.
We will look at what tile-based rendering is and why it is used, and then look at what needs to be done differently
to achieve optimal performance.

I assume that the reader already has experience with optimizing OpenGL applications, is familiar with the standard techniques (reducing state changes, reducing the number of draw calls, reducing shader complexity, and using texture compression), and is looking for advice that is specific to tile-based GPUs.

Keep in mind that every GPU, every driver, and every application is different and will have different performance characteristics. Ultimately, performance tuning is a process of profiling and experimentation.
Thus, this chapter contains very few hard-and-fast rules
but instead tries to illustrate how to estimate the costs associated with different approaches.

This chapter is about maximizing performance, but since tile-based GPUs are currently popular in mobile devices, we will
also briefly mention power consumption.

Many desktop applications will simply render as many frames per second as possible, always consuming 100% of the available processing power.
Deliberately throttling the frame rate to a more modest level, and thus consuming less power, can significantly extend battery life while having relatively little impact on user experience.

Of course, this does not mean that one should stop optimizing after achieving the target frame rate:
further optimizations will then allow the system to spend more time idle and hence reduce power consumption.
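
As a rough illustration of the point above, the following C sketch estimates what share of each frame interval is left idle once the frame rate is capped. The helper name `idle_fraction` and the numbers used with it are hypothetical, and the model assumes idle time maps directly to power savings, which is a simplification.

```c
/* Fraction of each frame interval left idle when the frame rate is capped.
 * Hypothetical helper for illustration only.
 * render_ms:  time spent producing one frame, in milliseconds
 * target_fps: the capped frame rate, assumed to be greater than zero */
double idle_fraction(double render_ms, double target_fps)
{
    double budget_ms = 1000.0 / target_fps;  /* time available per frame */

    if (render_ms >= budget_ms)
        return 0.0;                          /* target missed: no idle time */

    return 1.0 - render_ms / budget_ms;
}
```

With a cap of 50 frames per second (a 20 ms budget), cutting the render time from 10 ms to 5 ms raises the idle share from one half to three quarters, even though the displayed frame rate does not change.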

The main focus of this chapter will be on OpenGL ES since that is the primary market for tile-based GPUs,
but occasionally I will touch on desktop OpenGL features and how they might perform.

23.2 Background
While performance is the main goal for desktop GPUs, mobile GPUs must balance performance against power consumption, i.e., battery life.
One of the biggest consumers of power in a device is memory bandwidth:
computations are relatively cheap, but the further data has to be moved, the more power it takes.

The OpenGL virtual pipeline requires a large amount of bandwidth.
For a fairly typical use case, each pixel will require a read from the depth/stencil buffer, a write back to the depth/stencil buffer, and a write to the color buffer: say 12 bytes of traffic, assuming no overdraw, no blending, no multipass algorithms, and no multisampling. With all the bells and whistles,
one can easily generate over 100 bytes of memory traffic for each displayed pixel.
Since at most 4 bytes of data are needed per displayed pixel, this is an excessive use of bandwidth and hence power.
In reality, desktop GPUs use compression techniques to reduce the bandwidth, but it is still significant.
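
The arithmetic above can be sketched in C. This is a simplified model, not a measurement: it assumes 4 bytes per color value and 4 bytes per depth/stencil value, treats overdraw and multisampling as plain multipliers, and ignores compression and early depth rejection. The function name `traffic_bytes_per_pixel` and the constants are hypothetical.

```c
#include <stdbool.h>

/* Assumed buffer formats: 32-bit color, 32-bit packed depth/stencil. */
enum { BYTES_COLOR = 4, BYTES_DEPTH_STENCIL = 4 };

/* Rough main-memory traffic per displayed pixel on an immediate-mode
 * pipeline. overdraw is the average number of fragments per pixel,
 * samples is the multisample count (1 means no multisampling). */
unsigned traffic_bytes_per_pixel(unsigned overdraw, unsigned samples,
                                 bool blending)
{
    unsigned per_fragment = BYTES_DEPTH_STENCIL   /* depth/stencil read  */
                          + BYTES_DEPTH_STENCIL   /* depth/stencil write */
                          + BYTES_COLOR;          /* color write         */

    if (blending)
        per_fragment += BYTES_COLOR;              /* color read to blend */

    return per_fragment * overdraw * samples;
}
```

Under these assumptions, the baseline case (no overdraw, no blending, no multisampling) gives the 12 bytes quoted above, while blending with 2x overdraw and 4x multisampling already gives 128 bytes.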

https://www.cnblogs.com/resn/p/5766142.html

https://developer.samsung.com/galaxy-gamedev/resources/articles/gpu-framebuffer.html

To reduce this enormous bandwidth demand, many mobile GPUs use tile-based rendering. At the most basic level,
these GPUs move the framebuffer, including the depth buffer, multisample buffers, etc., out of main memory and into high-speed on-chip memory. Since this memory is on-chip, and close to where the computations occur, far less power is required to access it.
If it were possible to place a large framebuffer in on-chip memory, that would be the end of the story; unfortunately, that would take far too much silicon.
The size of the on-chip framebuffer, or tile buffer, varies between GPUs but can be as small as 16x16 pixels.


This poses a new challenge: how can a high-resolution image be produced using such a small tile buffer?
The solution is to break up the OpenGL framebuffer into 16x16 tiles (hence the name "tile-based rendering") and render one tile at a time.
For each tile, all the primitives that affect it are rendered into the tile buffer, and once the tile is complete, it is copied to the more power-hungry main memory, as shown in figure 23.1.
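
To make the tiling step concrete, here is a minimal C sketch that counts the tiles covering a framebuffer and finds which tiles a primitive's bounding box touches. The names `tiles_along`, `tile_count`, `tiles_touched`, and `TileRange` are hypothetical, and real hardware may bin primitives more precisely than a bounding box.

```c
/* Inclusive range of tile indices covered by a bounding box. */
typedef struct {
    unsigned x0, y0;  /* first tile column and row */
    unsigned x1, y1;  /* last tile column and row  */
} TileRange;

/* Number of tiles needed to cover `pixels` along one axis (rounds up). */
unsigned tiles_along(unsigned pixels, unsigned tile_size)
{
    return (pixels + tile_size - 1) / tile_size;
}

/* Total number of tiles covering a width x height framebuffer. */
unsigned tile_count(unsigned width, unsigned height, unsigned tile_size)
{
    return tiles_along(width, tile_size) * tiles_along(height, tile_size);
}

/* Tiles touched by a bounding box given in inclusive pixel coordinates. */
TileRange tiles_touched(unsigned min_x, unsigned min_y,
                        unsigned max_x, unsigned max_y,
                        unsigned tile_size)
{
    TileRange r;
    r.x0 = min_x / tile_size;
    r.y0 = min_y / tile_size;
    r.x1 = max_x / tile_size;
    r.y1 = max_y / tile_size;
    return r;
}
```

For example, a 1920x1080 framebuffer with 16x16 tiles needs 120 x 68 = 8160 tiles, because the last row of tiles is only partly covered by the image.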

The bandwidth advantage comes from only having to write back a minimum set of results:
no depth/stencil values, no overdrawn pixels, and no multisample buffer data.
Additionally, depth/stencil testing and blending are done entirely on-chip.

23.3 Clearing and Discarding the Framebuffer
When it comes to performance tuning, the most important thing to remember about a tile-based GPU is that what is currently being constructed is not a framebuffer but the frame data:
lists of transformed vertices, polygons, and state necessary to produce the framebuffer.
Unlike a framebuffer, these data grow as more draw calls are issued in a frame.
It is thus important to ensure that frames are properly terminated so that the frame data do not grow indefinitely.

Things become more difficult when using framebuffer objects, which do not have a swap operation. Specifically, consider the use of glClear.
Typical desktop GPUs are immediate-mode architectures, meaning that they draw fragments as soon as all the data for a triangle are available.
On an immediate-mode GPU, a call to glClear actually writes values into the framebuffer and thus can be expensive. Programmers use assorted tricks to avoid this, such as not clearing the color buffer if they know it will be completely overwritten, and using half the depth range on alternate frames to avoid clearing the depth buffer.

On a tile-based architecture, avoiding clears can be disastrous for performance:
since the frame is built up in frame data, clearing all buffers will simply free up the existing frame data. In other words, not only is glClear very cheap, it actually improves performance by allowing unneeded frame data to be discarded.

To get the full benefit of this effect, it is necessary to clear everything:
using a scissor or a mask, or clearing only a subset of color, depth, and stencil, will prevent the frame data from being freed. While drivers may detect more cases where clearing can free the frame data, the safest and most portable approach is shown in listing 23.1.
This should be done at the start of each frame, unless the window system already takes care of discarding the framebuffer contents. Of course, the masks and scissor enable do not need to be set explicitly if they are already in the correct state.
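
The "clear everything" rule can be sketched as a small C predicate. It deliberately does not call OpenGL, so it runs without a context; the `ClearState` struct, the `CLEAR_*` flags, and `clear_frees_frame_data` are hypothetical stand-ins that mirror the relevant GL state. It encodes only the conservative rule stated above; an actual driver may recognize more cases. On a real context, the matching state would be set with glDisable(GL_SCISSOR_TEST), glColorMask with GL_TRUE for all four channels, glDepthMask(GL_TRUE), and glStencilMask(0xFFFFFFFF), followed by glClear with the color, depth, and stencil bits together.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags standing in for GL_COLOR_BUFFER_BIT,
 * GL_DEPTH_BUFFER_BIT and GL_STENCIL_BUFFER_BIT. */
enum {
    CLEAR_COLOR   = 1,
    CLEAR_DEPTH   = 2,
    CLEAR_STENCIL = 4,
    CLEAR_ALL     = CLEAR_COLOR | CLEAR_DEPTH | CLEAR_STENCIL
};

/* Snapshot of the state that affects a clear (illustrative only). */
typedef struct {
    bool     scissor_enabled;  /* GL_SCISSOR_TEST enabled?       */
    bool     color_mask[4];    /* as set by glColorMask(r,g,b,a) */
    bool     depth_mask;       /* as set by glDepthMask          */
    uint32_t stencil_mask;     /* as set by glStencilMask        */
} ClearState;

/* True only when the clear is unscissored, unmasked, and covers color,
 * depth, and stencil together: the case the text describes as safe. */
bool clear_frees_frame_data(const ClearState *s, unsigned clear_bits)
{
    if (s->scissor_enabled)
        return false;

    for (int i = 0; i < 4; ++i)
        if (!s->color_mask[i])
            return false;

    if (!s->depth_mask || s->stencil_mask != 0xFFFFFFFFu)
        return false;

    return (clear_bits & CLEAR_ALL) == CLEAR_ALL;
}
```

A debug build could evaluate a check like this at the start of each frame to catch a stray scissor or write mask left over from the previous frame.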

23.4 Incremental Frame Updates
For a 3D view with a moving camera, such as in a first-person shooter game, it is reasonable to expect every pixel to change from frame to frame, and so clearing the framebuffer will not destroy any useful information. For more GUI-like applications, there may be assorted controls or information views that do not change from frame to frame and which do not need to be regenerated.
Application developers using EGL on a tile-based GPU are often surprised to find that the color buffer does not persist from frame to frame. EGL 1.4 allows this to be explicitly requested by setting EGL_SWAP_BEHAVIOR on the surface, but it is not the default on a tile-based GPU since it reduces performance.

To understand why back-buffer preservation reduces performance, consider again how a tile-based GPU composes fragments for a single tile. If the framebuffer is cleared at the start of a frame, the tile buffer need only be initialized to the clear color before fragments are drawn,
