Arm Mali GPU Best Practices Developer Guide - Version 2.1

Yongqiang Cheng

Article link: https://blog.csdn.net/chengyq116/article/details/108312706

Mali GPU Best Practice Guide
https://developer.arm.com/solutions/graphics-and-gaming/developer-guides/advanced-guides/mali-gpu-best-practices

8. Compute

This chapter covers how to optimize workgroup sizes, how to use shared memory correctly on a Mali GPU, and efficient ways to process images.

It contains the following sections:

8.1 Workgroup sizes.
8.2 Shared memory.
8.3 Image processing.

8.1 Workgroup sizes

Mali Midgard, Bifrost, and Valhall GPUs have a fixed number of registers available in each shader core. These GPUs split those registers across a variable number of threads, depending on how many registers the shader program needs: the more registers each thread uses, the fewer threads can be resident at once.
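
The register split described above can be sketched with a simple even-split model. This is only an illustration: the function name, the register pool size, and the thread cap are hypothetical placeholders, not figures for any specific Mali GPU.

```python
def resident_threads(register_pool: int, regs_per_thread: int, max_threads: int) -> int:
    """Return how many threads fit in one shader core under an even-split model.

    register_pool and max_threads are hypothetical hardware parameters.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # The pool is divided evenly; the hardware thread cap still applies.
    return min(max_threads, register_pool // regs_per_thread)


# Hypothetical core: 16384 registers, at most 512 resident threads.
POOL, CAP = 16384, 512
light = resident_threads(POOL, 32, CAP)  # shader using 32 registers per thread
heavy = resident_threads(POOL, 64, CAP)  # shader using 64 registers per thread
```

In this model, doubling the register use of the shader halves the number of threads that can be resident, which is the trade-off the rest of this section builds on.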

8.1.1 Prerequisites

You must understand the following concepts:

Shader core resource scheduling.
Workgroups.
Stack memory.

8.1.2 Using workgroups

The GPU hardware can split up, and then merge, workgroups during shader core resource scheduling. If a shader uses barriers or shared memory, the GPU cannot split the workgroup in this way. In that case, all work items in the workgroup must execute concurrently in the same shader core.

In this scenario, large workgroup sizes restrict the number of registers that are available to each work item. This, in turn, forces shader programs to use stack memory if insufficient registers are available.
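
A rough sketch of why larger workgroups push values onto the stack, assuming a simplified model in which one workgroup that uses barriers must fit entirely within a fixed register pool. The pool size and register counts are hypothetical, and both function names are made up for illustration.

```python
def register_budget(register_pool: int, workgroup_size: int) -> int:
    """Registers available per work item when the whole workgroup must be resident."""
    if workgroup_size <= 0:
        raise ValueError("workgroup_size must be positive")
    return register_pool // workgroup_size


def spilled_registers(needed: int, register_pool: int, workgroup_size: int) -> int:
    """Number of values per work item that no longer fit in registers."""
    return max(0, needed - register_budget(register_pool, workgroup_size))


# Hypothetical pool of 4096 registers and a shader that wants 48 per work item.
POOL = 4096
small_group_spill = spilled_registers(48, POOL, 64)   # budget 64, nothing spills
large_group_spill = spilled_registers(48, POOL, 256)  # budget 16, 32 values spill
```

The same shader fits comfortably at a workgroup size of 64 but spills most of its working set at 256, which is why the guide suggests trying smaller sizes first when barriers or shared memory are involved.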

8.1.3 How to optimize the use of workgroup sizes

Try using the following optimization steps:

Use 64 as a baseline workgroup size.
Use a multiple of 4 as a workgroup size.
Try smaller workgroup sizes before larger ones, especially if using barriers or shared memory.
When working with images or textures, use a square execution dimension, for example 8x8, to exploit optimal 2D cache locality.
If a workgroup has per-workgroup work to be done, consider splitting the work into two passes. Doing so avoids barriers, and avoids kernels that contain portions where most threads are idle.
Compute shader performance is not always intuitive, so measure performance after every change.
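
As a small illustration of the 8x8 suggestion above, the following sketch computes how many workgroups are needed to cover an image with a square workgroup of 64 work items. The function name is hypothetical; the resulting counts are the kind of values that would be passed to a dispatch call such as `vkCmdDispatch` or `glDispatchCompute`.

```python
def dispatch_groups(width: int, height: int, local_x: int = 8, local_y: int = 8):
    """Workgroups needed per axis to cover a width x height image.

    Uses ceiling division so partially covered edge tiles still get a workgroup.
    """
    if min(width, height, local_x, local_y) <= 0:
        raise ValueError("all dimensions must be positive")
    groups_x = (width + local_x - 1) // local_x
    groups_y = (height + local_y - 1) // local_y
    return groups_x, groups_y


full_hd = dispatch_groups(1920, 1080)  # both axes divide evenly by 8
odd_size = dispatch_groups(1001, 999)  # edge tiles are only partly covered
```

When the image size is not a multiple of the workgroup size, the kernel needs a bounds check so that work items in the edge tiles do not read or write outside the image.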

8.1.4 Things to avoid when optimizing workgroup sizes

Arm recommends that you:

Do not use more than 64 threads per workgroup.
Do not assume that barriers with small workgroups are free from performance costs.
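
The size-related guidance in this section could be turned into a quick sanity check, as in the sketch below. The function and its messages are hypothetical, and it only encodes the two numeric rules stated in this section (at most 64 threads, and a multiple of 4); it cannot replace measuring on the target device.

```python
def check_workgroup_size(local_x: int, local_y: int = 1, local_z: int = 1) -> list:
    """Return warnings for a workgroup shape, based on the numeric rules in this section."""
    total = local_x * local_y * local_z
    warnings = []
    if total > 64:
        warnings.append(f"{total} threads per workgroup exceeds the recommended 64")
    if total % 4 != 0:
        warnings.append(f"{total} threads per workgroup is not a multiple of 4")
    return warnings


ok = check_workgroup_size(8, 8)          # 64 threads, multiple of 4
too_big = check_workgroup_size(16, 16)   # 256 threads
awkward = check_workgroup_size(3, 3)     # 9 threads
```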

8.1.5 Negative impacts of not using workgroup sizes correctly

You can see the following types of impact:

Large workgroups can starve the shader core of work if a high percentage of work items are waiting on a barrier.
Shaders that spill to the stack incur higher load/store unit utilization, along with a higher external memory bandwidth cost.
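
To get a feel for the bandwidth cost of spilling, the following back-of-the-envelope sketch estimates a worst case in which every spilled 32-bit value is stored once and loaded once, and none of that traffic is absorbed by caches. These are simplifying assumptions rather than a description of how any specific Mali GPU behaves, and the function name is hypothetical.

```python
def spill_traffic_bytes(threads: int, spilled_values: int, bytes_per_value: int = 4) -> int:
    """Worst-case bytes moved if each spilled value is stored once and loaded once."""
    if threads < 0 or spilled_values < 0:
        raise ValueError("counts must be non-negative")
    # One store plus one load per spilled value, per thread.
    return threads * spilled_values * bytes_per_value * 2


# One thread per pixel of a 1920x1080 image, 8 spilled values per thread.
per_dispatch = spill_traffic_bytes(1920 * 1080, 8)
```

Real traffic is likely to be lower because spills can hit in cache, but the estimate shows how quickly a handful of spilled values per thread adds up across a full-screen dispatch.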