AMD OpenCL Programming Guide - OpenCL Architecture


https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-programming-guide.html

5. Memory Architecture and Access

OpenCL has four memory domains: private, local, global, and constant; the AMD Compute Technology system also recognizes host (CPU) and PCI Express® (PCIe® ) memory.

Type and Description
private
Specific to a work-item; it is not visible to other work-items.

local
Specific to a work-group; accessible only by work-items belonging to that work-group.

global
Accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).

constant
Read-only region for host-allocated and -initialized objects that are not changed during kernel execution.

host (CPU)
Host-accessible region for an application’s data structures and program data.

PCIe
Part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device. Modifying this memory requires synchronization between the GPU compute device and the CPU.


Figure 5.1 Interrelationship of Memory Domains for `Southern Islands` Devices


Figure 5.2 Dataflow between Host and GPU

There are two ways to copy data from the host to the GPU compute device memory:

  • Implicitly by using clEnqueueMapBuffer and clEnqueueUnmapMemObject.
  • Explicitly through clEnqueueReadBuffer, clEnqueueWriteBuffer, clEnqueueReadImage, and clEnqueueWriteImage.
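The two paths can be sketched in host code as follows. This is a minimal sketch, not a complete program: `queue` and `buf` are assumed to be a valid command queue and buffer, and all error handling is omitted.

```c
#include <CL/cl.h>
#include <string.h>

void copy_explicit(cl_command_queue queue, cl_mem buf,
                   const void *src, size_t size)
{
    /* Explicit path: a blocking write copies `src` into the buffer. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, src, 0, NULL, NULL);
}

void copy_implicit(cl_command_queue queue, cl_mem buf,
                   const void *src, size_t size)
{
    /* Implicit path: map the buffer into the host address space,
       fill it, then unmap to make the data visible to the device. */
    cl_int err;
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, size, 0, NULL, NULL, &err);
    memcpy(p, src, size);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
```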

When using these interfaces, it is important to consider the amount of copying involved. There is a two-copy process: one copy between host and PCIe memory, and one between PCIe memory and the GPU compute device.

With proper memory transfer management and the use of system pinned memory (host/CPU memory remapped to the PCIe memory space), copying between host (CPU) memory and PCIe memory can be skipped.
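One common way to request pinned host memory in OpenCL is to create a buffer with the CL_MEM_ALLOC_HOST_PTR flag and map it. This is a hedged sketch; whether the runtime actually backs the allocation with pinned memory is an implementation detail of the driver, and error checks are omitted.

```c
#include <CL/cl.h>

/* Sketch: allocate a buffer whose backing store the runtime may place
   in pinned (page-locked) host memory, then map it so the CPU can
   fill it in place. `ctx` and `queue` are assumed valid. */
cl_mem make_pinned_staging(cl_context ctx, cl_command_queue queue,
                           size_t size, void **host_ptr)
{
    cl_int err;
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size,
                                   NULL, &err);
    *host_ptr = clEnqueueMapBuffer(queue, pinned, CL_TRUE,
                                   CL_MAP_READ | CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
    return pinned;
}
```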

Double copying lowers the overall system memory bandwidth. In GPU compute device programming, pipelining and other techniques help reduce these bottlenecks. See the AMD OpenCL Optimization Reference Guide for more specifics about optimization techniques.

Note: page-locked memory and pinned memory are two names for the same thing: host memory locked so that the operating system cannot page it out.


Figure 5.3 Data Transfer Between Host and CUDA Device

5.1 Data Share Operations

Local data share (LDS) is a very low-latency, RAM scratchpad for temporary data located within each compute unit. The programmer explicitly controls all accesses to the LDS. The LDS can thus provide efficient memory access when used as a software cache for predictable re-use of data (such as holding parameters for pixel shader parameter interpolation), as a data exchange machine for the work-items of a work-group, or as a cooperative way to enable more efficient access to off-chip memory.
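As an illustration of the LDS as a data-exchange area for a work-group, the following OpenCL C kernel sketch sums each work-group's elements through `__local` memory. The kernel and argument names are made up for the example; the local buffer is sized by the host (one float per work-item).

```c
// Hypothetical kernel: per-work-group sum staged through the LDS.
__kernel void group_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch)   // LDS, sized by the host
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);          // all LDS stores now visible

    // Tree reduction inside the work-group.
    for (size_t s = lsz / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}
```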


The high-speed write-to-read re-use of the memory space (full gather/read/load and scatter/write/store operations) is especially useful in pre-GCN devices with read-only caches. LDS offers at least one order of magnitude higher effective bandwidth than direct, uncached global memory.


Figure 5.4 High-Level Memory Configuration

Physically located on-chip, directly next to the ALUs, the LDS is approximately one order of magnitude faster than global memory (assuming no bank conflicts).

GCN devices contain 64 kB memory per Compute Unit and allow up to a maximum of 32 kB per workgroup.

The high bandwidth of the LDS memory is achieved not only through its proximity to the ALUs, but also through simultaneous access to its memory banks. Thus, it is possible to concurrently execute 32 write or read instructions, each nominally 32-bits; extended instructions, read2/write2, can be 64-bits each. If, however, more than one access attempt is made to the same bank at the same time, a bank conflict occurs. In this case, for indexed and atomic operations, hardware prevents the attempted concurrent accesses to the same bank by turning them into serial accesses. This decreases the effective bandwidth of the LDS. For maximum throughput (optimal efficiency), therefore, it is important to avoid bank conflicts. A knowledge of request scheduling and address mapping is key to achieving this.

The LDS sits directly next to the ALUs, and LDS memory is divided into banks.


5.2 Dataflow in Memory Hierarchy


Figure 5.5 Memory Hierarchy Dataflow in pre-GCN devices

To load data into LDS from global memory, it is read from global memory and placed into the work-item's registers; then, a store is performed to LDS. Similarly, to store data into global memory, data is read from LDS and placed into the work-item's registers, then placed into global memory. To make effective use of the LDS, an algorithm must perform many operations on what is transferred between global memory and LDS. It also is possible to load data from a memory buffer directly into LDS, bypassing VGPRs.

VGPR: vector general-purpose register
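The direct global-to-LDS path mentioned above is exposed in OpenCL C through the built-in `async_work_group_copy`. A minimal kernel sketch (kernel and argument names are illustrative):

```c
// Hypothetical kernel: stage a tile of `in` into the LDS without
// explicitly routing each element through a work-item's VGPRs.
__kernel void stage_tile(__global const float *in,
                         __global float *out,
                         __local float *tile)   // LDS, one float per work-item
{
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * lsz;

    event_t e = async_work_group_copy(tile, in + base, lsz, 0);
    wait_group_events(1, &e);      // tile[] now valid for the whole group

    // ... operate on tile[] here ...
    out[base + get_local_id(0)] = tile[get_local_id(0)] * 2.0f;
}
```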

LDS atomics are performed in the LDS hardware. (Thus, although ALUs are not directly used for these operations, latency is incurred by the LDS executing this function.) If the algorithm does not require write-to-read reuse (the data is read only), it usually is better to use the image dataflow (see right side of Figure 5.5) because of the cache hierarchy.

Actually, buffer reads may use L1 and L2. When caching is not used for a buffer, reads from that buffer bypass L2. After a buffer read, the line is invalidated; then, on the next read, it is read again (from the same wavefront or from a different clause). After a buffer write, the changed parts of the cache line are written to memory.

Buffers and images are written through the texture L2 cache, but this is flushed immediately after an image write.

In GCN devices, both reads and writes happen through L1 and L2.

The data in private memory is first placed in registers. If more private memory is used than can be placed in registers, or dynamic indexing is used on private arrays, the overflow data is placed (spilled) into scratch memory. Scratch memory is a private subset of global memory, so performance can be dramatically degraded if spilling occurs.
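As a sketch of what triggers spilling, the kernel below indexes a private array with a runtime value, which the compiler generally cannot map onto named registers. The kernel is illustrative only; whether spilling actually occurs depends on the compiler and on register pressure.

```c
// Illustrative only: whether tmp[] spills depends on the compiler.
__kernel void spill_demo(__global const float *in,
                         __global float *out,
                         int k)                  // runtime value
{
    float tmp[64];
    size_t gid = get_global_id(0);

    for (int i = 0; i < 64; ++i)     // constant indices: register-friendly
        tmp[i] = in[gid * 64 + i];

    // Dynamic index: `k` is unknown at compile time, so tmp[] is
    // typically placed in scratch memory, a private slice of global
    // memory, and every access pays global-memory latency.
    out[gid] = tmp[k & 63];
}
```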

Global memory can be in the high-speed GPU memory (VRAM) or in the host memory, which is accessed by the PCIe bus. A work-item can access global memory either as a buffer or a memory object. Buffer objects are generally read and written directly by the work-items. Data is accessed through the L2 and L1 data caches on the GPU. This limited form of caching provides read coalescing among work-items in a wavefront. Similarly, writes are executed through the texture L2 cache.

Global atomic operations are executed through the texture L2 cache. Atomic instructions that return a value to the kernel are handled similarly to fetch instructions: the kernel must use S_WAITCNT to ensure the results have been written to the destination GPR before using the data.


5.3 Memory Access

Using local memory (known as local data store, or LDS, as shown in Figure 5.1) typically is an order of magnitude faster than accessing host memory through global memory (VRAM), which is one order of magnitude faster again than PCIe. However, stream cores do not directly access memory; instead, they issue memory requests through dedicated hardware units. When a work-item tries to access memory, the work-item is transferred to the appropriate fetch unit. The work-item then is deactivated until the access unit finishes accessing memory. Meanwhile, other work-items can be active within the compute unit, contributing to better performance. The data fetch units handle three basic types of memory operations: loads, stores, and streaming stores. GPU compute devices can store writes to random memory locations using global buffers.

5.4 Global Memory

The global memory lets applications read from, and write to, arbitrary locations in memory. When using global memory, such read and write operations from the stream kernel are done using regular GPU compute device instructions with the global memory used as the source or destination for the instruction. The programming interface is similar to load/store operations used with CPU programs, where the relative address in the read/write buffer is specified.

When using a global memory, each work-item can write to an arbitrary location within it. Global memory uses a linear layout. If consecutive addresses are written, the compute unit issues a burst write for more efficient memory access. Only read-only buffers, such as constants, are cached.
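Writing consecutive addresses simply means making work-item `i` touch element `i`. A sketch contrasting a burst-friendly store pattern with a strided one (STRIDE is an illustrative compile-time constant):

```c
// Coalesced: adjacent work-items write adjacent addresses, so the
// compute unit can issue burst writes.
__kernel void write_coalesced(__global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = (float)gid;
}

// Strided: adjacent work-items write STRIDE elements apart, which
// breaks the bursts into many smaller transactions.
#define STRIDE 16
__kernel void write_strided(__global float *out)
{
    size_t gid = get_global_id(0);
    out[gid * STRIDE] = (float)gid;
}
```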

5.5 Image Read/Write

Image reads are done by addressing the desired location in the input memory using the fetch unit. The fetch units can process either 1D or 2D addresses. These addresses can be normalized or un-normalized. Normalized coordinates are between 0.0 and 1.0 (inclusive). For the fetch units to handle 2D addresses and normalized coordinates, pre-allocated memory segments must be bound to the fetch unit so that the correct memory address can be computed. For a single kernel invocation, up to 128 images can be bound at once for reading, and eight for writing. The maximum number of addresses is 8192x8192 for Evergreen and Northern Islands-based devices, and 16384x16384 for SI-based products.

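Normalized versus un-normalized addressing is selected through the sampler. A minimal OpenCL C sketch of an image read with normalized coordinates (kernel and argument names are illustrative):

```c
// Sampler: normalized [0.0, 1.0] coordinates, edge-clamped, bilinear.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_TRUE |
                           CLK_ADDRESS_CLAMP_TO_EDGE  |
                           CLK_FILTER_LINEAR;

__kernel void sample_image(__read_only image2d_t img,
                           __global float4 *out,
                           int width, int height)
{
    int2 gid = (int2)(get_global_id(0), get_global_id(1));

    // Map the pixel position to normalized [0,1] texel-center coordinates.
    float2 uv = (float2)((gid.x + 0.5f) / width,
                         (gid.y + 0.5f) / height);

    out[gid.y * width + gid.x] = read_imagef(img, smp, uv);
}
```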

Image reads are cached through the texture system (corresponding to the L2 and L1 caches).

References

https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-optimization.html
https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-programming-guide.html
https://wiki.gentoo.org/wiki/AMDGPU
https://yongqiang.blog.csdn.net/article/details/122541573
https://yongqiang.blog.csdn.net/article/details/78590329

AMD APP SDK OpenCL User Guide
https://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
