CUDA C++ Programming Guide——编程接口 CUDA Runtime Memory

最新推荐文章于 2024-07-24 03:00:55 发布

肥叔菌

最新推荐文章于 2024-07-24 03:00:55 发布

阅读量5.4k

点赞数 1

分类专栏： # CUDA 文章标签： 1024程序员节

本文链接：https://blog.csdn.net/asmartkiller/article/details/108962073

版权

CUDA 专栏收录该内容

6 篇文章 0 订阅

订阅专栏

CUDA Runtime

运行时在与应用程序链接的cudart库中实现，可以通过cudart.lib或libcudart.a静态地实现，也可以通过cudart.dll或libcudart.so动态地实现。需要cudart.dll和/或cudart.so进行动态链接的应用程序通常会将它们作为应用程序安装包的一部分包含在内。只有在链接到CUDA运行时的同一实例的组件之间传递CUDA运行时符号的地址才是安全的。它的所有入口点都以cuda开头。
如异构编程中提到的那样，CUDA编程模型假定一个由主机和设备组成的系统，每个主机和设备都有各自独立的内存。设备内存概述了用于管理设备内存的运行时功能。共享内存说明了在线程层次结构中引入的共享内存的使用，以最大限度地提高性能。页面锁定主机内存引入了页面锁定主机内存，这是将内核执行与主机和设备内存之间的数据传输重叠所必需的overlap kernel execution with data transfers between host and device memory。异步并发执行描述了用于在系统的各个级别上启用异步并发执行的概念和API。多设备系统展示了编程模型如何扩展到将多个设备连接到同一主机的系统。错误检查介绍了如何正确检查运行时生成的错误。调用堆栈提到用于管理CUDA C ++调用堆栈的运行时函数。纹理和表面内存提供了纹理和表面内存空间，这些空间提供了另一种访问设备内存的方式。它们还公开了一部分GPU纹理化硬件。图形互操作性介绍了运行时提供的与两种主要图形API（OpenGL和Direct3D）互操作的功能。

Initialization

运行时没有显式的初始化函数；它会在首次调用运行时函数时进行初始化（更具体地说，是参考手册中错误处理和版本管理部分中的功能以外的任何功能）。在定时运行时函数调用以及解释从第一次调用到运行时的错误代码时，需要记住这一点。
运行时为系统中的每个设备创建一个CUDA上下文context（有关CUDA上下文的更多详细信息，请参阅上下文）。此上下文是此设备的主要上下文，并在需要此设备上的活动上下文的第一个运行时函数处初始化。 它在应用程序的所有主机线程之间共享。作为此上下文创建的一部分，设备代码会在必要时进行实时编译（请参阅“实时编译”）并加载到设备内存中。这一切都是透明发生的。如果需要，例如对于驱动程序API的互操作性，可以如运行时和驱动程序API的互操作性中所述，从驱动程序API访问设备的主要上下文。
当主机线程调用cudaDeviceReset()时，这会破坏该主机线程当前在其上操作的设备（即，如“设备选择”中定义的当前设备）的主要上下文。具有该设备作为当前设备的任何主机线程进行的下一个运行时函数调用都将为此设备创建一个新的主上下文。
注意：CUDA接口使用全局状态，该状态在主机程序启动期间初始化，并在主机程序终止期间销毁。 CUDA运行时和驱动程序无法检测到该状态是否无效，因此在程序启动或在main之后终止时使用这些接口中的任何一个（隐式或显式）都会导致不确定的行为。

Device Memory

如异构编程中提到的那样，CUDA编程模型假定一个由主机和设备组成的系统，每个主机和设备都有各自独立的内存。内核在设备内存之外运行，因此运行时提供了分配，取消分配和复制设备内存以及在主机内存和设备内存之间传输数据的功能。设备内存可以分配为线性内存(linear memory)或CUDA阵列(CUDA array)。CUDA数组是为纹理获取而优化的不透明内存布局。它们在“纹理和表面内存”中进行了描述。
线性内存分配在一个统一的地址空间中，这意味着单独分配的实体可以通过指针相互引用，例如在二叉树或链表中。地址空间的大小取决于主机系统（CPU）和所用GPU的计算能力。
在这里插入图片描述
注意：在计算能力5.3（Maxwell）和更早版本的设备上，CUDA驱动程序创建未提交的40位虚拟地址保留，以确保内存分配（指针）落入支持的范围内。此保留显示为保留的虚拟内存，但在程序实际分配内存之前不会占用任何物理内存。
线性内存通常使用cudaMalloc()分配，并使用cudaFree()释放，并且主机内存和设备内存之间的数据传输通常使用cudaMemcpy()完成。在内核的向量加法代码示例中，需要将向量从主机存储器复制到设备存储器：

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
            
// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
            
    // Free host memory
    ...
}

线性内存也可以通过cudaMallocPitch()和cudaMalloc3D()进行分配。建议将这些功能用于2D或3D array的分配，因为它可以确保适当地填充分配以满足设备内存访问中描述的对齐要求，从而确保在访问行地址或执行2D array与其他设备内存区域之间的副本时的最佳性能（使用cudaMemcpy2D()和cudaMemcpy3D()函数）。返回的pitch（或步幅stride）必须用于访问数组元素。下面的代码示例分配一个宽度x高度的2D浮点值数组，并显示如何在设备代码中循环遍历数组元素：

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

以下代码示例分配了一个宽度x高度x深度3D浮点值数组，并显示了如何在设备代码中循环遍历数组元素：

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char* devPtr = devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}

注意：为避免分配过多的内存，从而影响系统范围的性能，请根据问题的大小向用户请求分配参数。如果分配失败，则可以回退到其他较慢的内存类型（cudaMallocHost()，cudaHostRegister()等），或返回错误消息，告诉用户需要多少内存，但该内存被拒绝。如果您的应用程序由于某种原因无法请求分配参数，我们建议对支持它的平台使用cudaMallocManaged()。
该参考手册列出了用于在cudaMalloc分配的线性内存，cudaMallocPitch或cudaMalloc3D分配的线性内存，CUDA数组以及为全局或常量存储空间中声明的变量分配的内存之间复制内存的所有各种函数。以下代码示例说明了通过运行时API访问全局变量的各种方式：

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));

__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));

cudaGetSymbolAddress用于检索指向为全局内存空间中声明的变量分配的内存的地址。通过cudaGetSymbolSize获得分配的内存大小。

Device Memory L2 Access Management

当CUDA内核重复访问全局内存中的数据区域时，可以认为此类数据访问是持久的。另一方面，如果仅访问数据一次，则可以将此类数据访问视为流式传输。

从CUDA 11.0开始，计算能力8.0及更高版本的设备具有影响L2缓存中数据持久性的能力，从而有可能提供更高的带宽和更低的对全局内存的延迟访问。

L2 cache Set-Aside for Persisting Accesses

可以留出一部分L2高速缓存以用于持久存储对全局存储器的数据访问。持久性访问优先考虑了L2缓存的预留部分的使用，而当L2部分未通过持久访问时，常规或流式传输对全局内存才可以使用。可以在以下限制内调整用于持久访问的L2缓存预留空间大小：

cudaGetDeviceProperties(&prop, device_id);                
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); /* set-aside 3/4 of L2 cache for persisting accesses or the max allowed*/

将GPU配置为多实例GPU（MIG）模式时，将禁用L2缓存预留功能。使用多进程服务（MPS）时，cudaDeviceSetLimit无法更改L2缓存的预留大小。相反，只能在启动MPS服务器时通过环境变量CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT来指定预留大小。

L2 Policy for Persisting Accesses

访问策略窗口指定了全局存储器的连续区域以及L2高速缓存中的持久性属性，用于该区域内的访问。下面的代码示例显示了如何使用CUDA流设置L2持久访问窗口。

cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss.

//Set the attributes to a CUDA stream of type cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

当内核随后在CUDA流中执行时，与对其他全局内存位置的访问相比，全局内存范围[ptr…ptr + num_bytes]中的内存访问更有可能保留在L2高速缓存中。也可以为CUDA图形内核节点设置L2持久性，如下例所示：

cudaKernelNodeAttrValue node_attribute;                                     // Kernel level attributes data structure
node_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
node_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                            // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
node_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
node_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
node_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss.
                                    
//Set the attributes to a CUDA Graph Kernel node of type cudaGraphNode_t
cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);

hitRatio参数可用于指定接收hitProp属性的访问次数。在上面的两个示例中，全局内存区域（ptr…ptr + num_bytes）中60％的内存访问具有持久性，而40％的内存访问具有流媒体属性。哪些特定的内存访问被归类为持久性存储（hitProp）是随机的，概率约为hitRatio；概率分布取决于硬件体系结构和存储范围。
例如，如果L2预留的高速缓存大小为16KB，而accessPolicyWindow中的num_bytes为32KB：
在hitRatio为0.5的情况下，硬件将随机选择32KB窗口中的16KB作为持久存储并缓存在预留的L2缓存区域中。
使用hitRatio 1.0时，硬件将尝试将整个32KB窗口缓存在预留的L2缓存区域中。由于预留空间小于窗口，因此将逐出缓存行，以将最近使用的32KB数据中的16KB保留在L2缓存的预留部分中。

因此，hitRatio可用于避免缓存行颠簸，并总体上减少了移入和移出L2缓存的数据量。

低于1.0的hitRatio值可用于手动控制来自并发CUDA流的不同accessPolicyWindows可以在L2中缓存的数据量。例如，让L2预留的高速缓存大小为16KB；在争用共享L2资源时，两个不同CUDA流中的两个并发内核（每个都具有16KB的accessPolicyWindow，且两者的hitRatio值为1.0）可能会将彼此的缓存行逐出。但是，如果两个accessPolicyWindows的hitRatio值均为0.5，则它们将不太可能驱逐自己的或彼此持久的缓存行。

L2 Access Properties

为不同的全局内存数据访问定义了三种类型的访问属性：

cudaAccessPropertyStreaming：使用流属性streaming property进行的内存访问不太可能保留在L2高速缓存中，因为这些访问被优先逐出。
cudaAccessPropertyPersisting：具有持久属性发生的内存访问更有可能持久存储在L2缓存中，因为这些访问优先保留在L2缓存的预留部分中。
cudaAccessPropertyNormal：此访问属性将以前应用的持久访问属性强制重置为正常状态。先前的CUDA内核具有持久属性的内存访问可能在其预期用途后很长时间被保留在L2缓存中。这种使用后的持久性减少了不使用持久性的后续内核可使用的L2缓存数量。使用cudaAccessPropertyNormal属性重置访问属性窗口会删除先前访问的持久（优先保留）状态，就好像先前访问没有访问属性一样。

L2 Persistence Example

以下示例显示了如何为持久访问预留L2缓存，如何通过CUDA Stream在CUDA内核中使用预留L2缓存，然后重置L2缓存。

cudaStream_t stream;
cudaStreamCreate(&stream);                                                                  // Create CUDA stream

cudaDeviceProp prop;                                                                        // CUDA device properties variable
cudaGetDeviceProperties( &prop, device_id);                                                 // Query GPU properties
size_t size = min( int(prop.l2CacheSize * 0.75) , prop.persistingL2CacheMaxSize );
cudaDeviceSetLimit( cudaLimitPersistingL2CacheSize, size);                                  // set-aside 3/4 of L2 cache for persisting accesses or the max allowed

size_t window_size = min(prop.accessPolicyMaxWindowSize, num_bytes);                        // Select minimum of user defined num_bytes and max window size.

cudaStreamAttrValue stream_attribute;                                                       // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1);               // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = window_size;                                // Number of bytes for persistence access
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                                        // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;               // Persistence Property
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;                // Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Set the attributes to a CUDA Stream

for(int i = 0; i < 10; i++) {
    cuda_kernelA<<<grid_size,block_size,0,stream>>>(data1);                                 // This data1 is used by a kernel multiple times
}                                                                                           // [data1 + num_bytes) benefits from L2 persistence
cuda_kernelB<<<grid_size,block_size,0,stream>>>(data1);                                     // A different kernel in the same stream can also benefit
                                                                                            // from the persistence of data1

stream_attribute.accessPolicyWindow.num_bytes = 0;                                          // Setting the window size to 0 disable it
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Overwrite the access policy attribute to a CUDA Stream
cudaCtxResetPersistingL2Cache();                                                            // Remove any persistent lines in L2 

cuda_kernelC<<<grid_size,block_size,0,stream>>>(data2);                                     // data2 can now benefit from full L2 in normal mode

Reset L2 Access to Normal

在使用之后很久，来自先前CUDA内核的持久L2缓存行都可能会持久存在于L2中。因此，将L2缓存重置为正常状态对于流或正常内存访问以正常优先级使用L2缓存很重要。持久访问可以通过三种方式重置为正常状态。

使用访问属性cudaAccessPropertyNormal重置先前的持久存储区域。
通过调用cudaCtxResetPersistingL2Cache将所有持久性L2缓存行重置为正常状态。
最终，未触及的线untouched lines会自动重置为正常状态。强烈建议不要依赖自动复位，因为自动复位所需的时间不确定。

Manage Utilization of L2 set-aside cache

在不同CUDA流中同时执行的多个CUDA内核可能具有分配给其流的不同访问策略窗口。但是，L2预留缓存部分在所有这些并发CUDA内核之间共享。结果，该预留缓存部分的净利用率是所有并发内核各自使用的总和。当持久性访问的数量超过预留的L2缓存容量时，将持久性访问指定为持久性的好处就减少了。
要管理预留二级缓存部分的利用率，应用程序必须考虑以下事项：二级预留缓存的大小、可以并发执行的CUDA内核、所有可能并发执行的CUDA内核的访问策略窗口、何时以及如何重新设置二级缓存，以允许正常或流式访问以同等优先级利用先前预留的二级缓存。

Query L2 cache Properties

与L2缓存相关的属性是cudaDeviceProp结构的一部分，可以使用CUDA运行时API cudaGetDeviceProperties查询。CUDA设备属性包括：
l2CacheSize：GPU上的可用二级缓存数量。
persistenceingL2CacheMaxSize：可以为持久内存访问而保留的L2高速缓存的最大数量。
accessPolicyMaxWindowSize：访问策略窗口的最大大小。

Control L2 Cache Set-Aside Size for Persisting Memory Access

使用CUDA运行时API cudaDeviceGetLimit查询用于持久存储访问的L2预留缓存大小，并使用CUDA运行时API cudaDeviceSetLimit作为cudaLimit进行设置。设置此限制的最大值是cudaDeviceProp :: persistingL2CacheMaxSize。

enum cudaLimit {
    /* other fields not shown */
    cudaLimitPersistingL2CacheSize
};

Shared Memory

如可变内存空间说明符中所述，共享内存是使用__shared__内存空间说明符分配的。如线程层次结构中提到并在“共享内存”中详细介绍的那样，共享内存预计比全局内存快得多。如以下矩阵乘法示例所示，它可用作暂存器存储器scratchpad memory（或软件托管的缓存software managed cache），以最大程度地减少对CUDA模块的全局存储器访问。
下面的代码示例是矩阵乘法的直接实现，没有利用共享内存。 每个线程读取A的一行和B的一列，并计算C的相应元素，如图7所示。因此，A是从全局内存中读取B.width倍，而B则是A.height倍。

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

Figure 7. Matrix Multiplication without Shared Memory
下面的代码示例是利用共享内存的矩阵乘法的实现。在这个实现中，每个线程块负责计算C的一个方子矩阵Csub，块中的每个线程负责计算Csub的一个元素。如图8所示，Csub等于两个矩形矩阵的乘积：与Csub具有相同行索引的维数（A.width，block_size）的子矩阵，以及与Csub具有相同列索引的维数为B的子矩阵（block_size，A.width）。为了适应设备的资源，将这两个矩形矩阵划分为尽可能多的维数块大小的方阵，并将Csub计算为这些方阵乘积的和。这些乘积中的每一个都是通过首先将两个对应的方阵从全局内存加载到共享内存，其中一个线程加载每个矩阵的一个元素，然后让每个线程计算乘积的一个元素。每个线程将这些产品的结果累加到一个寄存器中，完成后将结果写入全局内存。
通过这种方式阻塞计算，我们利用了快速共享内存，节省了大量的全局内存带宽，因为a只从全局内存中读取（B.width/block_size）次，而B被读取（a.height/block_size）次。
上一个代码示例中的矩阵类型通过一个步长字段进行扩充，这样子矩阵就可以用相同的类型有效地表示。__device__设备函数用于获取和设置元素，并从矩阵中构建任何子矩阵。

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride; 
    float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
 __device__ Matrix GetSubMatrix(Matrix A, int row, int col) 
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                         + BLOCK_SIZE * col];
    return Asub;
}

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
    cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
 __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;

    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

Matrix Multiplication with Shared Memory

Page-Locked Host Memory

运行时提供的函数允许使用页锁定（也称为固定）主机内存（与malloc分配的常规可分页主机内存不同）:cudaHostAlloc()和cudaFreeHost()分配和释放页锁定主机内存；cudahosteregister()页锁定malloc()分配的内存范围（有关限制，请参阅参考手册）。使用页锁定主机内存有几个好处：

页锁定主机内存和设备内存之间的复制可以与内核执行同时执行，如异步并发执行中所述。
在某些设备上，页锁定的主机内存可以映射到设备的地址空间，从而消除了将其复制到设备内存或从设备内存复制的需要，如映射内存中所述。
在具有前端总线的系统上，如果主机存储器被分配为页锁定，则主机存储器和设备存储器之间的带宽更高；如果另外将主机存储器分配为写入合并，则带宽更高，如写入组合存储器中所述。
但是，页锁定主机内存是一种稀缺资源，因此页锁定内存中的分配将在可分页内存中分配之前很久就开始失败。此外，通过减少操作系统可用于分页的物理内存量，消耗过多的页锁定内存会降低总体系统性能。
注意：页锁定主机内存不会缓存在非I/O一致的Tegra设备上。此外，在非I/O一致性Tegra设备上不支持CudaHosterRegister。简单的零拷贝CUDA示例附带了一个关于页面锁定内存API的详细文档。

Portable Memory

页锁定内存块可以与系统中的任何设备一起使用（有关多设备系统的更多详细信息，请参阅多设备系统），但默认情况下，上面描述的使用页锁定内存的好处仅与分配块时的当前设备一起使用（并且所有设备共享相同的统一地址空间，如果有的话，如在统一虚拟地址空间中所述）。为了使这些优势对所有设备都可用，需要通过将标志cudaHostAllocPortable传递给cudaHostAlloc()来分配块，或者通过将标志cudahosterregisterportable传递给cudahosterregister()来分配块。

Write-Combining Memory

默认情况下，页面锁定的主机内存被分配为可缓存的。通过将标志cudaHostAllocWriteCombined传递给cudaHostAlloc()，可以选择将其分配为写组合。写组合内存释放主机的一级和二级缓存资源，使更多的缓存可用于应用程序的其余部分。此外，在通过PCI Express总线传输时，写入组合内存不会被窥探，这可以将传输性能提高40%。从主机读取写入组合内存的速度非常慢，因此写入组合内存通常应用于主机只写入的内存。

Mapped Memory

页锁定的主机内存块也可以通过传递标记cudaHostAllocMapped给cudaHostAlloc()或传递标记cudaHostRegisterMapped给cudahosterregister()两种方法映射到设备的地址空间。这样的块通常有两个地址：一个在由cudaHostAlloc()或malloc()返回的主机内存中，另一个在设备内存中，可以使用cudaHostGetDevicePointer()检索，然后用于从内核中访问块。唯一的例外是使用cudaHostAlloc()分配的指针以及主机和设备使用统一的地址空间时。
直接从内核中访问主机内存不能提供与设备内存相同的带宽，但确实有一些优点：
不需要在设备内存中分配块，也不需要在该块和主机内存中的块之间复制数据；数据传输是根据内核的需要隐式执行的；
不需要使用流（请参阅并发数据传输）来将数据传输与内核执行重叠；源于内核的数据传输会自动与内核执行重叠。

但是，由于映射页锁定内存在主机和设备之间共享，应用程序必须使用流或事件同步内存访问（请参阅异步并发执行），以避免任何潜在的先读后写、先读后写或先写后写的危险。
要能够检索到指向任何映射页锁定内存的设备指针，必须在执行任何其他CUDA调用之前通过使用cudaDeviceMapHost标志调用cudaSetDeviceFlags()来启用页锁定内存映射。否则，cudaHostGetDevicePointer()将返回错误。
如果设备不支持映射页锁定主机内存，cudaHostGetDevicePointer()也会返回错误。应用程序可以通过检查canMapHostMemory设备属性（请参阅设备枚举）来查询此功能，对于支持映射页锁定主机内存的设备，该属性等于1。
请注意，从主机或其他设备的角度来看，在映射页锁定内存上操作的原子函数（请参阅原子函数）不是原子函数。
还请注意，CUDA运行时要求从设备启动的1字节、2字节、4字节和8字节自然对齐的加载和存储作为从主机和其他设备的角度来看的单一访问。在某些平台上，原子到内存可能会被硬件分解成单独的加载和存储操作。这些组件加载和存储操作对保持自然对齐的访问具有相同的要求。例如，CUDA运行时不支持PCI Express总线拓扑，其中PCI Express网桥将8字节自然对齐的写入拆分为设备和主机之间的两个4字节写入。