cuda sample(2)矩阵乘法

最新推荐文章于 2023-09-25 15:35:14 发布

不爱吃香菇的干饭少年

最新推荐文章于 2023-09-25 15:35:14 发布

阅读量510

点赞数 1

分类专栏： cuda 文章标签：矩阵人工智能线性代数

本文链接：https://blog.csdn.net/qq_43594926/article/details/131583772

版权

cuda 专栏收录该内容

34 篇文章 8 订阅

订阅专栏

// CUDA示例演示了一个__nv_bfloat16 (E8M7) GEMM计算，使用的是CUDA 11.0中引入的Warp Matrix Multiply 和Accumulate API。

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.0\0_Simple\bf16TensorCoreGemm

在这个程序中，compute_gemm内核计算了一个矩阵乘法和加法的结果：

D = alpha * A * B + beta * C. C和D矩阵的尺寸都是M_GLOBAL x N_GLOBAL。 A矩阵是M_GLOBAL x K_GLOBAL（行为主），B矩阵是K_GLOBAL x N_GLOBAL（列为主）。

// 在该内核中，每个CTA每次迭代都计算一个128 x 128的结果矩阵的瓦片。当计算完成后，CTA将其存储到全局存储器中，并开始新的迭代，选择一个新的128 x 128图块进行计算。

// 每个CTA由8个翘板组成。对于128 x 128瓦片，每个经线计算8个16 x 16的子瓦片，组织在一个2 x 4的二维阵列中。
// Warps使用nvcuda::wmma::mma_sync操作计算16 x 16子片，通过移动A和B矩阵的K_GLOBAL维度，在本地线程状态中积累中间结果。

算法中使用了一些简单的优化方法：

CTA将C矩阵的128 x 128瓦片从全局内存复制到共享内存。完成后，每个经线从共享内存中加载C矩阵片段，从而避免了随机的全局内存访问。

// - 在每个内部迭代中，

CTA将A和B矩阵的一部分从全局内存复制到共享内存。之后，CTA中的所有翘曲都会重复使用共享内存中的A和B数据，从而减少从全局内存中复制数据的次数。

A和B矩阵的部分被存储在共享内存中，有一个额外的填充（倾斜），以减少共享内存访问库冲突的数量。(见SKEW_BF16宏定义附近的详细解释）。

当CTA完成计算所得矩阵的瓦片时，每个经线将其子瓦片存储到共享内存中。然后CTA将共享内存的内容复制到全局内存，再次避免了多余的随机全局内存访问。
请注意，

CTA瓦片大小的选择是为了最大限度地提高GPU寄存器的利用率，但要足够谨慎以避免本地内存的使用。

代码片段1

    if (deviceProp.major < 8) {
        printf("bf16TensorCoreGemm requires requires SM 8.0 or higher to use Tensor Cores.  Exiting...\n");
        exit(EXIT_WAIVED);
    }

上面给出的代码片段是一个条件判断语句，用于检查计算设备的主要计算能力（compute capability）是否小于 8。如果计算设备的主要计算能力小于 8，则输出一条错误信息，并调用 exit() 函数退出程序。

具体解释如下：

deviceProp 是一个包含计算设备属性信息的结构体变量，其中 major 字段表示设备的主要计算能力版本号。

8 是一个表示计算能力版本号的阈值，这里用来检查设备的主要计算能力是否小于 8。

printf() 函数用于在标准输出中打印一条错误信息，提示用户计算设备的主要计算能力不满足要求。

exit() 函数被调用，并传递 EXIT_WAIVED 作为参数，表示以被放弃（waived）的方式退出程序。EXIT_WAIVED 是一个特定的退出状态码，用于表示程序因为某种限制条件而被放弃执行。

通过这段代码，如果计算设备的主要计算能力小于 8，程序将在打印错误信息后立即退出，并返回 EXIT_WAIVED 作为退出状态码，以便在后续的处理中识别程序被放弃执行的情况。

代码片段2

#if CPU_DEBUG
    result_hD   = (float*) malloc(sizeof(float) * M_GLOBAL * N_GLOBAL);
    result_host = (float*) malloc(sizeof(float) * M_GLOBAL * N_GLOBAL);
#endif

    __nv_bfloat16 *A = NULL;
    __nv_bfloat16 *B = NULL;
    float *C = NULL;
    float *D = NULL;

    checkCudaErrors(cudaMalloc((void**)&A, sizeof(__nv_bfloat16) * M_GLOBAL * K_GLOBAL));
    checkCudaErrors(cudaMalloc((void**)&B, sizeof(__nv_bfloat16) * N_GLOBAL * K_GLOBAL));
    checkCudaErrors(cudaMalloc((void**)&C, sizeof(float) * M_GLOBAL * N_GLOBAL));
    checkCudaErrors(cudaMalloc((void**)&D, sizeof(float) * M_GLOBAL * N_GLOBAL));

    assert(((unsigned long long)A) % 128 == 0);
    assert(((unsigned long long)B) % 128 == 0);
    assert(((unsigned long long)C) % 128 == 0);
    assert(((unsigned long long)D) % 128 == 0);

CUDA程序中的内存分配和断言操作。以下是对代码的解释：

首先，根据条件 CPU_DEBUG 的值，决定是否需要为 result_hD 和 result_host 分配主机内存空间。这些变量的分配大小根据 M_GLOBAL 和 N_GLOBAL 的值确定。

__nv_bfloat16 是NVIDIA CUDA的浮点类型，用于存储16位浮点数（bfloat16）。A、B、C、D 是指向设备（GPU）内存的指针，用于存储在设备上分配的内存空间的地址

使用 cudaMalloc() 函数为指针变量 A、B、C、D 在设备上分配内存空间。分配的大小根据 M_GLOBAL、N_GLOBAL 和 K_GLOBAL 的值确定。

assert() 是一个宏，用于断言检查。通过断言操作（assert()），确保分配的设备内存地址 A、B、C、D 满足特定的对齐要求。这里的对齐要求是地址能被 128 整除，使用断言来检查是否满足该要求。

这段代码的目的是为后续的CUDA计算做好内存准备工作，并使用断言来检查设备内存的对齐性。通过断言，可以在程序运行时检查内存分配的正确性，确保满足特定的对齐要求，以避免潜在的错误。

代码片段3

    init_host_matrices(A_h, B_h, C_h);

    printf("Preparing data for GPU...\n");

    checkCudaErrors(cudaMemcpy(A, A_h, sizeof(__nv_bfloat16) * M_GLOBAL * K_GLOBAL, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(B, B_h, sizeof(__nv_bfloat16) * N_GLOBAL * K_GLOBAL, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(C, C_h, sizeof(float) * M_GLOBAL * N_GLOBAL, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemset(D, 0, sizeof(float) * M_GLOBAL * N_GLOBAL));

    enum {
        // Compute the right amount of shared memory to request.
        // We need shared memory to hold per-CTA C and D matrix tiles, and to cache per-CTA chunks
        // of the A and B matrices. Therefore, the right amount to request is the maximum of those
        // two numbers.
        SHMEM_SZ = MAX(sizeof(__nv_bfloat16) * (BLOCK_COL_TILES * M) * (CHUNK_K * K + SKEW_BF16) * 2,
                       M * (BLOCK_ROW_WARPS * WARP_ROW_TILES) * N * (BLOCK_COL_WARPS * WARP_COL_TILES) * sizeof(float))
    };

    printf("Required shared memory size: %lu Kb\n", SHMEM_SZ / 1024UL);

以上代码是准备GPU数据和计算所需共享内存大小的代码

init_host_matrices(A_h, B_h, C_h); 是一个函数调用，用于在主机（CPU）上初始化输入数据，将数据存储在相应的主机内存数组 A_h、B_h 和 C_h 中。
printf(“Preparing data for GPU…\n”); 打印一条提示信息，表明正在准备数据以供GPU使用。
cudaMemset() 函数用于将设备（GPU）内存块的内容设置为指定的值。在这里，将设备内存块 D 的内容设置为 0。

enum 声明中，需要指定一个或多个枚举常量，并且每个常量都可以具有一个可选的关联值
enum MyEnum {
VALUE_1,
VALUE_2 = 5,
VALUE_3 = 10
};
在这个示例中，MyEnum 是一个枚举类型，定义了三个枚举常量 VALUE_1、VALUE_2 和 VALUE_3，它们的默认关联值分别为 0、5 和 10。
`

定义了一个枚举常量 SHMEM_SZ，用于计算所需的共享内存大小。共享内存的大小取决于需要存储的矩阵块的数量和大小，以及缓存的数据量。

SHMEM_SZ 是一个常量，表示所需的共享内存大小。
sizeof(__nv_bfloat16) * (BLOCK_COL_TILES * M) * (CHUNK_K * K + SKEW_BF16) * 2 表示为存储每个 CTA 的 C 和 D 矩阵块所需的共享内存大小。其中 BLOCK_COL_TILES 是列方向上的块数，
M 是每个块的行数，
CHUNK_K 是块内每个线程块负责的 K 的大小，K 是矩阵 B 的列数，
SKEW_BF16 是偏移量。
M * (BLOCK_ROW_WARPS * WARP_ROW_TILES) * N * (BLOCK_COL_WARPS * WARP_COL_TILES) * sizeof(float) 表示为缓存每个 CTA 的 A 和 B 矩阵块所需的共享内存大小。
其中 M 是每个块的行数，
BLOCK_ROW_WARPS 是行方向上的线程块数，
WARP_ROW_TILES 是每个线程块的行数，
N 是矩阵 B 的列数，
BLOCK_COL_WARPS 是列方向上的线程块数，
WARP_COL_TILES 是每个线程块的列数。
MAX(a, b) 是一个宏，用于返回 a 和 b 中的较大值。在这里，SHMEM_SZ 的值被设置为上述两个表达式中的较大值，以确保分配的共享内存大小足够容纳所需的数据。

printf(“Required shared memory size: %lu Kb\n”, SHMEM_SZ / 1024UL); 打印所需的共享内存大小，以 KB 为单位。

这段代码的目的是准备数据以供GPU计算使用。它包括将主机上的输入数据拷贝到设备内存中，设置设备内存的初始值，并计算所需的共享内存大小，以确保GPU计算过程中有足够的内存资源。

代码片段4

__host__ void init_host_matrices(__nv_bfloat16 *a, __nv_bfloat16 *b, float *c)
{
    for (int i = 0; i < M_GLOBAL; i++) {
        for (int j = 0; j < K_GLOBAL; j++) {
            a[i*K_GLOBAL+j] = (__nv_bfloat16)(rand() % 3);
        }
    }

    for (int i = 0; i < N_GLOBAL; i++) {
        for (int j = 0; j < K_GLOBAL; j++) {
            b[i*K_GLOBAL+j] = (__nv_bfloat16)(rand() % 3);
        }
    }

    for (int t = 0; t < M_GLOBAL * N_GLOBAL; t++) {
        c[t] =  (float)(rand() % 3);
    }
}

定义了一个名为 init_host_matrices 的函数，用于在主机（CPU）上初始化输入矩阵。

函数的工作流程如下：

函数接收三个参数：指向 a 数组的指针、指向 b 数组的指针、以及指向 c 数组的指针。
函数使用两个嵌套的循环，遍历 a 数组的每个元素，并为每个元素赋予一个随机生成的 __nv_bfloat16 类型的值。

使用 (rand() % 3) 生成一个随机整数，然后将其强制转换为 __nv_bfloat16 类型，并将结果赋值给 a 数组的元素 a[i*K_GLOBAL+j]。
(rand() % 3) 表示取 0 到 2 之间的随机整数。由于 rand() 函数返回一个伪随机数，通过取余操作可以将结果限制在 0、1 和 2 之间。然后，通过将其强制转换为 __nv_bfloat16 类型。
这个过程将随机生成的整数赋值给 a 数组的元素，以初始化主机上的输入矩阵的元素。需要注意的是，根据具体情况，可能需要对生成的随机数进行进一步的缩放或转换，以适应 __nv_bfloat16 类型的范围和精度要求。

函数使用两个嵌套的循环，遍历 b 数组的每个元素，并为每个元素赋予一个随机生成的 __nv_bfloat16 类型的值。
函数使用一个循环，遍历 c 数组的每个元素，并为每个元素赋予一个随机生成的 float 类型的值。

该函数的目的是在主机上生成随机的输入矩阵数据，用于后续在GPU上进行计算。它通过循环遍历每个矩阵元素，并为其赋予随机生成的值，以模拟真实的输入数据。

代码片段5

 cudaEvent_t start, stop;

    checkCudaErrors(cudaEventCreate(&start));    
    checkCudaErrors(cudaEventCreate(&stop));
    checkCudaErrors(cudaEventRecord(start));

cudaEvent_t start, stop; 声明了两个CUDA事件对象 start 和 stop，用于计时和记录时间。
checkCudaErrors(cudaEventCreate(&start)); checkCudaErrors(cudaEventCreate(&stop)); 用于创建CUDA事件对象，并通过调用 cudaEventCreate() 函数将其与相应的事件句柄关联起来。
checkCudaErrors(cudaEventRecord(start)); 用于记录开始时间点。通过调用 cudaEventRecord() 函数将当前时间记录到 start 事件对象中，以用于后续计时。

代码片段6

 kernels selected_kernel = bf16mma_shmem_gemm_async_copy;

    if (checkCmdLineFlag(argc, (const char **)argv, "kernel")) {
        int kernel_number = getCmdLineArgumentInt(argc, (const char **)argv, "kernel");
        if (kernel_number < 3) {
            selected_kernel = (kernels)kernel_number;
        }
        else {
            printf("Error: kernel number should be between 0 to 2, you have entered %d\n", kernel_number);
            exit(EXIT_FAILURE);
        }
    }

kernels selected_kernel = bf16mma_shmem_gemm_async_copy; 声明了一个名为 selected_kernel 的枚举类型变量，并将其初始化为 bf16mma_shmem_gemm_async_copy 枚举常量。

后续的代码段用于检查命令行参数并根据命令行输入选择相应的内核。通过调用 checkCmdLineFlag() 和 getCmdLineArgumentInt() 函数，可以检查命令行中是否指定了一个名为 “kernel” 的参数，并获取对应的内核编号。如果指定的内核编号有效，则将 selected_kernel 更新为对应的内核值；否则，输出错误信息并调用 exit(EXIT_FAILURE) 终止程序。

checkCmdLineFlag() 和 getCmdLineArgumentInt() 是一对用于处理命令行参数的辅助函数，通常用于检查和获取命令行参数的值。这些函数通常与CUDA示例代码和工具一起使用。

checkCmdLineFlag() 函数用于检查命令行参数是否存在。它接受命令行参数的数量、命令行参数的数组以及要检查的参数名称作为参数。如果指定的命令行参数名称在命令行参数中存在，则返回 true；否则返回 false。

getCmdLineArgumentInt() 函数用于获取命令行参数的整数值。它接受命令行参数的数量、命令行参数的数组、要获取的参数名称以及默认值作为参数。如果指定的命令行参数存在并且可以解析为整数，则返回解析后的整数值；否则返回提供的默认值。
这些函数可以用于在CUDA程序中处理命令行参数，例如根据命令行参数选择不同的内核、设置特定的运行模式或调整算法的参数等。通过这些函数，可以轻松地检查和提取命令行参数的值，以便在程序中根据需求进行适当的配置和操作。

以上代码5和6的目的是创建CUDA事件对象并记录开始时间点，并根据命令行参数选择要执行的内核。这样可以在后续的程序执行中测量时间和选择相应的计算内核。

代码片段7

 if ((deviceProp.sharedMemPerMultiprocessor >= SHMEM_SZ) && (selected_kernel != simple_bf16mma_gemm)) {
        printf("Computing using high performance kernel = %d - %s\n", selected_kernel, kernelNames[selected_kernel]);

        switch (selected_kernel)
        {
            case bf16mma_shmem_gemm_async_copy :
            default:
                checkCudaErrors(cudaFuncSetAttribute(compute_bf16gemm_async_copy, cudaFuncAttributeMaxDynamicSharedMemorySize, SHMEM_SZ));
                checkKernelErrors((compute_bf16gemm_async_copy<<<deviceProp.multiProcessorCount*2, THREADS_PER_BLOCK, SHMEM_SZ>>>(A, B, C, D, alpha, beta)));
                break;
            case bf16mma_shmem_gemm :
                checkCudaErrors(cudaFuncSetAttribute(compute_bf16gemm, cudaFuncAttributeMaxDynamicSharedMemorySize, SHMEM_SZ));
                checkKernelErrors((compute_bf16gemm<<<deviceProp.multiProcessorCount*2, THREADS_PER_BLOCK, SHMEM_SZ>>>(A, B, C, D, alpha, beta)));
                break;
        }
#if CPU_DEBUG
        checkCudaErrors(cudaMemcpy(result_hD, D, sizeof(float)*M_GLOBAL*N_GLOBAL, cudaMemcpyDeviceToHost));
#endif
    }
    else {
        dim3 gridDim;
        dim3 blockDim;
     
        // blockDim.x must be a multple of warpSize
        // 128x4 means we have 16 warps and a block computes a 64x64 output tile
        blockDim.x = 128;
        blockDim.y = 4;

        gridDim.x = (M_GLOBAL + (M * blockDim.x / 32 - 1)) / (M * blockDim.x / 32);
        gridDim.y = (N_GLOBAL + N * blockDim.y - 1) / (N * blockDim.y);

        printf("Computing... using simple_wmma_gemm kernel\n");
        simple_wmma_bf16gemm<<<gridDim, blockDim>>>(A, B, C, D, M_GLOBAL, N_GLOBAL, K_GLOBAL, alpha, beta);
#if CPU_DEBUG
        checkCudaErrors(cudaMemcpy(result_hD, D, sizeof(float) * M_GLOBAL * N_GLOBAL, cudaMemcpyDeviceToHost));
#endif
    }

    checkCudaErrors(cudaEventRecord(stop));
    checkCudaErrors(cudaEventSynchronize(stop));

根据条件选择不同的计算内核，并启动相应的CUDA内核函数来进行计算它包含了一些条件分支和CUDA函数调用，以便根据不同的情况执行不同的计算操作，并记录计算时间和拷贝结果数据（根据需要）

首先，代码通过检查设备的共享内存大小和所选的内核类型来决定使用哪个内核进行计算。条件 deviceProp.sharedMemPerMultiprocessor >= SHMEM_SZ 检查每个多处理器可用的共享内存是否足够存储所需的共享内存大小。条件 selected_kernel != simple_bf16mma_gemm 确保选择的内核不是简单的内核类型，则使用高性能的内核进行计算。否则，使用简单的 simple_wmma_bf16gemm 内核进行计算。
在使用高性能内核进行计算时，根据所选的内核类型，使用 cudaFuncSetAttribute() 函数设置内核函数的属性，将共享内存大小 SHMEM_SZ 传递给内核函数。然后，通过调用相应的内核函数来执行计算。
如果启用了 CPU_DEBUG 宏定义，则使用 cudaMemcpy() 将设备上的结果数据 D 拷贝到主机上的 result_hD 数组。
在使用简单内核进行计算时，确定了适当的网格和块的维度。然后，通过调用 simple_wmma_bf16gemm 内核函数来执行计算。

针对 bf16mma_shmem_gemm_async_copy 内核类型，调用了名为 compute_bf16gemm_async_copy 的内核函数。
以下是对代码的解释：
case bf16mma_shmem_gemm_async_copy: 指定了一个内核类型 bf16mma_shmem_gemm_async_copy 的情况。
default: 是默认情况，用于处理未匹配到其他情况的情况。
checkCudaErrors(cudaFuncSetAttribute(compute_bf16gemm_async_copy, cudaFuncAttributeMaxDynamicSharedMemorySize, SHMEM_SZ)); 通过调用 cudaFuncSetAttribute() 函数设置内核函数 compute_bf16gemm_async_copy 的属性，其中包括最大动态共享内存大小。SHMEM_SZ 被用作动态共享内存大小的值。
checkKernelErrors((compute_bf16gemm_async_copy<<<deviceProp.multiProcessorCount*2, THREADS_PER_BLOCK, SHMEM_SZ>>>(A, B, C, D, alpha, beta))); 调用 compute_bf16gemm_async_copy 内核函数，并使用 <<<…>>> 语法指定内核函数的执行配置，其中包括网格大小、线程块大小和动态共享内存大小。

这段代码的目的是针对 bf16mma_shmem_gemm_async_copy 内核类型，执行名为 compute_bf16gemm_async_copy 的内核函数，并通过设置适当的属性和执行配置来优化内核的执行。根据具体的内核类型，可以针对不同的计算需求和硬件架构选择最佳的内核实现。

cudaFuncSetAttribute() 函数

cudaFuncSetAttribute() 函数用于设置 CUDA 内核函数的属性。它允许在运行时为内核函数设置一些额外的属性，以影响其执行行为和性能。
`cudaError_t cudaFuncSetAttribute(
const void* func,
cudaFuncAttribute attr,
int value
);

func：指向要设置属性的内核函数的指针。
attr：要设置的属性类型，是 cudaFuncAttribute 枚举类型的值。常见的属性包括：
cudaFuncAttributeMaxDynamicSharedMemorySize：最大动态共享内存大小。
cudaFuncAttributePreferredSharedMemoryCarveout：首选的共享内存与L1缓存的比例。
cudaFuncAttributeMaxThreadsPerBlock：最大线程数。
value：属性的值，根据不同的属性类型而有所不同。
该函数用于设置内核函数的属性，以便根据应用的需要进行优化和配置。例如，可以使用 cudaFuncAttributeMaxDynamicSharedMemorySize 属性来设置内核函数的最大动态共享内存大小，从而控制内核函数中动态共享内存的使用量。其他属性可以用于控制最大线程数、共享内存与L1缓存的比例等

在计算完成后，通过调用 cudaEventRecord() 记录停止时间，并通过调用 cudaEventSynchronize() 等待事件完成。
最后，根据需要，通过调用 cudaMemcpy() 将设备上的结果数据拷贝到主机上的 result_hD 数组中（仅在定义了 CPU_DEBUG 时才执行）。

代码片段8 compute_bf16gemm_async_copy 内核函数

__global__ void compute_bf16gemm_async_copy(const __nv_bfloat16 *A, const __nv_bfloat16 *B, const float *C, float *D, float alpha, float beta)
{
#if __CUDA_ARCH__ >= 800
    extern __shared__ __nv_bfloat16 shmem[][CHUNK_K * K + SKEW_BF16];

    // Warp and lane identification.
    const unsigned int warpId = threadIdx.x / WARP_SIZE;
    const unsigned int laneId = threadIdx.x % WARP_SIZE;

    // Offset in shared memory from which the B matrix is stored.
    const size_t shmem_idx_b_off = BLOCK_COL_TILES * M;

    // This pointer is used to access the C and D matrix tiles this warp computes.
    float *shmem_warp_tile_ptr = (float*)&shmem[0][0] + (warpId / BLOCK_ROW_WARPS) * SHMEM_STRIDE * N * BLOCK_ROW_WARPS + (warpId % BLOCK_ROW_WARPS) * SHMEM_OFFSET;

    // This pointer is used to stream the C and D matrices block-wide tile to and from shared memory.
    float *shmem_warp_stream_ptr = (float*)&shmem[0][0] + warpId * SHMEM_STRIDE * N;

    // Adjust the beta scaler, as it'll be multiplied by alpha at the end of
    // each tile computation. Technically this is not generally correct (may result
    // in a loss of precision). Zero still needs to be specially handled though.
    beta /= alpha;

#if USE_CPP_API
    nvcuda_namespace::pipeline pipe;
#endif
    // Each CTA slides along the 128 x 128 tiles from the top left corner of the matrix to the
    // right and down, and selects the next tile to compute. Once there's no such tile,
    // all warps in this CTA exit.
    for(unsigned int block_pos = blockIdx.x;; block_pos += gridDim.x) {
        const unsigned int block_tile_i = ((block_pos * BLOCK_ROW_TILES) / N_TILES) * (BLOCK_COL_TILES);
        const unsigned int block_tile_j = (block_pos * BLOCK_COL_TILES) % N_TILES;

        // Stop when there are no more D matrix tiles to compute in this CTA.
        if (block_tile_i >= M_TILES) {
            break;
        }

        // This warp's pointer to the C matrix data to copy memory from to shared memory.
        const size_t gmem_idx = (block_tile_i + warpId) * M * GLOBAL_MEM_STRIDE + block_tile_j * N;
        const float *src_gmem_warp_stream_ptr = &C[gmem_idx];

        // Stream multiple C tiles to shared memory.
#pragma unroll
        for (int i = 0; i < N; i++) {
#if USE_CPP_API
            nvcuda_namespace::memcpy_async(*((int4*)(shmem_warp_stream_ptr + SHMEM_STRIDE * i) + laneId),
                                            *((int4*)(src_gmem_warp_stream_ptr + GLOBAL_MEM_STRIDE * i) + laneId),
                                            pipe);
            pipe.commit();
#else
            __pipeline_memcpy_async((reinterpret_cast<int4*>(&shmem_warp_stream_ptr[(SHMEM_STRIDE * i)])) + laneId,
                                (reinterpret_cast<const int4*>(&src_gmem_warp_stream_ptr[(GLOBAL_MEM_STRIDE * i)])) + laneId,
                                sizeof(int4));
            __pipeline_commit();
#endif
        }

#if USE_CPP_API
        pipe.wait_prior<0>();
#else
        __pipeline_wait_prior(0);
#endif
        __syncthreads();

        // These fragments will accumulate the result of A and B matrix fragment multiplications
        // along the K_GLOBAL dimension.
        wmma::fragment<wmma::accumulator, M, N, K, float> c[WARP_COL_TILES][WARP_ROW_TILES];

        // Load the C matrix tiles into fragments from shared memory.
#pragma unroll
        for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
            for (int j = 0; j < WARP_ROW_TILES; j++) {
                const float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * N + j * N;

                wmma::load_matrix_sync(c[i][j], tile_ptr, SHMEM_STRIDE, C_LAYOUT);
            }
        }

        __syncthreads();

        // Scale the C matrix.
#pragma unroll
       for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
            for (int j = 0; j < WARP_ROW_TILES; j++) {
#pragma unroll
                for (int t = 0; t < c[i][j].num_elements; t++) {
                    c[i][j].x[t] *= beta;
                }
            }
        }

        // Select what warp copies what matrix to shared memory.
        // Warps 0-3 copy the A matrix, warps 4-7 copy the B matrix.
        const __nv_bfloat16 *warp_ptr = (warpId < (WARPS_PER_BLOCK/2)) ? (&A[block_tile_i * M * K_GLOBAL] + M * K_GLOBAL * (warpId % (WARPS_PER_BLOCK/2)) * 2) :
                                              (&B[block_tile_j * N * K_GLOBAL] + N * K_GLOBAL * (warpId % (WARPS_PER_BLOCK/2)) * 2);

        // Go through the global K dimension by a fixed step at a time.
#pragma unroll
        for (int tile_k = 0; tile_k < K_TILES; tile_k += CHUNK_K) {
            // Copy slices of the A and B matrices to shared memory.
            // The first half of the warps in the CTA copy the A matrix, the rest copy the B matrix.
            size_t shmem_idx = warpId < (WARPS_PER_BLOCK/2) ? (M * (warpId % (WARPS_PER_BLOCK/2)) * 2) : 
                                                              (N * (warpId % (WARPS_PER_BLOCK/2)) * 2 + shmem_idx_b_off);

            // First half of the warp copies the first row / column of the matrix,
            // the second half of the warp copies the next.
            const __nv_bfloat16 *lane_ptr = (warp_ptr + tile_k * K + (laneId / CHUNK_COPY_LINE_LANES) * K_GLOBAL);

            // Shift the second half of the warp to the next row / column in the shared memory.
            shmem_idx += laneId / CHUNK_COPY_LINE_LANES;

#pragma unroll
            for(int i = 0; i < ((WARP_SIZE/2) / CHUNK_COPY_LINES_PER_WARP) * 2; i++) {
                // Copy 16 bytes at once in each lane.
#if USE_CPP_API
                nvcuda_namespace::memcpy_async(*((int4*)&shmem[shmem_idx][0] + (laneId % CHUNK_COPY_LINE_LANES)),
                                                *((int4*)lane_ptr + (laneId % CHUNK_COPY_LINE_LANES)), pipe);
                pipe.commit();
#else
                __pipeline_memcpy_async((int4*)&shmem[shmem_idx][0] + (laneId % CHUNK_COPY_LINE_LANES),
                                        (int4*)lane_ptr + (laneId % CHUNK_COPY_LINE_LANES), sizeof(int4));
                __pipeline_commit();
#endif
                // Advance the global memory pointer and the shared memory index.
                lane_ptr = lane_ptr + K_GLOBAL * CHUNK_COPY_LINES_PER_WARP;
                shmem_idx += CHUNK_COPY_LINES_PER_WARP;
            }

#if USE_CPP_API
            pipe.wait_prior<0>();
#else
            __pipeline_wait_prior(0);
#endif
            __syncthreads();

            // Compute a grid of C matrix tiles in each warp.
#pragma unroll
            for (int k_step = 0; k_step < CHUNK_K; k_step++) {
                wmma::fragment<wmma::matrix_a, M, N, K, __nv_bfloat16, wmma::row_major> a[WARP_COL_TILES];
                wmma::fragment<wmma::matrix_b, M, N, K, __nv_bfloat16, wmma::col_major> b[WARP_ROW_TILES];

#pragma unroll
                for (int i = 0; i < WARP_COL_TILES; i++) {
                    size_t shmem_idx_a = (warpId / BLOCK_ROW_WARPS) * M * BLOCK_ROW_WARPS + (i * M);
                    const __nv_bfloat16 *tile_ptr = &shmem[shmem_idx_a][k_step * K];

                    wmma::load_matrix_sync(a[i], tile_ptr, K * CHUNK_K + SKEW_BF16);

#pragma unroll
                    for (int j = 0; j < WARP_ROW_TILES; j++) {
                        if (i == 0) {
                            // Load the B matrix fragment once, because it is going to be reused
                            // against the other A matrix fragments.
                            size_t shmem_idx_b = shmem_idx_b_off + (WARP_ROW_TILES * N) * (warpId%2) + (j * N);
                            const __nv_bfloat16 *tile_ptr = &shmem[shmem_idx_b][k_step * K];

                            wmma::load_matrix_sync(b[j], tile_ptr, K * CHUNK_K + SKEW_BF16);
                        }

                        wmma::mma_sync(c[i][j], a[i], b[j], c[i][j]);
                    }
                }
            }

            __syncthreads();
        }

        // Store the D fragments to shared memory.
#pragma unroll
        for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
            for (int j = 0; j < WARP_ROW_TILES; j++) {
#pragma unroll
                // Uniform, point-wise transformations of ALL fragment elements by ALL threads in the
                // warp are well-defined even though element indices within fragment storage are not defined.
                for (int t = 0; t < c[i][j].num_elements; t++)
                    c[i][j].x[t] *= alpha;

                float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * K + j * N;

                wmma::store_matrix_sync(tile_ptr, c[i][j], SHMEM_STRIDE, C_LAYOUT);
            }
        }

        __syncthreads();

        // Now that shared memory contains all the D tiles, stream them to global memory.
        float *dst_gmem_warp_stream_ptr = &D[gmem_idx];

#pragma unroll
        for (int i = 0; i < N; i++) {
            *((int4*)(dst_gmem_warp_stream_ptr + GLOBAL_MEM_STRIDE * i) + laneId) =
                *((int4*)(shmem_warp_stream_ptr + SHMEM_STRIDE * i) + laneId);
        }

        __syncthreads();
    }
#endif
}

compute_bf16gemm_async_copy 是一个 CUDA 内核函数，用于执行异步复制版本的 BF16 GEMM（General Matrix Multiply）计算
根据代码片段的内容，可以看到该内核函数执行了以下操作：

声明了共享内存数组 shmem，用于存储片段数据。
根据线程的 warpId 和 laneId 计算 warp 的索引和 ID。
计算 A、B 和 C 矩阵片段在共享内存中的偏移和指针。
对 beta 进行调整，以便在每个矩阵块计算结束后乘以 alpha。
使用循环和异步内存拷贝操作，将 C 矩阵的多个片段从全局内存复制到共享内存中。
对共享内存中的 C 矩阵进行加载和缩放操作。
根据 warp 的索引，选择性地将 A 或 B 矩阵的片段从全局内存复制到共享内存中。
在 K 维度上循环迭代，将 A 和 B 矩阵片段加载到寄存器片段中，并执行矩阵乘法操作。
将结果片段存储到共享内存中。
将共享内存中的 D 矩阵片段流式传输到全局内存中。
等待所有线程完成。
该内核函数使用了CUDA的 wmma（Warp Matrix Multiply Accumulate）函数，用于实现矩阵乘法的并行计算。它利用共享内存和 warp 级别的数据交换来优化数据访问和计算效率。

float *shmem_warp_stream_ptr = (float*)&shmem[0][0] + warpId * SHMEM_STRIDE * N;

shmem_warp_stream_ptr 是一个指向共享内存中某个特定 warp 的数据流指针

这里的 &shmem[0][0] 是共享内存数组 shmem 的起始地址，(float*)&shmem[0][0] 将其强制转换为 float* 类型的指针。
warpId 是 warp 的索引，乘以 SHMEM_STRIDE * N 用于计算 warp 在共享内存中的偏移量。SHMEM_STRIDE 是共享内存的行跨度（stride），N 是矩阵的列数。
因此，shmem_warp_stream_ptr 指针将指向共享内存中属于特定 warp 的数据流的起始位置，以便在数据流操作中进行读写操作。

#define USE_CPP_API 0
beta /= alpha;
#if USE_CPP_API
    nvcuda_namespace::pipeline pipe;
#endif

beta /= alpha; 是一个除法运算，将 beta 的值除以 alpha 的值，并将结果赋值给 beta。

#if USE_CPP_API 是一个条件编译预处理指令，
用于根据宏定义#define USE_CPP_API 0
USE_CPP_API 宏被定义为0，因此条件 #if USE_CPP_API 不会满足，代码中的 nvcuda_namespace::pipeline pipe; 不会被编译进最终的代码。

如果 USE_CPP_API 宏被定义， USE_CPP_API 的值来选择编译不同的代码路径，进入条件编译块。
nvcuda_namespace::pipeline pipe; 是在条件编译块内声明了一个名为 pipe 的 nvcuda_namespace 命名空间下的 pipeline 对象。根据代码片段的上下文，pipeline 可能是一个用于管理异步内存操作的 CUDA 编程接口（CUDA API）的对象

需要注意的是，nvcuda_namespace 是一个占位符，实际的命名空间可能是 cuda 或其他与 CUDA 相关的命名空间

#pragma unroll
        for (int i = 0; i < N; i++) {
#if USE_CPP_API
            nvcuda_namespace::memcpy_async(*((int4*)(shmem_warp_stream_ptr + SHMEM_STRIDE * i) + laneId),
                                            *((int4*)(src_gmem_warp_stream_ptr + GLOBAL_MEM_STRIDE * i) + laneId),
                                            pipe);
            pipe.commit();
#else
            __pipeline_memcpy_async((reinterpret_cast<int4*>(&shmem_warp_stream_ptr[(SHMEM_STRIDE * i)])) + laneId,
                                (reinterpret_cast<const int4*>(&src_gmem_warp_stream_ptr[(GLOBAL_MEM_STRIDE * i)])) + laneId,
                                sizeof(int4));
            __pipeline_commit();
#endif
        }

根据给出的代码片段，#pragma unroll 是一个编译器指令，用于请求编译器对循环进行展开优化。循环中的代码将被展开多次，以减少循环开销并提高执行效率。

在代码片段中，#pragma unroll 后的循环指令会被编译器展开。循环的目的是将数据从全局内存 (src_gmem_warp_stream_ptr) 复制到共享内存 (shmem_warp_stream_ptr) 中的指定位置。具体来说，它将 int4 类型的数据块从全局内存复制到共享内存中。
#pragma unroll 是一个编译指示符，用于请求编译器对循环进行展开优化。

在代码中使用 #pragma unroll 可以提示编译器在编译时对紧接着的循环进行展开。循环展开是一种编译器优化技术，它将循环中的迭代次数事先展开为一系列重复的代码。这样可以减少循环开销，提高执行效率。

使用 #pragma unroll 可以向编译器提供循环展开的建议，但具体是否展开以及展开的程度还取决于编译器和编译器设置。在某些情况下，编译器可能会自动选择是否展开循环或选择适当的展开程度。

需要注意的是，使用 #pragma unroll 并不能保证循环一定会被展开，它只是向编译器发出一个建议。展开的效果和程度可能因编译器、编译选项和循环的特性而有所不同。

根据条件编译的情况，如果 USE_CPP_API 宏被定义，则使用了 nvcuda_namespace::memcpy_async 函数进行异步内存拷贝，并使用了 pipe.commit() 来提交异步操作。这可能是使用 CUDA C++ API 提供的异步内存拷贝功能。
如果 USE_CPP_API 宏未定义，则使用 __pipeline_memcpy_async 函数进行异步内存拷贝，并使用 __pipeline_commit() 提交异步操作。这可能是使用 CUDA 原生 API 提供的异步内存拷贝功能。

#include <iostream>

void foo()
{
    const int size = 4;
    int arr[size] = {1, 2, 3, 4};
    int sum = 0;

    // 使用 #pragma unroll 指示编译器对循环进行展开优化
    #pragma unroll
    for (int i = 0; i < size; i++)
    {
        sum += arr[i];
    }

    std::cout << "Sum: " << sum << std::endl;
}

int main()
{
    foo();
    return 0;
}

foo() 函数使用了 #pragma unroll 来指示编译器对循环进行展开优化。循环对数组中的元素求和，展开优化可以减少循环开销，提高计算效率。

代码片段8.1 compute_bf16gemm_async_copy 内核函数

#if USE_CPP_API
        pipe.wait_prior<0>();
#else
        __pipeline_wait_prior(0);
#endif
        __syncthreads();

        // These fragments will accumulate the result of A and B matrix fragment multiplications
        // along the K_GLOBAL dimension.
        wmma::fragment<wmma::accumulator, M, N, K, float> c[WARP_COL_TILES][WARP_ROW_TILES];

        // Load the C matrix tiles into fragments from shared memory.
#pragma unroll
        for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
            for (int j = 0; j < WARP_ROW_TILES; j++) {
                const float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * N + j * N;

                wmma::load_matrix_sync(c[i][j], tile_ptr, SHMEM_STRIDE, C_LAYOUT);
            }
        }

        __syncthreads();

在共享内存中加载C矩阵的片段到WMMA（Warp级矩阵操作）的累加器片段中。

根据宏定义USE_CPP_API的值，代码中使用了条件编译。如果USE_CPP_API为真，则使用了CUDA C++ API中的异步流(nvcuda_namespace::pipeline)来同步内存拷贝操作。如果USE_CPP_API为假，则使用了CUDA C API中的异步流(__pipeline_wait_prior())来同步内存拷贝操作。
无论是使用CUDA C++ API还是CUDA C API，__syncthreads()都用于在CUDA线程块内进行同步，以确保共享内存中的数据加载操作完成后再进行后续操作。
之后，使用wmma::load_matrix_sync()函数从共享内存中加载C矩阵的片段到WMMA的累加器片段中。这个操作会将共享内存中的数据按矩阵布局加载到WMMA累加器片段中。
请注意，#pragma unroll 用于指示编译器对循环进行展开优化，以提高计算效率。在这里，循环用于迭代访问共享内存中的不同位置，并加载数据到WMMA的累加器片段中。展开循环可以减少循环开销和控制流分支，提高内存访问的连续性和计算性能。
在这里插入代码片

代码片段8.2 三个嵌套的循环使用#pragma unroll展开

#pragma unroll
       for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
            for (int j = 0; j < WARP_ROW_TILES; j++) {
#pragma unroll
                for (int t = 0; t < c[i][j].num_elements; t++) {
                    c[i][j].x[t] *= beta;
                }
            }
        }

在这段代码中，三个嵌套的循环用于迭代访问 c 数组中的不同元素，并将其乘以变量 beta。通过使用 #pragma unroll，编译器可以在编译时将循环展开为多个迭代的序列，以减少循环开销和控制流分支。

展开循环可以提高计算效率，特别是在对小循环进行展开时，可以减少迭代次数和循环开销。展开循环还可以提高内存访问的连续性，从而进一步提高计算性能。

需要注意的是，展开循环可能会增加代码量和寄存器使用量，因此在使用 #pragma unroll 时需要权衡代码大小和性能之间的折衷。在实际应用中，可以根据具体情况进行优化，并根据编译器和目标硬件的特性进行测试和调整。

代码片段8.3 并行计算中将输入矩阵 A 和 B 的切片拷贝到共享内存中

 const __nv_bfloat16 *warp_ptr = (warpId < (WARPS_PER_BLOCK/2)) ? (&A[block_tile_i * M * K_GLOBAL] + M * K_GLOBAL * (warpId % (WARPS_PER_BLOCK/2)) * 2) :
                                              (&B[block_tile_j * N * K_GLOBAL] + N * K_GLOBAL * (warpId % (WARPS_PER_BLOCK/2)) * 2);

        // Go through the global K dimension by a fixed step at a time.
#pragma unroll
        for (int tile_k = 0; tile_k < K_TILES; tile_k += CHUNK_K) {
            // Copy slices of the A and B matrices to shared memory.
            // The first half of the warps in the CTA copy the A matrix, the rest copy the B matrix.
            size_t shmem_idx = warpId < (WARPS_PER_BLOCK/2) ? (M * (warpId % (WARPS_PER_BLOCK/2)) * 2) : 
                                                              (N * (warpId % (WARPS_PER_BLOCK/2)) * 2 + shmem_idx_b_off);

            // First half of the warp copies the first row / column of the matrix,
            // the second half of the warp copies the next.
            const __nv_bfloat16 *lane_ptr = (warp_ptr + tile_k * K + (laneId / CHUNK_COPY_LINE_LANES) * K_GLOBAL);

            // Shift the second half of the warp to the next row / column in the shared memory.
            shmem_idx += laneId / CHUNK_COPY_LINE_LANES;

#pragma unroll
            for(int i = 0; i < ((WARP_SIZE/2) / CHUNK_COPY_LINES_PER_WARP) * 2; i++) {
                // Copy 16 bytes at once in each lane.
#if USE_CPP_API
                nvcuda_namespace::memcpy_async(*((int4*)&shmem[shmem_idx][0] + (laneId % CHUNK_COPY_LINE_LANES)),
                                                *((int4*)lane_ptr + (laneId % CHUNK_COPY_LINE_LANES)), pipe);
                pipe.commit();
#else
                __pipeline_memcpy_async((int4*)&shmem[shmem_idx][0] + (laneId % CHUNK_COPY_LINE_LANES),
                                        (int4*)lane_ptr + (laneId % CHUNK_COPY_LINE_LANES), sizeof(int4));
                __pipeline_commit();
#endif
                // Advance the global memory pointer and the shared memory index.
                lane_ptr = lane_ptr + K_GLOBAL * CHUNK_COPY_LINES_PER_WARP;
                shmem_idx += CHUNK_COPY_LINES_PER_WARP;
            }

在并行计算中将输入矩阵 A 和 B 的切片拷贝到共享内存中。

在代码中，首先根据 warpId 的值判断当前 warp 是属于前半部分还是后半部分，然后分别选择从矩阵 A 或 B 中获取数据。根据 tile_k 的迭代，循环进行切片拷贝操作。如果当前 warp 是前半部分，那么从矩阵 A 中获取数据，否则从矩阵 B 中获取数据。
在拷贝过程中，使用 lane_ptr 来指向当前 warp 需要拷贝的数据片段，然后将该数据片段从全局内存复制到共享内存中。每个 warp 中的每个线程都通过 laneId 访问不同的数据片段，并通过 shmem_idx 计算在共享内存中的偏移位置。
根据编译时的宏定义 USE_CPP_API，选择使用不同的内存拷贝方式。如果 USE_CPP_API 定义为 1，则使用 nvcuda_namespace::memcpy_async 函数异步地进行内存拷贝，并通过 pipe 对象进行流水线操作。如果 USE_CPP_API 定义为 0，则使用 __pipeline_memcpy_async 函数进行内存拷贝，并通过 __pipeline_commit 等待拷贝操作完成。

代码片段8.4 在每个 warp 中执行并行的矩阵乘法操作，并将结果存储到共享内存中，然后再将共享内存中的数据流式传输到全局内存

#if USE_CPP_API
            pipe.wait_prior<0>();
#else
            __pipeline_wait_prior(0);
#endif
            __syncthreads();

            // Compute a grid of C matrix tiles in each warp.
#pragma unroll
            for (int k_step = 0; k_step < CHUNK_K; k_step++) {
                wmma::fragment<wmma::matrix_a, M, N, K, __nv_bfloat16, wmma::row_major> a[WARP_COL_TILES];
                wmma::fragment<wmma::matrix_b, M, N, K, __nv_bfloat16, wmma::col_major> b[WARP_ROW_TILES];

#pragma unroll
                for (int i = 0; i < WARP_COL_TILES; i++) {
                    size_t shmem_idx_a = (warpId / BLOCK_ROW_WARPS) * M * BLOCK_ROW_WARPS + (i * M);
                    const __nv_bfloat16 *tile_ptr = &shmem[shmem_idx_a][k_step * K];

                    wmma::load_matrix_sync(a[i], tile_ptr, K * CHUNK_K + SKEW_BF16);

#pragma unroll
                    for (int j = 0; j < WARP_ROW_TILES; j++) {
                        if (i == 0) {
                            // Load the B matrix fragment once, because it is going to be reused
                            // against the other A matrix fragments.
                            size_t shmem_idx_b = shmem_idx_b_off + (WARP_ROW_TILES * N) * (warpId%2) + (j * N);
                            const __nv_bfloat16 *tile_ptr = &shmem[shmem_idx_b][k_step * K];

                            wmma::load_matrix_sync(b[j], tile_ptr, K * CHUNK_K + SKEW_BF16);
                        }

                        wmma::mma_sync(c[i][j], a[i], b[j], c[i][j]);
                    }
                }
            }

            __syncthreads();
        }

        // Store the D fragments to shared memory.
#pragma unroll
        for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
            for (int j = 0; j < WARP_ROW_TILES; j++) {
#pragma unroll
                // Uniform, point-wise transformations of ALL fragment elements by ALL threads in the
                // warp are well-defined even though element indices within fragment storage are not defined.
                for (int t = 0; t < c[i][j].num_elements; t++)
                    c[i][j].x[t] *= alpha;

                float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * K + j * N;

                wmma::store_matrix_sync(tile_ptr, c[i][j], SHMEM_STRIDE, C_LAYOUT);
            }
        }

        __syncthreads();

        // Now that shared memory contains all the D tiles, stream them to global memory.
        float *dst_gmem_warp_stream_ptr = &D[gmem_idx];

#pragma unroll
        for (int i = 0; i < N; i++) {
            *((int4*)(dst_gmem_warp_stream_ptr + GLOBAL_MEM_STRIDE * i) + laneId) =
                *((int4*)(shmem_warp_stream_ptr + SHMEM_STRIDE * i) + laneId);
        }

        __syncthreads();
    }
#endif

在并行计算中将计算结果从共享内存写回到全局内存的过程。

首先，根据编译时的宏定义 USE_CPP_API 的值，选择使用不同的同步机制。如果 USE_CPP_API 定义为 1，则使用 pipe.wait_prior<0>() 函数进行等待操作，等待之前的内存拷贝操作完成。如果 USE_CPP_API 定义为 0，则使用 __pipeline_wait_prior(0) 函数进行等待操作，等待之前的内存拷贝操作完成。然后通过 __syncthreads() 函数实现线程的同步，保证所有线程在继续执行之前都已经完成了等待操作。
接下来，在每个 warp 中计算 C 矩阵的片段。通过循环展开优化，迭代 k_step 进行计算。在每次迭代中，首先从共享内存中加载 A 矩阵片段到 a 变量中，然后根据 i 的值判断是否需要加载 B 矩阵片段到 b 变量中。接下来，通过 wmma::mma_sync 函数使用 wmma::load_matrix_sync 函数将数据加载到对应的 wmma::fragment 中
使用 wmma::mma_sync 函数执行矩阵乘法操作，将 A 和 B 矩阵块相乘，并累加到 C 矩阵块中。这一步骤通过嵌套的循环遍历 c 数组，对每个 C 矩阵块执行矩阵乘法操作。
之后，完成矩阵乘法后，将 C 矩阵块的结果存储回共享内存中。使用 wmma::store_matrix_sync 函数将结果存储到共享内存中的相应位置。
通过循环展开优化，将计算结果从 c 变量存储到共享内存中的 D 矩阵片段。最后，通过循环展开优化，通过流式传输的方式将共享内存中的 D 矩阵块传输回全局内存。使用 dst_gmem_warp_stream_ptr 指针和 shmem_warp_stream_ptr 指针将共享内存中的数据传输到全局内存。
最后，将共享内存中的 D 矩阵片段流式写回到全局内存中的 D 矩阵。

代码片段9 简单的 WMMA（Warp Matrix Multiply and Accumulate）矩阵乘法的 CUDA Kernel，用于在 CUDA 8.0 及更高版本的 GPU 架构上执行

__global__ void simple_wmma_bf16gemm(__nv_bfloat16 *a, __nv_bfloat16 *b, float *c, float *d, int m_ld, int n_ld, int k_ld, float alpha, float beta)
{
#if __CUDA_ARCH__ >= 800
   // Leading dimensions. Packed with no transpositions.
    int lda = k_ld;
    int ldb = k_ld;
    int ldc = n_ld;

   // Tile using a 2D grid
   int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   int warpN = (blockIdx.y * blockDim.y + threadIdx.y);
 
   // Declare the fragments
   wmma::fragment<wmma::matrix_a, M, N, K, __nv_bfloat16, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, M, N, K, __nv_bfloat16, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, M, N, K, float> acc_frag;
   wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

   wmma::fill_fragment(acc_frag, 0.0f);

   // Loop over k
   for (int i = 0; i < k_ld; i += K) {
      int aCol = i; 
      int aRow = warpM * M;

      int bCol = i;
      int bRow = warpN * N;

      // Bounds checking
      if (aRow < m_ld && aCol < k_ld && bRow < k_ld && bCol < n_ld) {
         // Load the inputs
         wmma::load_matrix_sync(a_frag, a + aCol + aRow * lda, lda);
         wmma::load_matrix_sync(b_frag, b + bRow + bCol * ldb, ldb);
 
         // Perform the matrix multiplication
         wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

      }
   }

   // Load in the current value of c, scale it by beta, and add this our result scaled by alpha
   int cCol = warpN * N;
   int cRow = warpM * M;

   if (cRow < m_ld && cCol < n_ld) {
      wmma::load_matrix_sync(c_frag, c + cCol + cRow * ldc, ldc, wmma::mem_row_major);

      for(int i=0; i < c_frag.num_elements; i++) {
         c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
      }

      // Store the output
      wmma::store_matrix_sync(d + cCol + cRow * ldc, c_frag, ldc, wmma::mem_row_major);
   }
#endif
}