CUDA Learning Series: cuda-runtime-api (Part 3)

Latest recommended article published on 2023-03-11 12:59:53

AI_央木

Article tags: learning, algorithms

Article link: https://blog.csdn.net/m0_46464037/article/details/126594548

1. Adding two vectors

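#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal stand-in for the checkRuntime helper used throughout this series
// (its original definition is not shown in this part): report the failing
// call and exit if a runtime API call returns an error.
#define checkRuntime(op)                                                  \
    do {                                                                  \
        cudaError_t code_ = (op);                                         \
        if (code_ != cudaSuccess) {                                       \
            printf("%s failed at %s:%d: %s\n", #op, __FILE__, __LINE__,   \
                   cudaGetErrorString(code_));                            \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void vector_add_kernel(const float* a, const float* b, float* c, int ndata){

    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx >= ndata) return;  // threads past the end of the data do nothing

    c[idx] = a[idx] + b[idx];
}

void vector_add(const float* a, const float* b, float* c, int ndata){

    const int nthreads = 512;
    int block_size = ndata < nthreads ? ndata : nthreads;  // if ndata < nthreads, block_size = ndata is enough
    int grid_size = (ndata + block_size - 1) / block_size; // number of blocks needed to cover all the data, rounded up
    printf("block_size = %d, grid_size = %d\n", block_size, grid_size);
    vector_add_kernel<<<grid_size, block_size, 0, nullptr>>>(a, b, c, ndata);
    checkRuntime(cudaPeekAtLastError());  // catch invalid launch configurations
}

int main(){

    const int size = 3;
    float vector_a[size] = {2, 3, 2};
    float vector_b[size] = {5, 3, 3};
    float vector_c[size] = {0};

    float* vector_a_device = nullptr;
    float* vector_b_device = nullptr;
    float* vector_c_device = nullptr;

    checkRuntime(cudaMalloc(&vector_a_device, size * sizeof(float)));
    checkRuntime(cudaMalloc(&vector_b_device, size * sizeof(float)));
    checkRuntime(cudaMalloc(&vector_c_device, size * sizeof(float)));

    checkRuntime(cudaMemcpy(vector_a_device, vector_a, size * sizeof(float), cudaMemcpyHostToDevice));
    checkRuntime(cudaMemcpy(vector_b_device, vector_b, size * sizeof(float), cudaMemcpyHostToDevice));

    vector_add(vector_a_device, vector_b_device, vector_c_device, size);
    // Blocking copy on the default stream: it runs only after the kernel has finished.
    checkRuntime(cudaMemcpy(vector_c, vector_c_device, size * sizeof(float), cudaMemcpyDeviceToHost));

    for(int i = 0; i < size; ++i)
        printf("vector_c[%d] = %f\n", i, vector_c[i]);

    checkRuntime(cudaFree(vector_a_device));
    checkRuntime(cudaFree(vector_b_device));
    checkRuntime(cudaFree(vector_c_device));
    return 0;
}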

block_size = 3, grid_size = 1
vector_c[0] = 7.000000
vector_c[1] = 6.000000
vector_c[2] = 5.000000
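
To get a feel for how the launch configuration in vector_add behaves for sizes other than 3, the sizing logic can be pulled out into plain host code. This is only a sketch: LaunchConfig, make_launch_config and print_launch_config are hypothetical names introduced here for illustration, and the default of 512 threads is taken from vector_add above.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper type (not in the original): the two launch parameters.
struct LaunchConfig {
    int block_size;  // threads per block
    int grid_size;   // blocks per grid
};

// Mirrors the sizing logic of vector_add: cap the block at nthreads threads,
// then round the block count up so every element gets a thread.
LaunchConfig make_launch_config(int ndata, int nthreads = 512) {
    // ndata == 0 would give block_size == 0 and a division by zero below.
    assert(ndata > 0 && nthreads > 0);
    int block_size = ndata < nthreads ? ndata : nthreads;
    int grid_size = (ndata + block_size - 1) / block_size;
    return {block_size, grid_size};
}

// Prints the configuration and how many launched threads have no element to process.
void print_launch_config(int ndata) {
    LaunchConfig cfg = make_launch_config(ndata);
    int launched = cfg.block_size * cfg.grid_size;
    std::printf("ndata = %d -> block_size = %d, grid_size = %d, idle threads = %d\n",
                ndata, cfg.block_size, cfg.grid_size, launched - ndata);
}
```

Calling print_launch_config with 3, 512, 513 and 1000 shows the pattern: up to 512 elements fit in a single block, and one element more already needs a second block whose threads are mostly idle. Those idle threads are the ones the `if(idx >= ndata) return;` check in the kernel filters out.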

The main thing I want to record here is why grid_size is computed this way:

grid_size = (ndata + block_size - 1) / block_size

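Suppose the image is 400x900 and we choose a block of size (32,32). The number of blocks along the image's x and y dimensions would then be 400/32 and 900/32. Neither image dimension is an integer multiple of the corresponding block dimension, though, so integer division leaves us with a 12 x 28 grid, for a total of 384 x 896 threads (since 32 x 12 = 384 and 32 x 28 = 896).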

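As we can see, the total number of threads in each dimension is smaller than the corresponding image dimension. What we need to do is round the number of blocks up, so that when an image dimension is not a multiple of the block dimension we create one extra block to cover the remaining pixels.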

Here are two ways to do that.

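Instead of using integer division to compute the number of blocks, we use floating-point division and take the ceil of the result.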

int image_width = 400;
int image_height = 900;
dim3 block(32,32);
dim3 grid;
grid.x = ceil( float(image_width)/block.x );
grid.y = ceil( float(image_height)/block.y );

Another, cleverer approach is to use the following formula:

int image_width = 400;
int image_height = 900;
dim3 block(32,32);
dim3 grid;
grid.x = (image_width + block.x - 1 )/block.x;
grid.y = (image_height + block.y - 1 )/block.y;
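
Why the integer formula rounds up can be checked with a short derivation. This is a sketch added for illustration; the symbols n (data size), b (block size), q and r are not in the original.

```latex
Let $n \ge 1$ and $b \ge 1$ be integers and write $n = qb + r$ with $0 \le r < b$. Then
\[
  n + b - 1 = (q + 1)\,b + (r - 1).
\]
If $r = 0$, then $n + b - 1 = qb + (b - 1)$, so
\[
  \left\lfloor \frac{n + b - 1}{b} \right\rfloor = q = \left\lceil \frac{n}{b} \right\rceil .
\]
If $1 \le r \le b - 1$, then $0 \le r - 1 \le b - 2$, so
\[
  \left\lfloor \frac{n + b - 1}{b} \right\rfloor = q + 1 = \left\lceil \frac{n}{b} \right\rceil .
\]
Example: $400 = 12 \cdot 32 + 16$, so $\lfloor (400 + 31)/32 \rfloor = \lfloor 431/32 \rfloor = 13$.
```

In both cases the integer division of n + b - 1 by b equals the ceiling of n / b, which is exactly the "one extra block when there is a remainder" behavior described earlier.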

When the grid is created in either of these ways, we end up with a 13 x 29 grid, for a total of 416 x 928 threads.
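
The numbers above can be checked on the host without a GPU. The following is a minimal sketch in plain C++ that uses int in place of dim3; the function names blocks_floor, blocks_ceil_float and blocks_ceil_int are hypothetical and introduced only for this comparison.

```cpp
#include <cassert>
#include <cmath>

// Plain integer division: drops the partial block at the end.
int blocks_floor(int n, int b) {
    assert(b > 0);
    return n / b;
}

// Method 1: floating-point division followed by ceil.
int blocks_ceil_float(int n, int b) {
    assert(b > 0);
    return static_cast<int>(std::ceil(static_cast<float>(n) / b));
}

// Method 2: integer-only rounding up.
int blocks_ceil_int(int n, int b) {
    assert(b > 0);
    return (n + b - 1) / b;
}
```

For the 400x900 image with 32x32 blocks, blocks_floor gives 12 and 28, while both rounding-up versions give 13 and 29. One reason to prefer the integer version: a float has only a 24-bit significand, so for very large n the conversion to float can lose the low bits and the ceil result may be off, whereas the integer formula stays exact as long as n + b - 1 does not overflow.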

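In this case, however, the total number of threads in each dimension is larger than the corresponding image dimension. Some threads would therefore access memory outside the image, which is undefined behavior. The fix is to perform a bounds check inside the kernel and do work only in the threads that fall within the image. This of course requires passing the image dimensions to the kernel as parameters. The following sample kernel shows this process.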