CUDA | Writing and Compiling a CUDA Code (中英版)

1LOVESJohnny

已于 2024-02-05 00:58:49 修改

阅读量593

点赞数 7

分类专栏： CUDA 文章标签： CUDA c++ 学习

于 2024-01-31 23:48:59 首次发布

本文链接：https://blog.csdn.net/iLOVEJohnny/article/details/135964387

版权

CUDA 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

背景

CUDA和C++很像，只是多了一些扩展语法和功能。
同时也包括了C++的所有缺陷，段错误，还更难检测和debug出来。（233333）

下面是一个简单的示例：
– 如果我们要求2个向量的加和，则需要：
– 首先照CPU上声明2个向量
– 然后将它们移到GPU上
– 在GPU上将2个向量叠加
– 将它们传回CPU
– 在CPU上输出它们

代码的开始需要引用头文件：

#include <iostream>
#include <cuda.h>

Background

– CUDA is very similar to C++, with a few additions
– All the pitfalls, segmentation fault would remain in CUDA but more challenging to detect

Example: Vector addition
– takes two vectors on the CPU
– passes them to the GPU
– adds them on the GPU
– passes them back to the CPU
– outputs them on the CPU

The full code starts with:

#include <iostream>
#include <cuda.h>

内存管理

内存最好是在CPU上将GPU的内存进行分配。
在GPU上进行动态内存分配也是可行的，只是出于性能优化的角度考虑，不建议这么做。

下面来看内存分配的一个代码示例：需要指定内存的字节数

float *a, *b, *c;
cudaMalloc((void **) &a, N*sizeof(float));

给“a”分配一个GPU上的内存地址，作为一块大小为N x sizeof(float) 字节数的内存块，其中sizeof(float) 为4。

Memory

– Memory is best allocated on the GPU from the CPU
– Dynamic memory allocation is possible from the GPU, but not advisable for performance reasons

Allocating memory: input the number of bytes

float *a, *b, *c;
cudaMalloc((void **) &a, N*sizeof(float));

– Sets “a” equal to a memory address on the GPU that is the start of a block of memory of size N*sizeof(float) bytes.

拷贝数据

首先，在CPU上声明指针aHost，同时分配N个元素的内存空间。

float *aHost = new float [N];

接下来将数据从CPU拷贝到GPU上。

cudaMemcpy(a, aHost, N*sizeof(float), cudaMemcpyHosttoDevice);

– 上面的代码从aHost指针上拷贝了N x sizeof(float) 字节数的数据到GPU上的a指针指定的空间。
– 如果希望将数据拷贝回CPU，则使用“cudaMemcpyDevicetoHost”即可。

使用完GPU的一块内存后，需要释放指针的数据空间：

cudaFree(a);

– 以上代码将a指针指定的空间块释放，同时保留了a指针，以备后续分配新的GPU内存空间（通过调用cudaMalloc）。

Copying data

– On CPU, allocate spare as normal:

float *aHost = new float [N];

– Memory copy:

cudaMemcpy(a, aHost, N*sizeof(float), cudaMemcpyHosttoDevice);

– copies N*sizeof(float) bytes of data from aHost to a.
– In order to copy data back from the GPU, use: “cudaMemcpyDevicetoHost”.

– Freeing memory:

cudaFree(a);

– release the memory pointed to by a for later use by other cudaMalloc calls.

示例：GPU CUDA kernel实现向量叠加

GPU CUDA kernel即在GPU上用CUDA定义的函数。

__global void add(float* a, float* b, float* c, int N)
{
	int I = threadIdx.x;
    if (I < N)
    {
	    c[i] = a[i] + b[i];
    }
}

– GPU kernel核函数需要由_ _global_ _ 关键词指定。
– GPU kernel和函数需设置为void空返回值类型。
– 通常没有从核函数直接返回的信息，主要是因为GPU多线程异步调用。
– 线程数由参数threadIdx(.x, .y, .z)结构体指定。

Example: Vecotr addition - kernel

__global void add(float* a, float* b, float* c, int N)
{
	int I = threadIdx.x;
    if (I < N)
    {
	    c[i] = a[i] + b[i];
    }
}

– Kernel designated by _ _global_ _ keyword
– Kernel must have void return type
– No direct return of information possible from kernels (asynchronous execution)
– Thread number given by the struct threadIdx(.x, .y, .z)

加载/运行kernel

加载kernel（或者叫运行kernel）时，需要指定thread、block和grid的维度数值。
调用一个简单的kernel，可以通过以下代码：

const int N = 1024;
add<<<1, N>>>(...);

Launching kernel

– Kernel launches require thread-block and grid-dimension sizes to be specified

Call a simple kernel:

const int N = 1024;
add<<<1, N>>>(...);

错误处理

错误处理（或者debug）是保证CUDA代码逻辑正确的常用策略。
通常可以使用以下接口：

cudaGetLastError (void);

– 返回kernel运行到最后的error信息

或者希望输出更具体的错误信息，可以通过调用以下接口：

cudaGetErrorString();

– 该接口返回错误信息字符串。

Error handling

cudaGetLastError (void);

– returns last error, but also resets last error to cudaSuccess

cudaGetErrorString();

– returns an error message

示例：向量叠加 - 超大向量

一个线程block最多可以包含1024个线程，这些线程同时使用一个单独的SM硬件模块（streaming multi-processer）。为了对更大的数组进行并行计算（以及调用更多的SM模块），我们需要使用grid个线程block（thread-blocks）。以上可以使用blockIdx来索引当前grid中的block。

以下是一个简单的示例：

// within main function
int main()
{
    ...
    dim3 blocks ((int) ceil(N/1024.0));
    add<<<blocks, 1024>>> (a,b,c,N);
    ...
    return 0;
}

// within CUDA kernel
__global__ void kernel(args ...)
{
    // index the current thread of a certain block
    index = blockIdx.x * blockDim.x + threadIdx.x;
}

指定线程block和grid的常用技巧：
– 如何定义block和grid，通常和如何拆分数据相关。
– 比如grid中的单元（grid-cell）、每个线程，需要处理哪个多少个矩阵元素、数据值。

Vector addition - Larger vectors

– A thread block has a maximum size of 1024 threads and only uses a single SM
– To use larger arrays (and more SMs), we must use a grid of thread-blocks
– Use blockIdx containing index of current block within grid

A simple example:

// within main function
int main()
{
    ...
    dim3 blocks ((int) ceil(N/1024.0));
    add<<<blocks, 1024>>> (a,b,c,N);
    ...
    return 0;
}

// within CUDA kernel
__global__ void kernel(args ...)
{
    // index the current thread of a certain block
    index = blockIdx.x * blockDim.x + threadIdx.x;
}

General thread blocks and grids:
– how to decide which blocks and grids to use, how to devide the data:
– one grid-cell or matrix-element or data-point per thread (at least initially)

复杂的函数

对于复杂的函数逻辑，我们可能需要使用不同的function进行实现，则需要用到不同特性的函数定义：
– _ _device_ _ ：只运行在GPU上的函数
– _ _global_ _ ：同时可以运行在CPU 和 GPU 上的函数
其中，中GPU上运行的函数必须注明 _ _device_ _ 或 _ _global_ _ 关键词。任何_ _device_ _ 函数都可以被_ _global_ _函数调用。

Complex Function

For complicated functions of 2 vectors, we may want to use a separate function:
_ _device_ _ : GPU dedicated functions
_ _global_ _: CPU and GPU available functions

Functions to be run the GPU must have _ _device_ _ or _ _global_ _
Any depth of _ _device_ _ functions can be called within a _ _global_ _ function

线程限制

– 每个block中最多只能包含1024个thread；
– 在线程中，x/y的维度最大值为1024，这种情况下x=1024，y=1；
– block中的z维度最大值为1024；
– grid的维度最大是2^31 - 1个。

Thread limits

– Maximum 1024 threads perblock
– Maximum x/y dimension of block (in threads): 1024
– Maximum z dimension of block: 1024
– Maximum 2 ^ 31 - 1 for grid dims

Profiling / 剖析性能

CUDA程序计时（timing）通常需要注意，主要因为CUDA核函数是异步调用。
若简单实用CPU上的接口clock() 会进行错误的计时，如：漏掉很多耗时。

为了更准确地计时，CUDA引入events的概念。可通过调用以下接口来准确计时：

cudaEvent_t()
cudaEventrecord()
cudaEventSynchronize()
cudaEventElapsedTime()
cudaEventDestroy()

Profiling

– asynchronous calls are problematic with timing CUDA programs
– clock() may not give fine enough timings

– events, to find the time between them afterwards

cudaEvent_t()
cudaEventrecord()
cudaEventSynchronize()
cudaEventElapsedTime()
cudaEventDestroy()

CUDA拓展 - CPU host函数

– C++17代码规范需要被兼容；
在CPU上的函数需要有前缀 _ _host_ _；
只能由CPU host调用；
可以使用nvcc进行全部代码（global，device和host）的编译，而CPU编译器只能用来编译host代码。

CUDA extensions to C++ - Host functions

– C++ 17 code should be permitted
– Functions can be prefixed with _ _host_ _
– Callable on and by the host only
– use nvcc for all compilation: host compiler is used for host-only code

GPU 核函数 - 全局global函数

– 必须有前缀关键词 _ _global_ _
– 在device（GPU）上运行，可以在host（CPU）或者device（GPU）上调用
– 核函数的参数不能是引用（大部分是指针）
– 参数通常通过常量内存（constant memory）传入
– 返回类型必须是void
– 核函数的调用是异步的，即在整个GPU程序完全完成计算前，就会返回计算值

Kernel functions

– must be prefixed with _ _global_ _
– executed on device, callable from host or device
– parameters cannot be references
– parameters are passed via constant memory
– must have void return type
– call is asynchronous - returns before device has finished

GPU 核函数 - GPU device函数

– 使用 _ _device_ _前缀
– 符合所有C++代码规范
– 大部分C++17的特性都满足
– device代码只能在device（GPU）上运行和调用

Device functions

– prefixed with _ _device_ _
– all valid C++ code
– most C++17 features supported
– device code executed on device and callable from device only

device和host函数

函数可以同时加_ _device_ _ 和 _ _host_ _ 前缀，然后被同时编译在CPU和GPU上。

Functions

– Functions can be prefixed as both _ _device_ _ and _ _host_ _ and then are compiled for both CPU and GPU as necessary.

变量属性

_ _device_ _ 定义的一些规则：
– 编译在GPU上的global memory上，所有的核函数可以读或写。
– 在整个应用过程中都存在。

_ _constant_ _ 定义的一些规则：
– 编译在GPU的global memory上
– 核函数可以直接访问_ _constant_ _变量

_ _shared_ _ 定义的一些规则：
– 在thread block运行的SM（streaming multiprocessor）的共享内存中编译
– 在一个thread block的时间周期中存在
– 对同一个block的线程都共享同时可调用

Variable attributes

– _ _device_ _ defined outside functions:
– Resides in global memory on device (kernels can read/write)
– Lasts for whole application

– _ _constant_ _ defined outside functions
– Resides in global memory on device
– kernel functions can directly access _ _constant_ _

– _ _shared_ _:
– Resides in shared memory of Streaming Multiprocessor on which thread block is running
– Lasts for lifetime of thread block
– Shared and accessible for all threads in same block