CUDA Programming

灰辉腾

Last modified 2024-03-20 16:37:04

Tags: neural networks, artificial intelligence, deep learning

First published 2023-03-09 09:05:54

Article link: https://blog.csdn.net/qq_31279329/article/details/129393271

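These notes cover kernel functions, thread layout, the memory model, and the use of streams.

The goal is to implement reduction sum, affine transformation (warp affine), matrix multiplication, and model post-processing.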

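Memory model: pinned memory (host/CPU memory), and global memory and shared memory (GPU memory).

Host (CPU) memory falls into two broad classes: pageable and page-locked, also called pinned. Think of page-locked memory as a VIP room that stays reserved for you.

When rooms run short, the operating system can take pageable memory back and swap it out to disk so that other programs can use the physical memory. This lets the machine hold more data and run more programs than physical memory alone would allow, at the cost of performance.

Pinned memory is locked: its pages stay resident in physical memory and are never swapped out.

Pageable memory is not locked. The program still sees correct data, but the pages can be moved or swapped out at any time, so a device cannot rely on the data sitting at a fixed physical address. In exchange, more programs can run at once.

Because of this locking, the GPU only transfers data to and from pinned memory. A copy from pageable memory is first staged through a pinned buffer by the driver.

The GPU accesses pinned memory directly through DMA (direct memory access).

The closer memory is to the GPU's compute units, the faster it is to access.

Access speed, from slowest to fastest: pinned memory < global memory < shared memory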
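To make the two kinds of host memory concrete, here is a minimal sketch that allocates a pageable buffer with malloc and a pinned buffer with cudaMallocHost, then copies each to global memory. The buffer size, the variable names, and the CHECK error-handling macro are assumptions made for illustration, not part of the original notes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const size_t count = 1 << 20;                // illustrative element count
    const size_t bytes = count * sizeof(float);

    // Ordinary pageable host memory: the OS may move it or swap it out.
    float *pageable = (float *)malloc(bytes);
    if (pageable == nullptr) return EXIT_FAILURE;

    // Page-locked (pinned) host memory: stays resident, so the GPU can DMA it.
    float *pinned = nullptr;
    CHECK(cudaMallocHost((void **)&pinned, bytes));

    // Global memory on the GPU.
    float *device = nullptr;
    CHECK(cudaMalloc((void **)&device, bytes));

    for (size_t i = 0; i < count; ++i) {
        pageable[i] = 1.0f;
        pinned[i] = 2.0f;
    }

    // Both copies work. The pageable one is staged through a pinned buffer
    // by the driver; the pinned one can be transferred directly.
    CHECK(cudaMemcpy(device, pageable, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(device, pinned, bytes, cudaMemcpyHostToDevice));

    float first = 0.0f;
    CHECK(cudaMemcpy(&first, device, sizeof(float), cudaMemcpyDeviceToHost));
    printf("first element on device: %f\n", first);

    CHECK(cudaFree(device));
    CHECK(cudaFreeHost(pinned));  // pinned memory is released with cudaFreeHost
    free(pageable);
    return 0;
}
```

Pinned memory is a limited resource, since every pinned page is one the operating system can no longer reclaim, so it is usually reserved for buffers that are transferred often.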

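Using streams

A stream is a task-management channel built on top of a context; one context can create multiple streams.

Streams are controlled asynchronously.

Passing nullptr as the stream selects the default stream.

A stream has two parts: a pipe holding the queue of tasks to execute (the requests to "go buy a watermelon"), and an executor that works through them in order (the boyfriend who runs the errands).

Once work is added to a stream, the call returns immediately, so the host is not held up.

Streams are used with the asynchronous API functions, such as cudaMemcpyAsync.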
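The queue-and-executor picture can be sketched as follows: three pieces of work are enqueued on one stream, each call returns right away, and the host waits only once at the end. The add_one kernel, the sizes, and the CHECK macro are hypothetical names introduced for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: add 1 to every element.
__global__ void add_one(int *data, int count) {
    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= count) return;
    data[position] += 1;
}

int main() {
    const int count = 1024;
    const size_t bytes = count * sizeof(int);

    // Pinned host memory, so the async copies can overlap with host work.
    int *host = nullptr;
    CHECK(cudaMallocHost((void **)&host, bytes));
    for (int i = 0; i < count; ++i) host[i] = i;

    int *device = nullptr;
    CHECK(cudaMalloc((void **)&device, bytes));

    cudaStream_t stream = nullptr;
    CHECK(cudaStreamCreate(&stream));

    // Each call below only enqueues a task and returns immediately.
    // Tasks in the same stream run in the order they were enqueued.
    CHECK(cudaMemcpyAsync(device, host, bytes, cudaMemcpyHostToDevice, stream));
    add_one<<<count / 256, 256, 0, stream>>>(device, count);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(host, device, bytes, cudaMemcpyDeviceToHost, stream));

    // The host must wait for the stream before reading the results.
    CHECK(cudaStreamSynchronize(stream));

    printf("host[0] = %d, host[%d] = %d\n", host[0], count - 1, host[count - 1]);
    // prints: host[0] = 1, host[1023] = 1024

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(device));
    CHECK(cudaFreeHost(host));
    return 0;
}
```

Reading host before cudaStreamSynchronize returns would be a race, because the copy back may not have run yet.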

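Kernel functions:

__global__ marks a kernel: it is called from the host and runs on the GPU.

__device__ marks a device function: it is called from device code and runs on the GPU. It cannot be called from the CPU.

__host__ marks a host function: it is called from the host and runs on the CPU. This is the default for a function with no qualifier.

__shared__ marks a variable as living in shared memory, where it is shared by the threads of one block.

The host launches a kernel with function<<<gridDim, blockDim, sharedMemorySize, stream>>>(args...)

gridDim is the number of blocks in the grid and blockDim is the number of threads in each block; together they determine how many threads run. sharedMemorySize is the number of bytes of dynamically allocated shared memory per block, and stream is the stream the kernel is launched on.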
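The qualifiers and the launch syntax can be seen together in a small block-level reduction sum. This is a sketch under stated assumptions: the names sum_of_squares, square, blocks_for, kThreads, and CHECK are all made up for the example, and the tree reduction shown is one common approach rather than the author's implementation.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Threads per block. Must be a power of two for the halving loop below.
constexpr int kThreads = 256;

// __device__: runs on the GPU and can only be called from GPU code.
__device__ float square(float x) { return x * x; }

// __global__: a kernel, launched from the host and run on the GPU.
// Each block reduces its slice in shared memory, then adds its partial
// sum to *result.
__global__ void sum_of_squares(const float *input, int count, float *result) {
    __shared__ float cache[kThreads];  // one slot per thread in this block

    int position = blockDim.x * blockIdx.x + threadIdx.x;
    cache[threadIdx.x] = (position < count) ? square(input[position]) : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, cache[0]);
}

// __host__: an ordinary CPU function (the qualifier is optional).
__host__ int blocks_for(int jobs, int threads) {
    return (jobs + threads - 1) / threads;  // round up
}

int main() {
    const int count = 1000;
    float host_input[count];
    for (int i = 0; i < count; ++i) host_input[i] = 1.0f;

    float *input = nullptr;
    float *result = nullptr;
    CHECK(cudaMalloc((void **)&input, count * sizeof(float)));
    CHECK(cudaMalloc((void **)&result, sizeof(float)));
    CHECK(cudaMemcpy(input, host_input, count * sizeof(float),
                     cudaMemcpyHostToDevice));
    CHECK(cudaMemset(result, 0, sizeof(float)));

    // <<<gridDim, blockDim, dynamic shared memory bytes, stream>>>
    sum_of_squares<<<blocks_for(count, kThreads), kThreads, 0, nullptr>>>(
        input, count, result);
    CHECK(cudaGetLastError());

    float total = 0.0f;
    CHECK(cudaMemcpy(&total, result, sizeof(float), cudaMemcpyDeviceToHost));
    printf("sum of squares = %f\n", total);  // prints: sum of squares = 1000.000000

    CHECK(cudaFree(input));
    CHECK(cudaFree(result));
    return 0;
}
```

The shared array here has a fixed size, so the third launch parameter is 0; it is only needed when the kernel declares an extern __shared__ array whose size is chosen at launch time.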

cudaMalloc(void **devPtr, size_t byte_size) allocates memory on the GPU:

int *gpu_int;
cudaMalloc((void**)&gpu_int, sizeof(int));

This allocates sizeof(int) bytes of GPU memory, enough for a single int.

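cudaMemset(void *devPtr, int value, size_t byte_size) initializes device memory by setting each of the byte_size bytes to value.

For example, cudaMemset(gpu_int, 10, 10 * sizeof(int)); this requires gpu_int to point to at least 10 * sizeof(int) bytes.

Because the fill is per byte, every byte becomes 0x0A, so with 4-byte ints each of the 10 elements reads back as 0x0A0A0A0A (168430090), not 10. Only values such as 0, or -1 for all bits set, give the same value in every int.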
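cudaMemset writes its value into every byte rather than into every int, which is easy to check by reading the buffer back. The following is a minimal sketch that assumes 4-byte ints; the CHECK macro is a hypothetical helper, and the buffer is allocated with room for all 10 ints.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const int count = 10;
    const size_t bytes = count * sizeof(int);
    int cpu_int[count];

    int *gpu_int = nullptr;
    CHECK(cudaMalloc((void **)&gpu_int, bytes));  // room for all 10 ints

    // Every byte is set to 10 (0x0A), so each int becomes 0x0A0A0A0A.
    CHECK(cudaMemset(gpu_int, 10, bytes));
    CHECK(cudaMemcpy(cpu_int, gpu_int, bytes, cudaMemcpyDeviceToHost));
    printf("value 10 -> %d (0x%08X)\n", cpu_int[0], (unsigned)cpu_int[0]);
    // prints: value 10 -> 168430090 (0x0A0A0A0A)

    // Zero is the common case where the byte fill and the int value agree.
    CHECK(cudaMemset(gpu_int, 0, bytes));
    CHECK(cudaMemcpy(cpu_int, gpu_int, bytes, cudaMemcpyDeviceToHost));
    printf("value 0  -> %d\n", cpu_int[0]);
    // prints: value 0  -> 0

    CHECK(cudaFree(gpu_int));
    return 0;
}
```

To fill an int array with a value like 10, a small kernel that assigns each element is the usual alternative.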

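cudaMemcpy(void *dst, const void *src, size_t byte_size, kind) copies data between the CPU and the GPU. There are four commonly used copy kinds:

CPU to CPU: cudaMemcpyHostToHost

CPU to GPU: cudaMemcpyHostToDevice

GPU to CPU: cudaMemcpyDeviceToHost

GPU to GPU: cudaMemcpyDeviceToDevice

For example, to copy 10 ints from the GPU back to the CPU:

cudaMemcpy(cpu_int, gpu_int, 10 * sizeof(int), cudaMemcpyDeviceToHost);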
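As a quick illustration of the copy kinds, the sketch below sends an array on a round trip: host to device, device to device, and back to the host. The buffer names and the CHECK macro are assumptions for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const int count = 10;
    const size_t bytes = count * sizeof(int);

    int cpu_src[count];
    int cpu_dst[count] = {0};
    for (int i = 0; i < count; ++i) cpu_src[i] = i * i;

    int *gpu_a = nullptr;
    int *gpu_b = nullptr;
    CHECK(cudaMalloc((void **)&gpu_a, bytes));
    CHECK(cudaMalloc((void **)&gpu_b, bytes));

    // The last argument must match where dst and src actually live.
    CHECK(cudaMemcpy(gpu_a, cpu_src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(gpu_b, gpu_a, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaMemcpy(cpu_dst, gpu_b, bytes, cudaMemcpyDeviceToHost));

    for (int i = 0; i < count; ++i) printf("%d ", cpu_dst[i]);
    printf("\n");
    // prints: 0 1 4 9 16 25 36 49 64 81

    CHECK(cudaFree(gpu_a));
    CHECK(cudaFree(gpu_b));
    return 0;
}
```

Note the argument order: destination first, then source, the same as the C memcpy function.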

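Kernel launch parameters

jobs is the total number of threads needed.

threads is the number of threads in each thread block.

blocks is the number of blocks.

int jobs = dst_height * dst_width;

int threads = 256;

int blocks = ceil(jobs / (float)threads);

warpaffine_kernel<<<blocks, threads>>>(img_buffer_device, src_width * 3, src_width, src_height, dst, dst_width, dst_height, 128, d2s, jobs);

Because blocks is rounded up, more threads may be launched than there are jobs. Inside the kernel, each thread computes its global index and exits if the index is out of range (edge is presumably the kernel parameter that receives jobs):

int position = blockDim.x * blockIdx.x + threadIdx.x;

if (position >= edge) return;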
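The same launch pattern can be tried on a much simpler kernel than warp affine. The sketch below fills a width x height byte buffer with a constant, one thread per pixel. The fill_kernel name, the image size, and the CHECK macro are hypothetical; the block count uses integer round-up, which gives the same result as the ceil expression above without going through float.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: one thread per pixel, guarded by edge.
__global__ void fill_kernel(unsigned char *dst, unsigned char value, int edge) {
    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= edge) return;  // surplus threads in the last block do nothing
    dst[position] = value;
}

int main() {
    const int dst_width = 100;   // illustrative image size
    const int dst_height = 30;

    int jobs = dst_height * dst_width;            // 3000 threads needed
    int threads = 256;                            // threads per block
    int blocks = (jobs + threads - 1) / threads;  // round up: 12 blocks

    unsigned char *dst = nullptr;
    CHECK(cudaMalloc((void **)&dst, jobs));
    CHECK(cudaMemset(dst, 0, jobs));

    // 12 * 256 = 3072 threads are launched; the last 72 exit at the guard.
    fill_kernel<<<blocks, threads>>>(dst, 128, jobs);
    CHECK(cudaGetLastError());

    std::vector<unsigned char> host(jobs);
    CHECK(cudaMemcpy(host.data(), dst, jobs, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < jobs; ++i) {
        if (host[i] != 128) ++mismatches;
    }
    printf("blocks = %d, mismatches = %d\n", blocks, mismatches);
    // prints: blocks = 12, mismatches = 0

    CHECK(cudaFree(dst));
    return 0;
}
```

Without the guard, the 72 surplus threads would write past the end of the buffer.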
