CUDA Learning_WuYuFffan's Blog - CSDN Blog

CUDA Learning


CUDA from beginner to burial: a detailed tutorial for newcomers

Articles: 8 · Views: 2584 · Favorites: 2

Author: WuYuFffan


8. Reduction summation

8. Reduction summation. GPU accumulation: add adjacent pairs of elements (reduction). Pseudocode: for (int offset = 1; offset < blockDim.x; offset *= 2) { if (tid % (2 * offset) == 0) { input[tid] += input[tid + offset]; } } In the first iteration, offset is 1 and the stride is 2: input[0] += input[0 + 1]; input[2] += in…

Original · 2020-09-19 21:39:34 · 157 views · 0 comments
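The reduction loop in entry 8 can be turned into a complete kernel roughly as follows. This is a minimal sketch, not the author's code: the names `reduce_neighbored`, `partial`, and `gpu_sum` are hypothetical, and the `__syncthreads()` barrier, the bounds checks, and the final host-side summation of per-block totals are assumptions added so that the sketch works for any array length.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Neighbored-pair reduction: each block reduces its own slice of `input`
// in place and writes the block total to partial[blockIdx.x].
__global__ void reduce_neighbored(int *input, int *partial, int n) {
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    int *block_data = input + blockIdx.x * blockDim.x;

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        // Only threads at multiples of 2*offset add in their right neighbor.
        if (tid % (2 * offset) == 0 &&
            tid + offset < blockDim.x && gid + offset < n) {
            block_data[tid] += block_data[tid + offset];
        }
        __syncthreads();  // every thread finishes this round before the next
    }
    if (tid == 0 && gid < n) partial[blockIdx.x] = block_data[0];
}

// Hypothetical host wrapper: copies data over, launches the kernel,
// then adds up the per-block totals on the CPU.
long long gpu_sum(const std::vector<int> &h_in, int block_size = 128) {
    int n = static_cast<int>(h_in.size());
    if (n == 0) return 0;
    int grid = (n + block_size - 1) / block_size;

    int *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_partial, grid * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reduce_neighbored<<<grid, block_size>>>(d_in, d_partial, n);

    std::vector<int> h_partial(grid);
    cudaMemcpy(h_partial.data(), d_partial, grid * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    long long total = 0;
    for (int v : h_partial) total += v;
    return total;
}
```

Calling `gpu_sum` on a vector of 1000 ones should give 1000. Error checking is omitted to keep the sketch short.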
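7. Warp divergence

7. Warp divergence. CUDA executes in SIMT fashion, so when some threads of a warp take one branch, the threads not on that branch are forced to wait. int tid = threadIdx.x; if (tid % 2 == 0) { // do something } else { // do something else } While the even-tid threads execute the if branch, the odd-tid threads stall, and vice versa. So when the threads of one warp take many different branches, the result is warp divergence, which severely degrades the program's run…

Original · 2020-09-18 21:22:01 · 384 views · 0 comments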
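To make the contrast in entry 7 concrete, the sketch below places the even/odd branch next to a variant that branches on the warp index, so that all threads of a warp take the same path. The kernel names and the helper `run_branch_kernel` are hypothetical, and a single block of at most 1024 threads is assumed. For branches this small the compiler may use predication instead, so the sketch shows the code pattern rather than a guaranteed slowdown.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Divergent pattern: even and odd threads of the same warp disagree.
__global__ void branch_by_thread(float *out) {
    int tid = threadIdx.x;
    if (tid % 2 == 0) {
        out[tid] = 100.0f;
    } else {
        out[tid] = 200.0f;
    }
}

// Warp-aligned pattern: the condition is identical for all threads of a warp.
__global__ void branch_by_warp(float *out) {
    int tid = threadIdx.x;
    int warp_id = tid / warpSize;
    if (warp_id % 2 == 0) {
        out[tid] = 100.0f;
    } else {
        out[tid] = 200.0f;
    }
}

// Hypothetical helper: runs one of the two kernels in a single block
// and returns the output array (n_threads must not exceed 1024).
std::vector<float> run_branch_kernel(bool by_warp, int n_threads) {
    std::vector<float> h_out(n_threads, 0.0f);
    float *d_out = nullptr;
    if (cudaMalloc((void **)&d_out, n_threads * sizeof(float)) != cudaSuccess) {
        return h_out;
    }
    if (by_warp) {
        branch_by_warp<<<1, n_threads>>>(d_out);
    } else {
        branch_by_thread<<<1, n_threads>>>(d_out);
    }
    cudaMemcpy(h_out.data(), d_out, n_threads * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

The two kernels write different patterns on purpose: the first alternates per thread, the second per warp of `warpSize` threads.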
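6. CUDA warp

6. CUDA warp. In CUDA, a thread block runs on a single streaming multiprocessor (SM). When the SM has enough resources, several blocks can run on the same SM. SIMT (single instruction, multiple threads): one instruction is executed by multiple threads (the essence of CUDA). A thread block cannot be split across multiple SMs. When a block cannot fit on an SM (for example, when it needs more shared memory than is available), the kernel launch fails and the function returns a value other than cudaSuccess. Hardware structure corresponding to the program structure: why do warps exist? Thread parallelism in theory versus parallelism in practice…

Original · 2020-09-18 19:33:53 · 453 views · 0 comments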
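The launch-failure case mentioned in entry 6 can be observed with a sketch like the one below, which requests a chosen amount of dynamic shared memory and reports the launch status. The names `touch_shared` and `try_launch` are hypothetical, and the sketch assumes `shared_bytes` is at least `sizeof(int)` when the launch is expected to succeed.

```cuda
#include <cuda_runtime.h>

// Uses one int of dynamically sized shared memory.
__global__ void touch_shared(int *out) {
    extern __shared__ int buf[];
    if (threadIdx.x == 0) {
        buf[0] = 42;
        out[0] = buf[0];
    }
}

// Hypothetical helper: launches one block that requests `shared_bytes`
// of dynamic shared memory and returns the launch status.
cudaError_t try_launch(size_t shared_bytes) {
    int *d_out = nullptr;
    cudaError_t err = cudaMalloc((void **)&d_out, sizeof(int));
    if (err != cudaSuccess) return err;

    touch_shared<<<1, 32, shared_bytes>>>(d_out);
    err = cudaGetLastError();   // launch-configuration errors surface here
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return err;
}
```

A small request such as `try_launch(128)` should return `cudaSuccess`, while a request well above the `sharedMemPerBlock` value reported by `cudaGetDeviceProperties` should return some other error code.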
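5. Querying device properties

5. Querying device properties. In CUDA programming, querying device properties is an essential step in writing parallel programs that adapt to different compute capabilities. The table below lists properties that can be queried at run time through cuda_runtime.h: Property: Explanation; name: description; major/minor: compute capability (5.2 -> 5/2); totalGlobalMem: total size of global memory; maxThreadsPerBlock: maximum number of threads per block; maxThreadsDim[3]: block…

Original · 2020-09-18 15:21:13 · 186 views · 0 comments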
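The properties listed in entry 5 are fields of `cudaDeviceProp`, filled in by `cudaGetDeviceProperties`. A minimal sketch of reading them follows; the function name `print_device_properties` is hypothetical, and only the fields named in the excerpt are printed.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints a few fields of cudaDeviceProp for the given device index.
// Returns false if the query fails (for example, an invalid index).
bool print_device_properties(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    printf("name:               %s\n", prop.name);
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("totalGlobalMem:     %zu bytes\n", prop.totalGlobalMem);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim:      %d x %d x %d\n", prop.maxThreadsDim[0],
           prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return true;
}
```

Calling `print_device_properties(0)` queries the first device; `cudaGetDeviceCount` can be used beforehand to find out how many devices exist.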
4. Timing CUDA programs

4. Timing CUDA programs. Timing is done by taking a difference: clock_t start = clock(); /* workload */ clock_t end = clock(); difference = end - start; time = difference / CLOCKS_PER_SEC. Note: divide by a number that suits the program's actual run time. Timing on the CPU: // summation on the CPU clock_t cpu_start, cpu_end; cpu_start = clock(); sum_array…

Original · 2020-09-18 14:43:52 · 151 views · 0 comments
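The difference-of-clocks approach in entry 4 also applies to a kernel, with one caveat: kernel launches are asynchronous, so the host has to wait for the device before reading the end time. The sketch below assumes a hypothetical kernel `scale_by_two` and helpers `seconds_between` and `time_kernel_seconds`. It casts to `double` before dividing so that sub-second times are not truncated to zero.

```cuda
#include <cuda_runtime.h>
#include <ctime>

// Hypothetical workload: doubles every element.
__global__ void scale_by_two(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Converts a clock() difference to seconds without integer truncation.
double seconds_between(clock_t start, clock_t end) {
    return static_cast<double>(end - start) / CLOCKS_PER_SEC;
}

// Times one kernel launch with clock(); returns -1.0 if allocation fails.
double time_kernel_seconds(int n) {
    float *d_data = nullptr;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        return -1.0;
    }
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaDeviceSynchronize();

    int block = 256;
    int grid = (n + block - 1) / block;

    clock_t start = clock();
    scale_by_two<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();  // wait for the kernel before stopping the clock
    clock_t end = clock();

    cudaFree(d_data);
    return seconds_between(start, end);
}
```

Multiplying the result by 1000 gives milliseconds, which is usually the more readable unit for short kernels.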
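3. CUDA error handling

3. CUDA error handling. Kinds of errors: Compile-time errors: compilation fails; in Visual Studio this kind of error is flagged as soon as the code is mistyped. Run-time errors: in ordinary C++ programming, exception handling can be used to throw an exception, which is then caught with try/catch. Error handling in CUDA: cudaError cuda_function(…) return value: cudaSuccess if the kernel was launched s…

Original · 2020-09-18 14:43:40 · 768 views · 0 comments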
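Since CUDA runtime functions report errors through their return value rather than through exceptions, a common pattern for entry 3 is a small checking helper. The sketch below is one possible shape, not the author's version: `cuda_ok` and `CUDA_CHECK` are hypothetical names, while `cudaGetErrorString` is the runtime call that turns an error code into a message.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Returns true on cudaSuccess; otherwise prints where the error
// happened and what it means, and returns false.
bool cuda_ok(cudaError_t err, const char *file, int line) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "CUDA error at %s:%d: %s\n", file, line,
            cudaGetErrorString(err));
    return false;
}

// Wraps a runtime call so the file and line are filled in automatically.
#define CUDA_CHECK(call) cuda_ok((call), __FILE__, __LINE__)
```

A runtime call is wrapped as `CUDA_CHECK(cudaMalloc((void **)&ptr, bytes))`. A kernel launch returns nothing, so it is checked afterwards with `CUDA_CHECK(cudaGetLastError())`.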
2. CUDA example: adding two arrays

2. CUDA example: adding two arrays [Mermaid diagram]

Original · 2020-09-18 14:43:27 · 274 views · 0 comments
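1. CUDA memory transfer

1. CUDA memory transfer. cudaMemcpy(destination ptr, source ptr, size in bytes, direction); Purpose: copy data in the given direction, for example from the host to the device. cudaMalloc((void**)destination ptr, size in bytes); Purpose: allocate memory on the device. byte_size = size * sizeof(type): size is the number of elements in the array and type is the element type, so byte_size is the size in bytes; source…

Original · 2020-09-18 14:42:48 · 211 views · 0 comments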
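The two calls in entry 1 are typically used together: allocate a device buffer, copy host data into it, and later copy results back. The sketch below shows that round trip under simple assumptions; the function name `round_trip` is hypothetical, and no kernel runs between the two copies.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Copies a host array to the device and straight back again.
// Returns a zero-filled vector if the device allocation fails.
std::vector<int> round_trip(const std::vector<int> &h_src) {
    std::vector<int> h_dst(h_src.size(), 0);
    if (h_src.empty()) return h_dst;

    size_t byte_size = h_src.size() * sizeof(int);  // size in bytes

    int *d_buf = nullptr;
    if (cudaMalloc((void **)&d_buf, byte_size) != cudaSuccess) {
        return h_dst;
    }
    // Host -> device, then device -> host.
    cudaMemcpy(d_buf, h_src.data(), byte_size, cudaMemcpyHostToDevice);
    cudaMemcpy(h_dst.data(), d_buf, byte_size, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    return h_dst;
}
```

The direction argument must match the pointer order: the destination comes first in both copies, and only the `cudaMemcpyHostToDevice` / `cudaMemcpyDeviceToHost` flag changes.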
