GPU programming notes

cuda tutorial

These are some notes from my own learning process; I'm just a beginner.
Sorry that all the images are broken; I'll restore them bit by bit when I have time.
CUDA Programming Guide
NVIDIA Best Practices Guide
CS344 (Udacity)

  • got an overview of how to use the stereo camera, based on Yingcai's code

communication pattern

  • map
    One-to-one: every pixel runs the same function, e.g. ×2.
  • gather
    Many-to-one, e.g. image blurring.
  • scatter
    One thread tries to write to many memory locations; one-to-many, so writes may conflict.
  • stencil
    Seems to be a combination of the two above: gather from neighboring points.
  • transpose


memory model

Using global memory is much slower than using local and shared memory.

  • Variables defined in a kernel function are local variables.
  • Use shared memory: put frequently used data into shared memory! A small sketch follows.
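A minimal sketch (hypothetical kernel, assumes blockDim.x <= 256) showing the three kinds of variables: g_in/g_out live in global memory, tile[] in on-chip shared memory, and x is a per-thread local variable.

__global__ void demo(const float *g_in, float *g_out)
{
  __shared__ float tile[256];           // shared by all threads of the block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float x = g_in[i];                    // local variable (per thread), loaded from global memory
  tile[threadIdx.x] = x;                // stage the value in shared memory
  __syncthreads();
  g_out[i] = tile[threadIdx.x] * 2.0f;  // read it back from shared memory
}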

synchronize

__shared__ int array[128];
array[idx] = threadIdx.x;      // write from local to shared mem
__syncthreads();               // wait for every write op to complete
int temp = array[idx + 1];     // read from shared mem into a local variable
__syncthreads();               // wait for every read op to complete
array[idx] = temp;
__syncthreads();

atomic op

  • It is the equivalent of accumulating into a variable (as you would in shared memory), but it actually operates on global memory.
  • It will slow the application down.

g[i]=g[i]+1    ------>  atomicAdd(&g[i],1);

  • It can be used to implement push_back inside a kernel function (see the sketch below).
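A minimal sketch of the push_back idea (hypothetical kernel and buffer names): atomicAdd on a global counter hands every thread a unique output slot. out_len must be zero-initialized before the launch.

__global__ void collect_positive(const int *in, int n, int *out, int *out_len)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && in[i] > 0) {           // keep only the elements we want
    int pos = atomicAdd(out_len, 1);  // reserve a unique slot in the output
    out[pos] = in[i];                 // "push_back" into the output array
  }
}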

thrust

  • vector
thrust::device_vector<int> dv(100);
thrust::host_vector<int> hv(100, 25);
// cudaMemcpy() happens in the background
dv = hv;
  • Note: device_vector::push_back() cannot be used inside a kernel function.

reduce

  1. Add the back half of the elements to the front half within each block.
  2. Then add the back quarter of the remaining elements to the front quarter.
  3. Repeat until only one element is left in every block. A kernel sketch follows.
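A minimal kernel sketch of that halving scheme (a sketch, assuming blockDim.x is a power of two and the input length is a multiple of the block size):

__global__ void block_reduce(const float *g_in, float *g_out)
{
  extern __shared__ float sdata[];                // one float per thread, sized at launch
  int tid = threadIdx.x;
  sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // add the back half onto the front half
    if (tid < s)
      sdata[tid] += sdata[tid + s];
    __syncthreads();
  }
  if (tid == 0)
    g_out[blockIdx.x] = sdata[0];                 // one partial sum per block
}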

scan

  • Applications of Scan
  1. Stream Compaction
  2. Summed-Area Tables:
    A summed-area table (SAT) is a two-dimensional table generated from an input image in which each entry stores the sum of all pixels between the entry location and the lower-left corner of the input image. It is often used when applying a box filter to an image; a serial sketch follows.
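A serial sketch of building a SAT with two passes of inclusive scans (hypothetical function; on the GPU each pass would be a parallel scan, and the "origin" corner here is simply element (0,0) of the storage order):

void build_sat(const float *in, float *sat, int w, int h)
{
  for (int y = 0; y < h; ++y)            // inclusive scan along each row
    for (int x = 0; x < w; ++x)
      sat[y * w + x] = in[y * w + x] + (x > 0 ? sat[y * w + x - 1] : 0.0f);
  for (int x = 0; x < w; ++x)            // inclusive scan along each column of the result
    for (int y = 1; y < h; ++y)
      sat[y * w + x] += sat[(y - 1) * w + x];
}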

serial implementation

  • inclusive
    int acc = 0;
    for(int i = 0; i < ARRAY_SIZE; i++){
        acc = acc + elements[i];
        out[i] = acc;
    }

  • exclusive
    int acc = 0;
    for(int i = 0; i < ARRAY_SIZE; i++){
        out[i] = acc;
        acc = acc + elements[i];
    }

parallel implementation

Hillis and Steele


  • In practice, it should be executed within a single block that has at least as many threads as the array has elements.
  • Use a double buffer, or the results of one warp will be overwritten by threads in another warp.
__global__ void scan(float *g_odata, float *g_idata, int n)
{
  extern __shared__ float temp[]; // allocated on invocation
  int thid = threadIdx.x;
  int pout = 0, pin = 1;
  // Load input into shared memory.
  // This is exclusive scan, so shift right by one
  // and set first element to 0
  temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
  __syncthreads();
  for (int offset = 1; offset < n; offset *= 2)
  {
    pout = 1 - pout; // swap double buffer indices
    pin = 1 - pout;
    if (thid >= offset)
      temp[pout*n+thid] += temp[pin*n+thid - offset];
    else
      temp[pout*n+thid] = temp[pin*n+thid];
    __syncthreads();
  }
  g_odata[thid] = temp[pout*n+thid]; // write output
}
  • in/out: 2 buffers
  • d: step
  • offset=2^d

Blelloch

  • To do this we will use an algorithmic pattern that arises often in parallel computing: balanced trees. The idea is to build a balanced binary tree on the input data and sweep it to and from the root to compute the prefix sum. A binary tree with n leaves has d = log2(n) levels, and each level d has 2^d nodes. If we perform one add per node, then we will perform O(n) adds on a single traversal of the tree.
  • The algorithm consists of two phases: the reduce phase (also known as the up-sweep phase) and the down-sweep phase.
upsweep


__global__ void Bscan(unsigned int *g_odata, int *g_idata, int n) {
  extern __shared__ int temp[]; // allocated on invocation
  int thid = threadIdx.x;
  int offset = 1;
  temp[2 * thid] = g_idata[2 * thid]; // load input into shared memory
  temp[2 * thid + 1] = g_idata[2 * thid + 1];
  for (int d = n >> 1; d > 0; d >>= 1) // build sum in place up the tree
  {
    __syncthreads();
    if (thid < d) {
      int ai = offset * (2 * thid + 1) - 1;
      int bi = offset * (2 * thid + 2) - 1;
      temp[bi] += temp[ai];
    }
    offset *= 2;
  }
  if (thid == 0) {
    temp[n - 1] = 0;             // clear the last element
  }
  // (the kernel continues in the down-sweep section below)
  • size of temp if all intermediate values (the binary tree) are stored: 2^d + 2^(d-1) + … + 1 = 2^(d+1) - 1 ≈ 2n
  • it stores all the intermediate values of the binary tree
down sweep


  for (int d = 1; d < n; d *= 2) // traverse down tree & build scan
  {
    offset >>= 1;
    __syncthreads();
    if (thid < d) {
      int ai = offset * (2 * thid + 1) - 1;
      int bi = offset * (2 * thid + 2) - 1;
      int t = temp[ai];
      temp[ai] = temp[bi];
      temp[bi] += t;
    }
  }
  __syncthreads();
  g_odata[2 * thid] = temp[2 * thid]; // write results to device memory
  g_odata[2 * thid + 1] = temp[2 * thid + 1];
}
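A hedged launch sketch for the Bscan kernel above, assuming n is a power of two small enough for one block: each thread handles two elements, so n/2 threads and n ints of dynamic shared memory for temp[] (the tree is built in place over the input).

// d_out and d_in are hypothetical device buffers of length n
Bscan<<<1, n / 2, n * sizeof(int)>>>(d_out, d_in, n);
cudaDeviceSynchronize();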

segment scan

A long array is split into many small arrays, and a scan is run inside each of them.
It is used for sparse matrix–dense vector multiplication.
PageRank works on an n×n matrix in which only entries for linked pages are nonzero.
In that case the sparse matrix can be represented in CSR format and processed with a segmented scan; a small CSR example follows.
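A small worked CSR example (values chosen arbitrarily): value[] holds the nonzeros row by row, col_idx[] their column indices, and row_ptr[] where each row starts in value[].

//     | 1 0 2 |        value   = {1, 2, 3, 4, 5}
// A = | 0 3 0 |  --->  col_idx = {0, 2, 1, 0, 2}
//     | 4 0 5 |        row_ptr = {0, 2, 3, 5}
float value[]   = {1, 2, 3, 4, 5};
int   col_idx[] = {0, 2, 1, 0, 2};
int   row_ptr[] = {0, 2, 3, 5};   // row i occupies value[row_ptr[i] .. row_ptr[i+1]-1]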

sort

odd-even sort

A parallel version of bubble sort. A kernel sketch of one phase follows the list below.

  • work: O(n^2)
  • step: O(n)
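A minimal sketch of one phase (hypothetical kernel; the host launches it n times, alternating the phase, which also provides the global synchronization between phases):

__global__ void odd_even_phase(int *a, int n, int phase)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int i = 2 * idx + (phase & 1);       // even phase: pairs (0,1),(2,3)...; odd phase: (1,2),(3,4)...
  if (i + 1 < n && a[i] > a[i + 1]) {  // compare-and-swap a neighbouring pair
    int t = a[i];
    a[i] = a[i + 1];
    a[i + 1] = t;
  }
}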

merge sort

Merge sort: 3 main stages.
Stage 2 uses one thread block.
Note that in stage 3 only one thread does the work and many SMs would sit idle, so we break the two lists into sub-lists to achieve parallelism.

  • work: O(n*logn)
  • step: O(logn)

sorting networks

Bitonic sort: the amount of computation is oblivious to the input content. Whether the array is random, already sorted, or reversed, the cost is the same.

radix sort

Start from the LSB (least significant bit): put the elements whose current bit is 0 at the front and those whose bit is 1 at the back, and repeat until the most significant bit has been processed.
Each bit-wise pass is really a compact, i.e. a scan; a serial sketch of one pass follows.
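A serial sketch of a single bit pass (hypothetical function); on the GPU the output positions would come from an exclusive scan of the per-element flags.

void radix_pass(const unsigned int *in, unsigned int *out, int n, int bit)
{
  int zeros = 0;
  for (int i = 0; i < n; ++i)                  // count elements whose bit is 0
    if (((in[i] >> bit) & 1u) == 0) ++zeros;
  int pos0 = 0, pos1 = zeros;                  // 0-elements go first, 1-elements after them
  for (int i = 0; i < n; ++i) {
    if (((in[i] >> bit) & 1u) == 0) out[pos0++] = in[i];
    else                            out[pos1++] = in[i];
  }
}
// repeating this for every bit from the LSB upward sorts the array (each pass is stable)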

quick sort

It is inherently a recursive algorithm: pick a pivot, split the array into three arrays (<, =, >), then keep picking pivots within those arrays. (The missing figure showed a non-recursive implementation.) Dynamic parallelism can be used to implement the recursion.

optimization

Several factors determine speed:

  1. The algorithm, measured by its arithmetic complexity
  2. Basic principles, e.g. a cache-aware implementation
  3. Optimizations specific to the platform architecture
  4. Micro-optimizations, e.g. the fast inverse square root

analysis

First check whether your code makes good use of the memory bandwidth.
Run deviceQuery to compute the GPU's theoretical bandwidth (from the memory clock and bus width), then check whether your kernel fully utilizes it.
A shortfall is very likely due to non-coalesced access. Coalescing means that threads with adjacent threadIdx.x should access adjacent elements; otherwise, with a large stride, a large part of every memory transaction is wasted. A small sketch follows.
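A minimal sketch contrasting the two access patterns (hypothetical kernels):

__global__ void coalesced_copy(float *out, const float *in)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i];          // adjacent threads touch adjacent elements: one wide transaction
}

__global__ void strided_copy(float *out, const float *in, int stride)
{
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  out[i] = in[i];          // with a large stride most of each transaction is wasted
}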

  • LITTLE'S LAW
    Bytes in flight = memory latency × bandwidth, so enough independent threads/requests are needed to keep the memory bus busy and hide latency.
  • Ways to lower latency when using shared memory (the full list was in a lost figure):
    One tip is to use 16×16 threads per block instead of 32×32, because with 32×32 many threads may end up waiting for the others.
    Another is that an SM holds many blocks, so while one block is waiting at __syncthreads(), the other blocks can keep running.

thread divergence

Threads within a warp diverge because of if/else branches and the warp has to execute the paths one after another. How much slower it gets mainly depends on how many pieces a warp is split into. A small illustration follows.
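A small illustration (hypothetical kernel): the two halves of each warp take different branches, so the warp runs both paths serially.

__global__ void divergent(float *x)
{
  int i = threadIdx.x;
  if (i % 2 == 0)
    x[i] = x[i] * 2.0f;    // half of the warp runs this branch...
  else
    x[i] = x[i] + 1.0f;    // ...then the other half runs this one
}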

some ninja methods

  1. Numeric literals default to double, so appending f (e.g. 2.5f) is faster.
  2. Use intrinsics; a small sketch follows.
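A tiny sketch of both tips (hypothetical kernel): the f suffix keeps the arithmetic in single precision, and __expf / __fdividef are fast intrinsic versions of expf and division.

__global__ void ninja(float *x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i] = 2.5f * __expf(x[i]) + __fdividef(1.0f, x[i] + 1.0f);
}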
device query

5.14: take a look at your hardware
./deviceQuery
1060: 10 SM* 192, 1024 threads/1 block
2060: 30 SM* 64, 1024 threads/1 block
TX2: 2 SM*64, 1024
XAVIER NX: 6SM * 64, 1024

pinned memory

Using cudaMallocHost (page-locked host memory) makes CPU-to-GPU copies faster. It can be used for streaming the hash table. A small sketch follows.
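A minimal sketch (N and the stream are placeholders): pinned host memory is allocated with cudaMallocHost, freed with cudaFreeHost, and it allows the copy to be truly asynchronous.

float *h_buf, *d_buf;
cudaMallocHost(&h_buf, N * sizeof(float));       // page-locked (pinned) host allocation
cudaMalloc(&d_buf, N * sizeof(float));
cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                cudaMemcpyHostToDevice, stream); // asynchronous H2D copy
// ... launch kernels that use d_buf on the same stream ...
cudaStreamSynchronize(stream);                   // wait before reusing or freeing h_buf
cudaFreeHost(h_buf);                             // pinned memory has its own free call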

stream

Streams let different kernels run at the same time. This only works if the two kernels have no dependency on each other.
Kernels issued to two different streams s1 and s2 are independent of each other, so they also run concurrently.
Be careful not to create conflicting situations between streams (the missing figure showed an example).
The main use of streams: when a big pile of data cannot all be processed in one kernel launch, copy it piece by piece, e.g. one half is being copied while the other half is being processed, so that data transfer and processing overlap. A sketch follows.
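A hedged sketch of that chunking idea (hypothetical names h_in, d_in, process, chunk, num_chunks; h_in must be pinned for the async copies to overlap): alternating two streams lets the copy of one chunk overlap with the processing of the previous one.

cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
  cudaStreamCreate(&s[i]);
for (int c = 0; c < num_chunks; ++c) {
  cudaStream_t st = s[c % 2];                    // alternate between the two streams
  cudaMemcpyAsync(d_in + c * chunk, h_in + c * chunk,
                  chunk * sizeof(float), cudaMemcpyHostToDevice, st);
  process<<<blocks, threads, 0, st>>>(d_in + c * chunk, chunk);
}
cudaDeviceSynchronize();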

list ranking

List ranking turns a linked list into an array by numbering every element.
It trades more work (n log n) for fewer steps (log n).
The essential idea comes from finding the last element of the linked list: every element chases the last element, and then nodes are woken up level by level starting from node 0; e.g. we first wake 5 (the worked example was in a lost figure). A pointer-jumping sketch follows.
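The course describes a wake-up formulation; a standard way to get the same O(n log n) work / O(log n) step trade-off is pointer jumping (Wyllie's algorithm), sketched here with double buffering to avoid races (hypothetical kernel; next[i] == -1 marks the tail, and rank[] starts at 1 for every node, 0 for the tail).

__global__ void jump(const int *next_in, const int *rank_in,
                     int *next_out, int *rank_out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int nx = next_in[i];
  if (nx != -1) {
    rank_out[i] = rank_in[i] + rank_in[nx]; // add the rank of the node we skip over
    next_out[i] = next_in[nx];              // jump: point two hops ahead
  } else {
    rank_out[i] = rank_in[i];
    next_out[i] = -1;
  }
}
// the host launches this about log2(n) times, swapping the in/out buffers after each launch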

cuckoo hashing

Chaining is bad for parallelism.
Cuckoo hashing inserts by kicking out items that are already in the hash table.
There is some chance that, even after trying every hash function, an item still cannot be placed into any table; after a certain number of iterations the only option is to switch to new hash functions and rebuild.
During lookup, every hash function may have to be tried.

  • Note:
    The write-in and kick-out must be one atomic operation (atomicExch); a sketch follows.
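A hedged sketch of the insert (hash1/hash2 and the empty-slot value 0 are assumptions): atomicExch does the write-in and kick-out in one atomic step, and the evicted key keeps moving to its other possible slot.

__device__ unsigned int hash1(unsigned int k, int cap) { return (k * 2654435761u) % cap; }
__device__ unsigned int hash2(unsigned int k, int cap) { return ((k ^ 0x9e3779b9u) * 40503u) % cap; }

__device__ bool cuckoo_insert(unsigned int *table, int capacity,
                              unsigned int key, int max_iter)
{
  unsigned int slot = hash1(key, capacity);
  for (int it = 0; it < max_iter; ++it) {
    key = atomicExch(&table[slot], key);    // place the key, take the old occupant
    if (key == 0)
      return true;                          // the slot was empty: done
    unsigned int h1 = hash1(key, capacity); // send the evicted key to its other slot
    slot = (h1 != slot) ? h1 : hash2(key, capacity);
  }
  return false;   // too many evictions: rebuild the table with new hash functions
}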

dynamic parallelism

It makes recursion and nesting possible!

Notes
  1. Every thread in a block would launch a child grid, so use threadIdx.x to restrict which threads launch; see the sketch below.
  2. Streams and events belong to a particular block and cannot be passed to other blocks or to child blocks. I don't fully understand this yet; need to revisit Lesson 5.
  3. Shared memory is also private and cannot be passed to a child block; the child block runs in a different grid!
    First time I learned that malloc() can even be called inside a kernel…
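A tiny sketch of note 1 (hypothetical kernels; dynamic parallelism needs compute capability 3.5+ and compiling with -rdc=true): only thread 0 of each block launches the child grid.

__global__ void child_kernel(int *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;                  // placeholder work
}

__global__ void parent_kernel(int *data, int n)
{
  if (threadIdx.x == 0)                     // otherwise every thread would launch a child grid
    child_kernel<<<(n + 255) / 256, 256>>>(data, n);
}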
  • Quicksort's pain points are the same as BFS's:
  1. After every kernel finishes, GPU information (output_len, is_change) has to be copied back to the CPU.
  2. The computation proceeds in waves, and short waves have to wait for long ones.
    This can be combined with CUDA streams.

matrix multiply

Sparse matrices use the CSR format; here x is a dense column vector. A kernel sketch follows.
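A hedged sketch of CSR sparse-matrix × dense-vector multiply with one thread per row (value/col_idx/row_ptr are the usual CSR arrays, as in the example in the segment scan section):

__global__ void spmv_csr(const int *row_ptr, const int *col_idx, const float *value,
                         const float *x, float *y, int num_rows)
{
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    float sum = 0.0f;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
      sum += value[j] * x[col_idx[j]];     // nonzero times the matching entry of x
    y[row] = sum;
  }
}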

cudaMallocPitch and cudaMemcpy2D

When accessing 2D arrays in CUDA, memory transactions are much faster if each row is properly aligned…

Assuming that we want to allocate a 2D padded array of floating point (single precision) elements:

cudaMallocPitch(&devPtr, &devPitch, Ncols * sizeof(float), Nrows);

where

  • devPtr is an output pointer to float (float *devPtr);
  • devPitch is a size_t output variable denoting the length, in bytes, of the padded row;
  • Nrows and Ncols are size_t input variables representing the matrix size.

cudaMallocPitch will allocate a memory space of size, in bytes, equal to Nrows * pitch. However, only the first Ncols * sizeof(float) bytes of each row will contain the matrix data.
Accordingly, cudaMallocPitch consumes more memory than strictly necessary for the 2D matrix storage, but this is repaid by more efficient memory accesses.

CUDA provides also the cudaMemcpy2D function to copy data from/to host memory space to/from device memory space allocated with cudaMallocPitch.

cudaMemcpy2D(devPtr, devPitch, hostPtr, hostPitch, Ncols * sizeof(float), Nrows, cudaMemcpyHostToDevice)

where

  • devPtr and hostPtr are input pointers to float (float *devPtr and float *hostPtr) pointing to the (destination) device and (source) host memory spaces, respectively;
  • devPitch and hostPitch are size_t input variables denoting the length, in bytes, of the padded rows for the device and host memory spaces, respectively;
  • Nrows and Ncols are size_t input variables representing the matrix size.
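A hedged kernel-side sketch of how to index a row of the pitched allocation (hypothetical kernel): row r starts pitch bytes after row r-1, so the row pointer is computed in bytes.

__global__ void scale_pitched(float *devPtr, size_t pitch, int Nrows, int Ncols)
{
  int r = blockIdx.y * blockDim.y + threadIdx.y;
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (r < Nrows && c < Ncols) {
    float *row = (float *)((char *)devPtr + r * pitch); // rows are pitch bytes apart
    row[c] *= 2.0f;                                     // element (r, c)
  }
}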

other operators

__ldg

Optimize loads by going through the read-only data cache.

refer: cuda sheet
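A tiny sketch (hypothetical kernel; needs compute capability 3.5+): __ldg loads through the read-only data cache, so the data must not be written by the kernel while it runs.

__global__ void copy_ldg(float *out, const float * __restrict__ in, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = __ldg(&in[i]);   // read-only cached load
}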

cudaMemcpyAsync

Make use of streams; the copy engine and the kernel engine can work concurrently.

cudaMallocManaged

Data on the host and device can share the same pointer. It may be slower than cudaMalloc. A small sketch follows.
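A minimal sketch (N and the kernel are placeholders): the same pointer is used on both sides, with a synchronize before the host reads the results.

float *data;
cudaMallocManaged(&data, N * sizeof(float));
for (int i = 0; i < N; ++i) data[i] = 1.0f;   // host writes through the pointer
my_kernel<<<(N + 255) / 256, 256>>>(data, N); // device uses the very same pointer
cudaDeviceSynchronize();                      // wait before the host touches the data again
cudaFree(data);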

ballot, bfind

(translated from the CUDA Programming Guide)

  • create a bit mask in a 32-bit register using the GPU ballot instruction.
  • use the bfind PTX intrinsic to get the location of the first nonzero bit.

how to optimize

warp

32 threads form a warp; they execute together in hardware.

  • How do multiple warps parallelize?
    Use computation (green in the missing figure) to hide latency (white).
memory access pattern

So voxel hashing is not a good access pattern:

  1. a voxel is not a native word length
  2. the accesses are not aligned and not coalesced (essentially random)
shared mem
bank conflict

Threads are grouped into warps by their linear id: 0–31 form the first warp and 32–63 the second.
Shared memory is divided into banks: there are 32 banks, matching one warp of threads, and within the first 32 words each +1 in the linear id moves to the next bank. This matters because when threads of the same warp access different addresses that fall in the same bank, the accesses conflict and are serialized.
What if a conflict happens?
You can pad one extra column on the right edge of the shared-memory tile; this column is purely a placeholder and takes no part in I/O, so the threads of a warp are staggered onto different banks. See the CSDN article on shared memory for details; a small sketch follows.
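A minimal sketch of the padding trick: with 33 columns instead of 32, threads of a warp that walk down a column land in different banks, so the column accesses no longer conflict.

__shared__ float tile[32][33];   // the 33rd column is padding only and is never read back
// tile[threadIdx.x][threadIdx.y] and tile[threadIdx.y][threadIdx.x] are now both conflict-free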

GDB, CUDA-GDB, CUDA MEMCHECK

enter gdb / cuda-gdb

  • To debug an ordinary C++ project: if the executable binary after compilation is called "detect", just type
gdb ./detect

in terminal.

to run a node in cuda-gdb:

rosrun --prefix "cuda-gdb --args" edt edt_node 

to run a node under gdb via roslaunch:

roslaunch --prefix "gdb --args" edt/launch edt.launch

or you can write in launch file:

launch-prefix= "xterm -e gdb --args"
  • Caution: to use gdb, please set
set(ROS_BUILD_TYPE Debug)
set(CMAKE_BUILD_TYPE Debug)

in CMakeLists.txt.

  • When debugging CUDA with cuda-gdb, you should pass debug flags to the nvcc compiler as well.
SET(CUDA_NVCC_FLAGS "-g ;-G ;-arch=sm_60" CACHE STRING "nvcc flags" FORCE)

enter cuda-memcheck

  • standalone:
rosrun --prefix "cuda-memcheck " edt edt_node

Somehow, I cannot set parameters like --continue and --leak-check.

  • integrate with cuda-gdb
(cuda-gdb) set cuda memcheck on

cmds

gdb quick start

set breakpoints

  • break main
  • b main.cpp:14
  • b kernel.cu:58 if threadIdx.x==8
    It is a conditional breakpoint.

disable breakpoint

  • disable
  • delete breakpoints

watch variables

  • p (var): print value
  • p a=1 : set a=1
  • info locals : print all local vars
  • p *array@len

other

  • n: next
  • l: list code
  • [Enter]: repeat the last command
  • q: quit
  • r:run or restart
  • c: continue
  • s: step one statement; if it is a function call, step into it.