Study Note: Global memory optimisation in CUDA programming

Global memory storage pattern

Global memory on the GPU stores multi-dimensional arrays in row-major order, because there are no true two-dimensional arrays in global memory: a matrix must be linearised into a one-dimensional array. Using a matrix as an example [1], element (row, col) of a matrix of width `width` lives at linear offset row*width + col.
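A minimal sketch of this linearisation (the function and matrix names are illustrative, not from [1]):

/*------row-major indexing sketch----------*/

// A height x width matrix is stored as a flat 1D array in global memory;
// consecutive elements of one row sit at consecutive addresses.
__device__ __host__ inline float getElement(const float* M, int row, int col, int width){
    return M[row*width + col];   // element (row, col) lives at offset row*width + col
}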



Background knowledge on warps in CUDA [1]:


Thread blocks are partitioned into warps based on thread indices. If a thread block is organised as a one-dimensional array (i.e., only threadIdx.x is used), the partition is straightforward: the threadIdx.x values within a warp are consecutive and increasing. For a warp size of 32, warp 0 starts with thread 0 and ends with thread 31, and warp 1 starts with thread 32 and ends with thread 63. In general, warp n starts with thread 32n and ends with thread 32(n + 1) − 1. For a block whose size is not a multiple of 32, the last warp is padded with extra threads to fill up the 32 slots; for example, a block of 48 threads is partitioned into 2 warps, and its warp 1 is padded with 16 extra threads. (Such padding wastes execution slots, so it is best avoided in CUDA programming.)


If threadIdx has two dimensions, threads are linearised with threadIdx.x varying fastest: a warp is filled with consecutive threadIdx.x values first, and only moves on to the next threadIdx.y value once a full row of x values is exhausted.
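A minimal sketch of this partition rule (the kernel and variable names are illustrative):

/*------warp partition sketch----------*/

#include <cstdio>

__global__ void whichWarp(){
    // threads are linearised with threadIdx.x varying fastest
    int linearId = threadIdx.y * blockDim.x + threadIdx.x;
    int warpId   = linearId / 32;   // warp n covers linear ids 32n .. 32(n+1)-1
    int laneId   = linearId % 32;   // position of this thread within its warp
    printf("linear id %d -> warp %d, lane %d\n", linearId, warpId, laneId);
}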



Global memory coalescing (useful for matrix multiplication, which represents the case where the data being loaded is one-dimensional):


When communicating with global memory, you need to consider memory coalescing to increase efficiency, because global memory is always accessed in transactions of 32, 64, or 128 bytes. If a load from global memory needs less data than the transaction it triggers, part of the global-memory bandwidth is wasted.


The way to exploit this feature is to have threads with consecutive threadIdx.x values (specifically, threads within the same warp) load consecutive addresses in global memory. The hardware can then coalesce those loads into a small number of wide transactions.
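A minimal sketch contrasting the two patterns (the kernel and array names are illustrative):

/*------coalescing sketch----------*/

__global__ void accessPatterns(const float* in, float* out, int n, int stride){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // coalesced: consecutive threads of a warp read consecutive addresses,
    // so the warp's 32 loads collapse into a few wide transactions
    float good = in[i];

    // uncoalesced: consecutive threads read addresses `stride` elements apart,
    // so each load may trigger its own transaction and bandwidth is wasted
    float bad = in[((long long)i * stride) % n];

    out[i] = good + bad;
}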


Physically speaking, older CUDA devices (compute capability 1.x) actually operated on half-warps: each memory transaction was issued for 16 threads at a time, so it was enough to guarantee that each group of 16 consecutive threads read consecutive global-memory addresses. On more recent hardware, coalescing is evaluated per full 32-thread warp.


Matrix Multiplication Example: 


Technically speaking, to achieve global memory coalescing we only need to fetch data from global memory with consecutive threads reading consecutive columns of the same row. However, matrix multiplication forces some inefficient access: when we load a column of a matrix, different threads touch different rows, which cannot be coalesced. We cannot avoid this entirely, but we can compensate with blocking (tiling) into shared memory. Shared memory does not require coalescing, because its latency and bandwidth are good enough regardless of access pattern. With blocking, we can arrange the copy from global memory into shared memory so that it is coalesced; once the data is in shared memory, coalescing no longer matters.


/*------matrix mul----------*/

#define tile_Width 16   // tile size; width is assumed to be a multiple of tile_Width

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_R, int width){
    // create the shared memory tiles (analogous to the cudaMalloc step for global memory)
    __shared__ float ms[tile_Width][tile_Width];
    __shared__ float ns[tile_Width][tile_Width];

    // get the intrinsic variables of the thread
    int blockX = blockIdx.x;
    int blockY = blockIdx.y;
    int threadX = threadIdx.x;
    int threadY = threadIdx.y;

    // d_M uses blockIdx/threadIdx to calculate the row of the output element
    int row = blockY * tile_Width + threadY;
    // d_N uses blockIdx/threadIdx to calculate the column of the output element
    int column = blockX * tile_Width + threadX;

    // accumulate the dot product in a register variable
    float rValue = 0;

    /*-----transfer the data from global memory to shared memory, one tile at a time-----*/
    for (int i = 0; i < width/tile_Width; i++){
        // each thread loads one element per tile; consecutive threadX values read
        // consecutive addresses, so both loads are coalesced
        ms[threadY][threadX] = d_M[row*width+i*tile_Width+threadX];
        ns[threadY][threadX] = d_N[(i*tile_Width+threadY)*width+column];
        __syncthreads();   // wait until the whole tile is in shared memory

        // compute the partial dot product with the current tile
        for (int k = 0; k < tile_Width; k++){
            rValue += ms[threadY][k] * ns[k][threadX];
        }
        __syncthreads();   // ensure all threads are done before the tile is overwritten
    }

    d_R[row*width+column] = rValue;
}

In general [2], the indexing works as follows.

Understanding the code: as in the non-tiled matrix multiplication, int row is the row coordinate of the target output element, and int column is its column coordinate. When filling shared memory, following the access pattern of matrix multiplication, the M tile reads different columns of the same row, so the varying column coordinate is threadIdx.x; the N tile reads different rows of the same column, so threadIdx.y appears in the row coordinate.


Since threads within a warp differ first in threadIdx.x (consecutive threads share the same threadIdx.y and have consecutive threadIdx.x), both the d_M access and the d_N access read consecutive global-memory addresses within a warp, so both satisfy global memory coalescing.


Global memory storage: SOA vs AOS schema (useful for kmeans, which represents the case where the data being loaded is two-dimensional):


Each schema has its own benefits; which one is better depends on the application.


Take the example of calculating the Euclidean distance:


When we are computing the distance from a large set of points to the origin, it is usually better to choose SOA (structure of arrays): each kind of coordinate is stored at consecutive addresses in global memory [2], as sketched below.
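A minimal sketch of the SOA layout and the coalesced access it enables (the struct, kernel, and field names are illustrative, not from [2]):

/*------SOA sketch----------*/

// SOA: one array per coordinate; point i is (xs[i], ys[i]).
struct PointsSOA {
    float* xs;   // all x coordinates, contiguous in global memory
    float* ys;   // all y coordinates, contiguous in global memory
};

__global__ void distToOriginSOA(PointsSOA p, float* dist, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // consecutive threads read consecutive xs, then consecutive ys: coalesced
    float x = p.xs[i];
    float y = p.ys[i];
    dist[i] = sqrtf(x*x + y*y);
}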


 


Because this layout lets the available threads process different points in parallel: in one step the warp fetches all the x coordinates together, and in the next step all the y coordinates together.


However, if we only compute distances for a few points out of a big set, it is better to use AOS (array of structs) [2], as sketched below.
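A minimal sketch of the AOS layout (the struct and function names are illustrative, not from [2]):

/*------AOS sketch----------*/

// AOS: one struct per point; the x and y of a single point are adjacent in memory.
struct Point {
    float x;
    float y;
};

// Fetching one point touches a single contiguous 8-byte record, so a thread that
// needs only a few scattered points does not sweep past unrelated coordinates.
__device__ float distToOriginAOS(const Point* pts, int i){
    Point p = pts[i];
    return sqrtf(p.x*p.x + p.y*p.y);
}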




Because this way, fetching a single point does not require stepping over as many unneeded values as SOA would: the coordinates of that point sit next to each other in one record.


Global memory pre-fetching (Double buffering)


When we look deeper into the matrix_mul code, we can see that the transfer from global memory to shared memory is done by this statement:


ms[threadY][threadX] = d_M[row*width+i*tile_Width+threadX];

Actually, this statement compiles into two machine instructions: first, load the value of d_M[row*width+i*tile_Width+threadX] from global memory into a register; then store the value from the register into shared memory. There are no other instructions between them. Notice that there is also a barrier synchronisation in the next statement:


__syncthreads();

Therefore, every block must wait until all of its threads have finished the transfer. A block loading its current tile waits a long time before it can compute on that tile, which leaves the floating-point units idle.
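Roughly, at the instruction level the transfer looks like this (a PTX-flavoured sketch, not actual compiler output):

// ld.global.f32  %f1, [addr of d_M element];   // long-latency load into a register
// st.shared.f32  [addr of ms element], %f1;    // fast store from register to shared memory
// bar.sync 0;                                  // __syncthreads(): wait for every thread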


Therefore we introduce a technique called pre-fetching, also known as double buffering. It uses twice as many registers and twice as much shared memory; however, it solves the overhead problem described above [1].
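A minimal sketch of the pre-fetched version of the tiled kernel above (the register names mReg/nReg are illustrative, and this is the standard pattern rather than the book's verbatim code):

/*------matrix mul with pre-fetching----------*/

__global__ void MatrixMulKernelPrefetch(float* d_M, float* d_N, float* d_R, int width){
    __shared__ float ms[tile_Width][tile_Width];
    __shared__ float ns[tile_Width][tile_Width];

    int row    = blockIdx.y * tile_Width + threadIdx.y;
    int column = blockIdx.x * tile_Width + threadIdx.x;

    // pre-fetch the first tile from global memory into registers
    float mReg = d_M[row*width + threadIdx.x];
    float nReg = d_N[threadIdx.y*width + column];

    float rValue = 0;
    int numTiles = width / tile_Width;

    for (int i = 0; i < numTiles; i++){
        // deposit the pre-fetched tile from registers into shared memory (fast)
        ms[threadIdx.y][threadIdx.x] = mReg;
        ns[threadIdx.y][threadIdx.x] = nReg;
        __syncthreads();

        // issue the long-latency loads for the NEXT tile before computing, so the
        // global-memory latency overlaps with the compute loop below
        if (i + 1 < numTiles){
            mReg = d_M[row*width + (i+1)*tile_Width + threadIdx.x];
            nReg = d_N[((i+1)*tile_Width + threadIdx.y)*width + column];
        }

        for (int k = 0; k < tile_Width; k++){
            rValue += ms[threadIdx.y][k] * ns[k][threadIdx.x];
        }
        __syncthreads();
    }

    d_R[row*width + column] = rValue;
}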


 

Notice that this breaks the transfer into two separate steps, and the long-latency global load for the next tile is issued before the compute loop over the current tile. A thread no longer stalls on its global-memory load at the moment the data is needed: the load for the next tile is already in flight while the current tile is being computed, so threads can proceed to compute as soon as the shared-memory tile is ready.


Depositing the data from a register into shared memory is very fast. The cost of this technique is that it consumes twice the registers and shared memory, and using additional registers can reduce the number of blocks that can run on an SM; however, the technique can still win if it significantly reduces the amount of time each thread waits for its global-memory load data [1].


References:

[1] D. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach.

[2] J. Chong and I. Lane, CMU 18-645: How to Write Fast Code.
