Tips for speeding up your algorithm in CUDA programming

Reposted: 2013-12-03 10:01:29
There are a number of things you can do to speed up your algorithm in CUDA programming:

1.) Try to attain 75% to 100% occupancy for every kernel execution.

This can be achieved by tuning the number of registers used by the kernel and the number of threads per block. We need to figure out the optimum register count per thread for the target device.
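One way to cap register usage per kernel is the `__launch_bounds__` qualifier; a minimal sketch (the block size 256 and the minimum of 4 resident blocks are illustrative values, not tuned numbers):

```cuda
// Sketch: capping register usage to raise occupancy.
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells the compiler to limit register use so that at least the
// requested number of blocks can be resident per multiprocessor.
__global__ void __launch_bounds__(256, 4)
scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
```

Alternatively, `nvcc --maxrregcount=N` caps registers for every kernel in the file; verify the effect with the CUDA Occupancy Calculator, since fewer registers can also mean more spills.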

2.) Avoid host-to-device and device-to-host memory transfers. Try to minimize memory fetch operations so that the local cache need not be refreshed frequently. Host-to-device transfer bandwidth is around 4 GB/s, while device-to-device bandwidth is around 76.5 GB/s, so do more computation on the GPU rather than shuttling data back and forth between device and host.
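In practice this means chaining kernels on data that stays resident on the device. A sketch, where `stepA` and `stepB` are hypothetical kernels:

```cuda
// Sketch: keep intermediate results on the device instead of
// copying back to the host after each stage.
float *d_buf;
cudaMalloc(&d_buf, n * sizeof(float));
cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

stepA<<<blocks, threads>>>(d_buf, n);   // result stays in d_buf
stepB<<<blocks, threads>>>(d_buf, n);   // consumes it directly

// Only the final result crosses the (slow) host-device bus.
cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_buf);
```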

3.) Store runtime variables in registers for the fastest instruction execution.

4.) Do not use local arrays in your code, like int a[3] = {1, 2, 3}; better use scalar variables such as a0 = 1; a1 = ... etc. if possible. Small arrays (of size 4 or less) are kept in registers by default, but only while all indices are known at compile time.
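To illustrate the difference, a sketch of the two styles; the dynamically indexed array is typically spilled to slow local memory, while the scalar version can stay in registers:

```cuda
// May be spilled to local memory, because 'idx' is not known
// at compile time:
__device__ float pickArray(int idx)
{
    float a[3] = {1.0f, 2.0f, 3.0f};
    return a[idx];
}

// Register-friendly alternative with explicit scalars:
__device__ float pickScalar(int idx)
{
    float a0 = 1.0f, a1 = 2.0f, a2 = 3.0f;
    return idx == 0 ? a0 : (idx == 1 ? a1 : a2);
}
```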

5.) Write simple and small kernels. Kernel launch cost is negligible (about 5 µs).

If you have one large kernel, try to split it up into multiple small ones; it might be faster due to fewer registers being used. Small kernels each get more resources (registers, shared memory, constant memory, etc.) because these limits apply per kernel.

6.) Texture reads are cached, whereas plain global memory reads are not. Use textures to store your data where possible.
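A minimal sketch using the texture reference API of that era; the 3-point averaging kernel is just an illustrative access pattern that benefits from the texture cache's spatial locality:

```cuda
// Sketch: reading input through the texture cache instead of
// plain global loads.
texture<float, cudaTextureType1D, cudaReadModeElementType> texIn;

__global__ void smooth(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (tex1Dfetch(texIn, i - 1) +
                  tex1Dfetch(texIn, i) +
                  tex1Dfetch(texIn, i + 1)) / 3.0f;
}

// Host side, before the launch: bind the device array to the texture.
// cudaBindTexture(0, texIn, d_in, n * sizeof(float));
```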

7.) Prevent threads from diverging: conditional jumps should branch the same way for all threads of a warp. Try to make conditional branching depend on multiples of the warp size.
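A sketch contrasting the two cases; in the first kernel each warp executes both branches serially, in the second every warp takes a single path:

```cuda
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Bad: even and odd threads of the same warp take different
    // paths, so the warp serializes over both branches.
    if (i % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}

__global__ void warpAligned(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Better: the condition changes only at warp-size (32)
    // boundaries, so all threads of a warp agree.
    if ((i / 32) % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}
```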

8.) Avoid loops which are run by only a minority of threads while the others are idle. All the threads in a block wait for every thread in that block to finish.

9.) Use fast math routines where possible.

- e.g. __mul24(), __sinf(), __expf(), etc.
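A sketch of these intrinsics in use; they trade accuracy for speed compared to the precise library versions:

```cuda
// Sketch: fast intrinsic math versus the precise library calls.
__global__ void wave(float *out, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = __mul24(idx[i], 3);      // fast 24-bit integer multiply
        out[i] = __sinf(0.01f * j)       // fast, lower-precision sine
               * __expf(-0.001f * j);    // fast exponential
    }
}
```

Compiling with `nvcc -use_fast_math` substitutes the fast intrinsics for sinf(), expf(), etc. throughout the file.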

10.) A complex calculation is often faster than a large lookup table, so recalculation can be better than caching. Remember, memory transfers can be slow.

11.) Writing your own cache manager that uses shared memory for caching might not be an advantage.

12.) For the fastest global memory access, make your global memory accesses coalesced.
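A sketch of coalesced versus strided access; in the first kernel a warp's loads combine into few memory transactions, in the second each warp touches widely separated addresses:

```cuda
__global__ void coalesced(float *out, const float *in, int n)
{
    // Consecutive threads read consecutive addresses -> the warp's
    // loads coalesce into a small number of transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

__global__ void strided(float *out, const float *in, int n, int stride)
{
    // Consecutive threads read addresses 'stride' apart -> many
    // transactions per warp; effective bandwidth drops sharply.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```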

13.) Try to avoid multiple threads accessing the same memory element, so that no bank conflict is possible on shared and constant memory.

14.) Try to avoid bank conflicts when reading memory. To achieve high memory bandwidth, shared and constant memory are divided into equally sized memory modules, called banks.

Each thread of a half-warp should access elements in different banks to avoid bank conflicts.
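The classic fix is padding a shared-memory tile by one element per row. A sketch of a tile transpose (it assumes, for brevity, that the matrix width is a multiple of the tile size):

```cuda
#define TILE 32

__global__ void transposeTile(float *out, const float *in, int width)
{
    // The +1 padding shifts each row by one bank, so reading a
    // column (tile[threadIdx.x][threadIdx.y] below) hits different
    // banks instead of the same one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
```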

15.) Small lookup tables can be stored in shared memory. If such a table is used by all threads of the same block, caching it in shared memory makes access much faster.

16.) Experiment with the number of parallel threads to find the optimum. Experimentation often yields the best performance improvements in CUDA, e.g. try different thread counts per block (192, 256, etc.).

17.) Use parallelism efficiently: partition your computation so that all GPU multiprocessors are kept equally busy.

18.) Keep resource usage low. Resource usage should be low enough to support multiple active thread blocks per multiprocessor.

 

19.) Rule of thumb for grid/block size:

# of blocks > # of multiprocessors

- So all multiprocessors have at least one block to execute.

# of blocks > 2 * # of multiprocessors

- So multiple blocks can run concurrently on a multiprocessor.
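The rule of thumb above can be checked at runtime with a small host-side sketch (`myKernel` and `d_data` are hypothetical names):

```cuda
// Sketch: choosing a grid that keeps every multiprocessor busy.
int threadsPerBlock = 256;   // a common starting point; see tip 16
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// Aim for at least 2 blocks per multiprocessor:
if (blocks < 2 * prop.multiProcessorCount)
    printf("warning: only %d blocks for %d SMs\n",
           blocks, prop.multiProcessorCount);

myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
```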

 

20.) Compile with the --ptxas-options=-v flag (or -Xptxas -v).

This makes the compiler print per-kernel resource usage at compile time, e.g.

- Number of registers used per thread

- Shared memory per block

- Constant memory usage

- Local memory usage (register spills)



Author: Abhinav Kumar
