Reading Notes on "Learning CUDA Programming" (Part 3)

CUDA occupancy:

Usually equal to: Active Warps per Multiprocessor / Max Warps per Multiprocessor. The numerator is determined jointly by the user's kernel and the GPU hardware; the denominator depends entirely on the GPU hardware.

The higher the occupancy (the closer to 100%), the more threads are resident ("active") on each SM (the number actually executing at any instant is bounded by the SM's core count, far fewer than the active threads), and the more likely memory latency can be hidden. (Not necessarily, though: if the kernel's threads do little memory I/O, a lower occupancy barely affects how busy the compute units are; and if they hammer memory, even 100% occupancy still leaves processors idle, i.e., all threads stalled on memory accesses.)

CUDA ships an Excel spreadsheet (the CUDA Occupancy Calculator): fill in the GPU model, the registers used per kernel thread, the shared memory used per block (both can be read from nvcc's output when compiling with --ptxas-options=-v), and the number of threads per block, and the spreadsheet reports the occupancy and related information.

Occupancy is generally limited by three kinds of constraints:
1. SM concurrency limits: maximum resident threads per SM, maximum resident warps per SM, and maximum resident blocks per SM (several blocks can run concurrently on the same SM)
2. the shared memory budget
3. the register budget

The __launch_bounds__ function qualifier and the --maxrregcount compiler option steer the compiler's register allocation toward higher execution efficiency. (Squeezing registers too aggressively makes them spill to device memory, so manual tuning is tricky; see the sketch below.)
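A minimal sketch of the qualifier, with a made-up kernel name; the two arguments are the maximum threads per block and the desired minimum resident blocks per SM:

```cuda
// Promise the compiler this kernel is launched with at most 256 threads
// per block, and ask for at least 4 resident blocks per SM; the compiler
// caps per-thread register use to make that occupancy achievable.
__global__ void __launch_bounds__(256, 4)
scale_kernel(float *out, const float *in, int n)  // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}
```

--maxrregcount applies the same kind of cap globally to every kernel in the compilation unit, e.g. nvcc --maxrregcount=32 foo.cu.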

The Achieved Occupancy shown in the profiler is the real value observed at run time (the calculator only gives the theoretical bound).

Parallelizing the reduce operation on the GPU:

1. Global memory version: sum over multiple rounds, launching one kernel per round and touching only device memory. The reason a single kernel with an internal loop does not work is that there is no way to synchronize the whole grid inside a kernel;

2. Shared memory version: in each round, every block first loads its slice into its own shared memory, syncs, then accumulates in a loop (with a sync inside the loop), and finally writes the block's result to position [blockIdx.x] of a device array; the number of blocks in the next round is determined by the length of the array that remains to be summed (see the sketch after this list);

3. Replacing the modulo (%) operation with bit operations gives a modest speedup;

(4. Seen in a blog post: use an atomic operation to accumulate each block's result into a device variable, so a single kernel suffices and no further rounds are needed.)
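A minimal sketch combining variants 2 and 4, assuming blockDim.x is a power of two; the kernel name is made up, and *result must be zeroed before the launch:

```cuda
#include <cuda_runtime.h>

// Each block reduces its slice in shared memory, then one atomicAdd per
// block folds the partial sum into the global result: one kernel, no rounds.
__global__ void reduce_sum(const float *in, float *result, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Sequential addressing: "tid < s" replaces the modulo test with a
    // comparison, and the active threads stay packed into whole warps.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(result, sdata[0]);   // variant 4: no extra rounds needed
}
```

Launched as reduce_sum<<<gridSize, blockSize, blockSize * sizeof(float)>>>(d_in, d_result, n); the third launch parameter supplies the dynamic shared memory.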

Throughout the optimization process, nvprof plays the guiding role!

Strategies for reducing warp divergence:

1. Have any single warp execute only one branch; spread multiple branches across multiple warps (see the sketch after this list);

2. Merge branches so there are fewer of them;

3. Reduce the amount of work inside each branch;

4. Change the data layout (e.g., transpose, coalesce, ...);

5. Partition the group (the Cooperative Groups API) using tiled_partition.
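A minimal sketch of strategy 1, with made-up kernel names; branching on the warp index instead of the thread index removes intra-warp divergence:

```cuda
// Divergent: even and odd lanes of the SAME warp take different paths,
// so each warp executes both branches one after the other.
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) out[i] = 1.0f;
    else            out[i] = 2.0f;
}

// Warp-uniform: all 32 lanes of a given warp take the same path,
// so no warp ever executes both branches.
__global__ void warp_uniform(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0) out[i] = 1.0f;
    else                         out[i] = 2.0f;
}
```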

The reduction (summation) example:

- Within one warp, some threads work while others sit idle: inefficient (presumably because the idle lanes still occupy the warp's execution slots);

- Some warps work while other warps sit idle (e.g., parked at __syncthreads): efficient (presumably because an idle warp yields its execution units);

- Changing "add adjacent pairs" into "add element i of the first half to element i of the second half": why is this faster? With the adjacent (interleaved) scheme the active threads are scattered across every warp, so every warp stays partially active, and the strided shared-memory accesses cause bank conflicts; with the first-half/second-half (sequential) scheme the active threads are contiguous, whole warps go idle early, and the accesses are conflict-free.

- Grid-stride loops: each thread loads and accumulates several elements instead of one, which is faster than one element per thread. The main reason, I think, is that each thread's useful work grows relative to the kernel-launch overhead; a secondary reason is that the thread and block counts drop to a sensible level, since too many of them actually hurts performance. The book uses a CUDA API (cudaOccupancyMaxActiveBlocksPerMultiprocessor) to obtain the optimal number of blocks per SM automatically (inputs: the kernel, the threads per block, and the shared memory per block); total blocks = number of SMs * that optimum (see the sketch below).
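A minimal sketch of the grid-stride loop plus the occupancy-based launch, with made-up names; a per-block shared-memory stage (as above) would reduce the atomic traffic further:

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread sums every (gridDim.x * blockDim.x)-th
// element, so a fixed-size grid covers an input of any length.
__global__ void reduce_grid_stride(const float *in, float *result, int n)
{
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];
    atomicAdd(result, sum);
}

// Size the grid from the occupancy API: total blocks = SM count *
// optimal resident blocks per SM for this kernel and block size.
void launch_reduce(const float *d_in, float *d_result, int n)
{
    int threads = 256, blocksPerSm = 0, numSms = 0;
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, 0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, reduce_grid_stride, threads, /*dynamicSmem=*/0);
    reduce_grid_stride<<<numSms * blocksPerSm, threads>>>(d_in, d_result, n);
}
```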

Older CUDA hardware had two levels of synchronization: implicit warp-level sync (lockstep execution) and explicit block-level sync (with __syncthreads() as the single API).

Newer CUDA hardware supports explicit warp-level synchronization primitives (three groups of them: __syncwarp(), the vote functions, and the shuffle functions, all carrying a _sync suffix), and every thread gets its own program counter and stack.

The __shfl_down_sync function shifts a register variable down across the lanes of a warp (each participating thread reads the value held by the lane a fixed offset above it); the mask controls which lanes of the warp participate, and the lanes synchronize before continuing. Applied to the reduction example, it eliminates the shared memory stage and works directly on registers, which is faster (see the sketch below).
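A minimal sketch of a warp-level reduction built on __shfl_down_sync; the helper name is made up:

```cuda
// Collapse a warp's 32 partial sums into lane 0 in five shuffle steps,
// entirely in registers; 0xffffffff means all 32 lanes participate.
__inline__ __device__ float warp_reduce_sum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // only lane 0 ends up holding the full warp sum
}
```

Each warp's lane 0 can then write its sum to shared memory or atomicAdd it into the global result.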

Cooperative Groups:

Synchronizing within a thread_group avoids synchronizing the whole block and is therefore more efficient; a group can also be split into smaller groups;

this_thread_block() returns the current block, whose member functions sync(), group_index(), thread_index(), etc. correspond to __syncthreads(), blockIdx, and threadIdx;

Calling __syncthreads() inside an if...else easily deadlocks; in my opinion the book's block.sync() example has the same problem and still deadlocks;

A well-known blog explains this more clearly: cg::tiled_partition(this_thread_block(), 4) splits the block into smaller thread_groups, and coalesced_threads() returns a group made of the currently active threads of the warp, which can then be synced safely (see the sketch below).
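A minimal sketch of both calls, with a made-up kernel name:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cg_demo(int *out)   // hypothetical kernel
{
    cg::thread_block block = cg::this_thread_block();

    // Split the block into 32-thread tiles; tile.sync() synchronizes
    // only this tile, not the whole block.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
    tile.sync();

    if (block.thread_rank() % 2 == 0) {
        // Inside the divergent branch, coalesced_threads() captures exactly
        // the active lanes, so active.sync() cannot deadlock the way a
        // divergent __syncthreads() can.
        cg::coalesced_group active = cg::coalesced_threads();
        active.sync();
        out[block.thread_rank()] = active.size();
    }
}
```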

The loop-unrolling trick:

Use #pragma unroll to hint that the compiler should unroll the loop, sparing you from copy-pasting the body by hand (see the sketch below);

Drawbacks: it may consume more registers and lower the occupancy, and the larger code size may raise the instruction cache miss rate;
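A minimal sketch, with a made-up kernel name; a compile-time trip count lets the compiler replace the loop with straight-line code:

```cuda
// Fully unrolled: the 8-iteration loop becomes eight independent
// statements, removing the loop counter and branch overhead.
__global__ void scale8(float *data)   // hypothetical kernel
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        data[base + k] *= 2.0f;
}
```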

Atomic operations such as atomicAdd are typically used to fold each block's result into the global total; because the number of blocks is modest, the contention is mild. If instead every thread atomicAdds its own value into the global total from the start, it becomes extremely slow, because the computation has been serialized.

Benefits of low-precision computation: 1. it saves memory bandwidth and memory footprint; 2. it speeds up the arithmetic;

Typically the computation is done in half precision and the result is stored in single precision;

Half-precision computation uses dedicated intrinsics (e.g., __hmul); there are also the two-wide vector types half2 and float2, with instructions such as __hmul2 that process two values at once (see the sketch below);
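A minimal sketch, with a made-up kernel name, assuming a device of compute capability 5.3 or higher (required for device-side half arithmetic):

```cuda
#include <cuda_fp16.h>

// Two FP16 multiplies per __hmul2; the sum is accumulated in FP32,
// matching the "compute in half, store the result in single" pattern.
__global__ void hmul2_demo(const half2 *a, const half2 *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        half2 p = __hmul2(a[i], b[i]);              // two products at once
        out[i] = __low2float(p) + __high2float(p);  // accumulate in FP32
    }
}
```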

The DP4A instruction: SIMD-style, it performs a dot product over four INT8 pairs (plus an accumulator) in a single instruction;
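A minimal sketch of the corresponding intrinsic __dp4a (available from compute capability 6.1), with a made-up kernel name; each int operand is treated as four packed 8-bit values:

```cuda
// out[i] = a0*b0 + a1*b1 + a2*b2 + a3*b3 + 0, where a0..a3 and b0..b3
// are the four signed 8-bit lanes packed into each 32-bit int.
__global__ void dp4a_demo(const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __dp4a(a[i], b[i], 0);
}
```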

When the task is memory-bound, the relative speed of FP32, FP16, and INT8 is roughly 1:2:4, simply tracking the element sizes of 4, 2, and 1 bytes.
