"Learning CUDA Programming" Reading Notes (Part 2)

Common performance bottlenecks in GPU programming:

Bandwidth at each stage (note: the L2 cache is shared by all SMs on the GPU)

GPU memory types:

Global Memory:

Coalesced memory access reduces both the number of memory transactions and the number of cache misses, which is why it is fast.

Principle: if the 32 threads of a warp access floats that are adjacent and properly aligned, they touch one contiguous 128-byte segment, so a single memory transaction brings all 128 bytes into cache and registers (or the data is reused if it is already in L1 or L2). If the 32 floats are scattered, 32 separate transactions each fetch a full 128-byte cache line, which both pollutes the cache and wastes bandwidth.
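The contrast above can be sketched as two copy kernels; kernel and parameter names are illustrative, not from the book:

```cuda
// Coalesced: lane i of a warp reads element i, so lanes 0..31 touch
// one aligned 128-byte segment → one memory transaction per warp.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: with a large stride each lane lands on a different 128-byte
// cache line → up to 32 transactions per warp for the same 32 floats.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```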

CPU code usually uses Array of Structures (AoS); GPU code usually uses Structure of Arrays (SoA), i.e. one struct containing several arrays (one array per field), which keeps the GPU's accesses coalesced. The book's RGB image-processing example illustrates this well, and again compares the performance of both layouts with Visual Profiler.
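A minimal sketch of the two layouts (types and the kernel are illustrative, not the book's code):

```cuda
// AoS, common on CPUs: one struct per pixel. Thread i reading p[i].r
// skips over g and b, so a warp's reads are strided, not coalesced.
struct PixelAoS { unsigned char r, g, b; };

// SoA, preferred on GPUs: one array per field. Adjacent threads reading
// r[i] touch adjacent bytes, so the access is coalesced.
struct ImageSoA {
    unsigned char *r;
    unsigned char *g;
    unsigned char *b;
};

__global__ void brighten(ImageSoA img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img.r[i] = min(img.r[i] + 10, 255);  // coalesced read & write
}
```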

Shared Memory:

A user-managed cache.

Uses: 1. holding data that is reused repeatedly; 2. sharing/exchanging data between threads.

Example: matrix transpose. Using global memory alone, either the read or the write must be uncoalesced. Solution: load a tile from global memory into shared memory, transpose it inside shared memory (with a sync in between), then write it back to global memory; this makes both the global read and the global write coalesced. Shared memory has no penalty for non-contiguous access, but bank conflicts must be avoided by staggering the banks (implemented with padding: one extra float at the end of each row is enough).
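A sketch of that transpose, assuming a square matrix whose side is a multiple of the tile size and a 32×32 thread block; `TILE` and the kernel name are my own:

```cuda
#define TILE 32

__global__ void transpose(const float *in, float *out, int width) {
    // +1 pad: row stride becomes 33 floats, so a column of the tile
    // falls into 32 different banks → no bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    // Swap the block coordinates so the write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```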

Each bank can serve only one float read or write per cycle; so to raise bandwidth, the architecture lets many banks be accessed concurrently.

The threads of a warp only need to hit different banks; the addresses do not need to be contiguous.

Texture Memory:

Used for 2D/3D access patterns such as image scaling inside a kernel. It is read-only from the GPU's point of view; the access-pattern details are hidden behind the API and invisible to the user.

Doing the same thing directly through global memory instead would be inefficient because of uncoalesced accesses.

The texture memory access API has built-in linear interpolation.
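A sketch of an image-resize kernel using that built-in interpolation. It assumes a `cudaTextureObject_t` was created elsewhere with `cudaFilterModeLinear` and unnormalized coordinates; the kernel name and parameters are illustrative:

```cuda
// With linear filtering enabled, tex2D interpolates between the
// 4 nearest texels in hardware — no manual bilinear math needed.
__global__ void resize(cudaTextureObject_t tex, float *out,
                       int outW, int outH, float scale) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= outW || y >= outH) return;

    // +0.5f targets texel centers in the source image.
    out[y * outW + x] = tex2D<float>(tex, (x + 0.5f) * scale,
                                          (y + 0.5f) * scale);
}
```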

Register:

Local variables in a kernel and compiler-generated intermediates are placed in registers first; when registers run out, they go to cache or local memory instead, which is called register spilling.

An SM has a limited number of registers, so keep the kernel's local variables few, don't make the thread count per block too large, or split the work into multiple kernels.
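One way to act on this (not from the book) is `__launch_bounds__`, which tells the compiler to cap register usage so more blocks fit on an SM; the numbers below are illustrative:

```cuda
// ≤256 threads per block, aim for at least 4 resident blocks per SM.
// The compiler limits registers per thread to meet this, possibly
// spilling; check actual usage with: nvcc --ptxas-options=-v
__global__ void __launch_bounds__(256, 4)
heavy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}
```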

Pinned memory:

Ways to deal with CPU-GPU transfers being slower than execution: 1. transfer as little as possible, even if that means the GPU does a bit of CPU work or the CPU does a bit of GPU work; 2. use pinned memory to increase bandwidth; 3. avoid many small transfers — batch them into larger ones, since many small ones incur repeated CUDA API call latency; 4. use asynchronous transfers overlapped with kernel execution.
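Points 2 and 4 combined might look like the following sketch: pinned host memory plus per-chunk streams so copies for one chunk overlap with the kernel for another. Sizes, names, and the chunk count are illustrative:

```cuda
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22, CHUNKS = 4, chunk = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned: DMA reads it directly
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[CHUNKS];
    for (int i = 0; i < CHUNKS; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < CHUNKS; ++i) {
        int off = i * chunk;
        // Copy chunk i on stream i; the kernel for chunk i starts as soon
        // as its copy lands, overlapping with the copy of chunk i+1.
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
    }
    cudaDeviceSynchronize();

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```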

If the CPU allocates with plain malloc, then on every transfer CUDA creates a temporary pinned buffer in host memory, first copies the data H2H into the buffer, then copies it H2D from the buffer to the device. Two costs: 1. allocating, filling, and freeing the temporary buffer; 2. if the original data has been paged out to disk, swapping it back into host memory takes time.

Experiment: the larger the transfer, the smaller the gap between pinned and pageable memory. The book's explanation is that the driver and the DMA engine start overlapping (my understanding: the transfer is split into small chunks so the H2H and H2D copies get pipelined!).

At present, NVLink between CPU and GPU is available only on IBM Power CPU machines.

Unified Memory:

cudaMallocManaged; the CPU and GPU share one address space. Whichever side touches a page first gets the physical memory allocated on its side; a later toucher must copy the page over from the first.

1. Initialize on the CPU (first touch), then launch the compute kernel on the GPU: the GPU hits page faults, allocates device memory, then copies the pages from host memory to the device.

2a. First launch a GPU init kernel (first touch), then the compute kernel: this saves the copy of case 1, but the number of page faults is still high.

3a. In the GPU init kernel, let each warp handle one page (64KB): reading x triggers half of the page faults at once, then reading y triggers the other half, collapsing the many batches of faults in the previous case into 2 batches; the init kernel becomes twice as fast.
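The warp-per-page initialization might be sketched like this (kernel name and values are my own; it assumes 64KB pages, i.e. 16K floats per page):

```cuda
// Each warp initializes one 64KB page. All 32 lanes of a warp stride
// through the same page, so the whole warp faults on that page together
// instead of scattering faults across many pages.
__global__ void init_warp_per_page(float *x, float *y, size_t n) {
    const size_t floats_per_page = 64 * 1024 / sizeof(float);  // 16384
    size_t warp_id = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) / warpSize;
    size_t lane    = threadIdx.x % warpSize;

    size_t base = warp_id * floats_per_page;
    size_t end  = base + floats_per_page;
    if (end > n) end = n;

    for (size_t i = base + lane; i < end; i += warpSize) {
        x[i] = 1.0f;  // first batch of faults: all pages of x at once
        y[i] = 2.0f;  // second batch: all pages of y
    }
}
```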

2b. Use cudaMemPrefetchAsync to tell CUDA to prefetch (either to device memory or to host memory).
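A minimal sketch of the prefetch pattern (kernel and sizes are illustrative):

```cuda
__global__ void axpy(float *x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * x[i] + 1.0f;
}

int main() {
    const size_t N = 1 << 22;
    float *x;
    cudaMallocManaged(&x, N * sizeof(float));

    for (size_t i = 0; i < N; ++i) x[i] = 1.0f;       // first touch on CPU

    int dev = 0;
    cudaMemPrefetchAsync(x, N * sizeof(float), dev);   // migrate pages up front
    axpy<<<(N + 255) / 256, 256>>>(x, N);              // no faults in the kernel

    // Bring the results back before the CPU reads them.
    cudaMemPrefetchAsync(x, N * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```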

cudaMemAdvise can give the driver even more hints (a deep topic).

Trend: L1 caches keep getting larger, with latency and bandwidth approaching those of shared memory, so that less expert programmers can avoid the complexity of explicit shared-memory programming.
