CUDA Study Notes (15)

Multi-Device Systems:
Device Enumeration:
A host system can have multiple devices. The following code sample shows how to enumerate these devices, query their properties, and determine the number of CUDA-enabled devices:

int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
for (device = 0; device < deviceCount; ++device) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("Device %d has compute capability %d.%d.\n",
        device, deviceProp.major, deviceProp.minor);
}

Device Selection:
A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.
The following code sample illustrates how setting the current device affects memory allocation and kernel execution:

size_t size = 1024 * sizeof(float);
cudaSetDevice(0); // Set device 0 as current
float* p0;
cudaMalloc(&p0, size); // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0); // Launch kernel on device 0
cudaSetDevice(1); // Set device 1 as current
float* p1;
cudaMalloc(&p1, size); // Allocate memory on device 1
MyKernel<<<1000, 128>>>(p1); // Launch kernel on device 1

Stream and Event Behavior:
A kernel launch will fail if it is issued to a stream that is not associated with the current device, as illustrated in the following code sample:

cudaSetDevice(0); // Set device 0 as current
cudaStream_t s0;
cudaStreamCreate(&s0); // Create stream s0 on device 0
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 0 in s0
cudaSetDevice(1); // Set device 1 as current
cudaStream_t s1;
cudaStreamCreate(&s1); // Create stream s1 on device 1
MyKernel<<<100, 64, 0, s1>>>(); // Launch kernel on device 1 in s1
// This kernel launch will fail:
MyKernel<<<100, 64, 0, s0>>>(); // Fails: s0 belongs to device 0, but device 1 is current

A memory copy will succeed even if it is issued to a stream that is not associated with the current device.
cudaEventRecord() will fail if the input event and input stream are associated with different devices.
cudaEventElapsedTime() will fail if the two input events are associated with different devices.
cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated with a device that is different from the current device.
cudaStreamWaitEvent() will succeed even if the input stream and input event are associated with different devices; cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.
Each device has its own default stream (see Default Stream), so commands issued to the default stream of one device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.
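The cudaStreamWaitEvent() rule above is what makes cross-device synchronization possible without blocking the host. A minimal sketch under the same assumptions as the earlier samples (at least two devices and a kernel MyKernel; error checking omitted):

```cuda
cudaSetDevice(0);
cudaStream_t s0;
cudaStreamCreate(&s0);
cudaEvent_t e0;
cudaEventCreate(&e0);
MyKernel<<<100, 64, 0, s0>>>();  // Work on device 0
cudaEventRecord(e0, s0);         // e0 completes when device 0's work is done

cudaSetDevice(1);
cudaStream_t s1;
cudaStreamCreate(&s1);
cudaStreamWaitEvent(s1, e0, 0);  // Succeeds even though e0 belongs to device 0
MyKernel<<<100, 64, 0, s1>>>();  // Starts only after device 0's work completes
```

The wait is inserted into s1 on the device side, so the host thread never blocks; all ordering is enforced by the hardware/driver.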
Peer-to-Peer Memory Access:
When the application is run as a 64-bit process, Tesla-series devices of compute capability 2.0 and higher can address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.
Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess(), as illustrated in the following code sample. Each device can support a maximum of eight peer connections.
A unified address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices, as shown in the code sample below:

cudaSetDevice(0); // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size); // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0); // Launch kernel on device 0
cudaSetDevice(1); // Set device 1 as current
cudaDeviceEnablePeerAccess(0, 0); // Enable peer-to-peer access
// with device 0
// Launch kernel on device 1
// This kernel launch can access memory on device 0 at address p0
MyKernel<<<1000, 128>>>(p0);
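As the text notes, peer access is only supported when cudaDeviceCanAccessPeer() reports it, so a robust version of the sample would query first. A sketch of that check (assumes devices 0 and 1 exist):

```cuda
int canAccess01 = 0, canAccess10 = 0;
cudaDeviceCanAccessPeer(&canAccess01, 0, 1); // Can device 0 access device 1?
cudaDeviceCanAccessPeer(&canAccess10, 1, 0); // Can device 1 access device 0?
if (canAccess01 && canAccess10) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0); // Device 0 may now access device 1
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0); // Device 1 may now access device 0
}
```

Note that the enablement is one-directional: each call grants the current device access to the named peer, so both calls are needed for access in both directions.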

Peer-to-Peer Memory Copy:
Memory copies can be performed between the memories of two different devices. When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory.
Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync(), as illustrated in the following code sample:

cudaSetDevice(0); // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size); // Allocate memory on device 0
cudaSetDevice(1); // Set device 1 as current
float* p1;
cudaMalloc(&p1, size); // Allocate memory on device 1
cudaSetDevice(0); // Set device 0 as current
MyKernel<<<1000, 128>>>(p0); // Launch kernel on device 0
cudaSetDevice(1); // Set device 1 as current
cudaMemcpyPeer(p1, 1, p0, 0, size); // Copy p0 to p1
MyKernel<<<1000, 128>>>(p1); // Launch kernel on device 1

A copy between the memories of two different devices (in the implicit NULL stream):

  • does not start until all commands previously issued to either device have completed, and
  • runs to completion before any commands issued after the copy to either device (see Asynchronous Concurrent Execution) can start.
    Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with copies or kernels in another stream.

Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess(), as described in Peer-to-Peer Memory Access, peer-to-peer memory copies between these two devices no longer need to be staged through the host and are therefore faster.
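To get the overlap described above, the copy can be issued with the asynchronous variant in a non-default stream. A sketch, assuming p0 (on device 0), p1 (on device 1), size, and MyKernel are set up as in the earlier sample:

```cuda
cudaSetDevice(1);
cudaStream_t copyStream;
cudaStreamCreate(&copyStream);
// Asynchronous peer copy: may overlap with work in other streams
cudaMemcpyPeerAsync(p1, 1, p0, 0, size, copyStream);
// In-stream ordering still holds: this kernel runs after the copy
MyKernel<<<1000, 128, 0, copyStream>>>(p1);
cudaStreamSynchronize(copyStream);
cudaStreamDestroy(copyStream);
```

Because the copy and the kernel are in the same stream, they stay ordered with respect to each other while remaining free to overlap with activity in any other stream.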

