How to copy memory between different GPUs in CUDA

Currently I'm working with two GTX 650s. My program resembles a simple client/server structure: I distribute the work threads across the two GPUs, and the server thread needs to gather the result vectors from the client threads, so I need to copy memory between the two GPUs. Unfortunately, the simple P2P program in the CUDA samples just doesn't work because my cards don't have TCC drivers. After spending two hours searching on Google and SO, I can't find the answer. Some sources say I should use cudaMemcpyPeer, and other sources say I should use cudaMemcpy with cudaMemcpyDefault. Is there some simple way to get my work done other than copying to the host and then copying to the device? I know it must be documented somewhere, but I can't find it. Thank you for your help.

Solution

Transferring data from one GPU to another will often require a "staging" through host memory. The exception to this is when the GPUs and the system topology support peer-to-peer (P2P) access and P2P has been explicitly enabled. In that case, data transfers can flow directly over the PCIE bus from one GPU to another.

In either case (with or without P2P being available/enabled), the typical CUDA runtime API call is cudaMemcpyPeer/cudaMemcpyPeerAsync, as demonstrated in the CUDA p2pBandwidthLatencyTest sample code.
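For reference, the call itself is the same whether or not P2P has been enabled; here is a minimal sketch (the pointer, device, size, and stream names are illustrative, not taken from the sample below):

// Copy 'bytes' bytes from d_src (allocated on device srcdev) to d_dst
// (allocated on device dstdev). If P2P is enabled the copy goes directly
// over PCIE; otherwise the runtime stages it through host memory.
cudaMemcpyPeer(d_dst, dstdev, d_src, srcdev, bytes);

// asynchronous variant, issued into a stream
cudaMemcpyPeerAsync(d_dst, dstdev, d_src, srcdev, bytes, stream);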

On Windows, one of the requirements for P2P is that both devices be supported by a driver in TCC mode. TCC mode is, for the most part, not an available option for GeForce GPUs (a recent exception is the GeForce Titan family, using the drivers and runtime available in the CUDA 7.5RC toolkit).

Therefore, on Windows, these GPUs will not be able to take advantage of direct P2P transfers. Nevertheless, a nearly identical sequence can be used to transfer data. The CUDA runtime will detect the nature of the transfer and allocate a staging buffer "under the hood". The transfer is then completed in two parts: a transfer from the originating device to the staging buffer, followed by a transfer from the staging buffer to the destination device.
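If you were to do that staging yourself, the equivalent manual sequence would look roughly like this sketch (the buffer and size names are mine; the worked example further down lets the runtime handle this for you):

// Manual staging through host memory, i.e. what the runtime does
// "under the hood" when P2P is not available (names are illustrative).
int *h_stage = (int *)malloc(bytes);
cudaSetDevice(srcdev);
cudaMemcpy(h_stage, d_src, bytes, cudaMemcpyDeviceToHost); // GPU 0 -> host
cudaSetDevice(dstdev);
cudaMemcpy(d_dst, h_stage, bytes, cudaMemcpyHostToDevice); // host -> GPU 1
free(h_stage);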

The following is a fully worked example showing how to transfer data from one GPU to another, while taking advantage of P2P access if it is available:

$ cat t850.cu
#include <stdio.h>
#include <stdlib.h>

#define SRC_DEV 0
#define DST_DEV 1
#define DSIZE (8*1048576)

#define cudaCheckErrors(msg) \
    do { \
      cudaError_t __err = cudaGetLastError(); \
      if (__err != cudaSuccess) { \
        fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
            msg, cudaGetErrorString(__err), \
            __FILE__, __LINE__); \
        fprintf(stderr, "*** FAILED - ABORTING\n"); \
        exit(1); \
      } \
    } while (0)

int main(int argc, char *argv[]){

  // passing any command-line argument disables P2P even when it is available
  int disablePeer = 0;
  if (argc > 1) disablePeer = 1;
  int devcount;
  cudaGetDeviceCount(&devcount);
  cudaCheckErrors("cuda failure");
  int srcdev = SRC_DEV;
  int dstdev = DST_DEV;
  if (devcount <= max(srcdev,dstdev)) {printf("not enough cuda devices for the requested operation\n"); return 1;}
  int *d_s, *d_d, *h;
  int dsize = DSIZE*sizeof(int);
  // host reference data
  h = (int *)malloc(dsize);
  if (h == NULL) {printf("malloc fail\n"); return 1;}
  for (int i = 0; i < DSIZE; i++) h[i] = i;
  // check whether P2P access between the two devices is possible
  int canAccessPeer = 0;
  if (!disablePeer) cudaDeviceCanAccessPeer(&canAccessPeer, srcdev, dstdev);
  // set up the source buffer on the source device
  cudaSetDevice(srcdev);
  cudaMalloc(&d_s, dsize);
  cudaMemcpy(d_s, h, dsize, cudaMemcpyHostToDevice);
  if (canAccessPeer) cudaDeviceEnablePeerAccess(dstdev,0);
  // set up the destination buffer on the destination device
  cudaSetDevice(dstdev);
  cudaMalloc(&d_d, dsize);
  cudaMemset(d_d, 0, dsize);
  if (canAccessPeer) cudaDeviceEnablePeerAccess(srcdev,0);
  cudaCheckErrors("cudaMalloc/cudaMemset fail");
  if (canAccessPeer) printf("Timing P2P transfer");
  else printf("Timing ordinary transfer");
  printf(" of %d bytes\n", dsize);
  // time the device-to-device transfer
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  cudaMemcpyPeer(d_d, dstdev, d_s, srcdev, dsize);
  cudaCheckErrors("cudaMemcpyPeer fail");
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  // copy the result back to the host and verify it
  cudaSetDevice(dstdev);
  cudaMemcpy(h, d_d, dsize, cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy fail");
  for (int i = 0; i < DSIZE; i++) if (h[i] != i) {printf("transfer failure\n"); return 1;}
  printf("transfer took %fms\n", et);
  return 0;
}
$ nvcc -arch=sm_20 -o t850 t850.cu
$ ./t850
Timing P2P transfer of 33554432 bytes
transfer took 5.135680ms
$ ./t850 disable
Timing ordinary transfer of 33554432 bytes
transfer took 7.274336ms
$

Notes:

Passing any command line parameter will disable the use of P2P even if it is available.

The above results are for a system where P2P access is possible, and both GPUs are connected via a PCIE Gen2 link, capable of roughly 6GB/s of transfer bandwidth in a single direction. The P2P transfer time is consistent with this (32MB / 5ms ~= 6GB/s). The non-P2P transfer takes longer, but not twice as long, because the two legs of the staged copy can overlap: once some data has landed in the staging buffer, the outgoing transfer to the destination device can begin while the rest is still arriving. The driver/runtime takes advantage of this to partially overlap the two transfers.
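To get a feel for that overlap, here is a rough sketch of how it could be done by hand: split the transfer into chunks, stage them through a pinned host buffer, and use one stream per device so the host-to-destination copy of one chunk runs while the next chunk is still coming off the source device. The chunk count, stream names, and staging buffer are my own choices (the d_s, d_d, and dsize names are reused from the sample above); the driver/runtime does something similar internally:

// Hand-rolled overlapped staging (illustrative sketch only).
// Assumes dsize is evenly divisible by NCHUNKS.
const int NCHUNKS = 4;
size_t chunk = dsize / NCHUNKS;
char *h_stage;
cudaHostAlloc((void **)&h_stage, dsize, cudaHostAllocDefault); // pinned staging buffer

cudaStream_t sIn, sOut;
cudaSetDevice(srcdev);  cudaStreamCreate(&sIn);
cudaSetDevice(dstdev);  cudaStreamCreate(&sOut);

for (int c = 0; c < NCHUNKS; c++){
  size_t off = c*chunk;
  cudaSetDevice(srcdev);
  cudaMemcpyAsync(h_stage + off, (char *)d_s + off, chunk,
                  cudaMemcpyDeviceToHost, sIn);        // GPU 0 -> staging
  cudaStreamSynchronize(sIn);                          // wait for this chunk
  cudaSetDevice(dstdev);
  cudaMemcpyAsync((char *)d_d + off, h_stage + off, chunk,
                  cudaMemcpyHostToDevice, sOut);       // staging -> GPU 1,
}                                                      // overlaps the next D2H copy
cudaSetDevice(dstdev);
cudaStreamSynchronize(sOut);                           // drain the last chunk
cudaFreeHost(h_stage);

The pinned allocation (cudaHostAlloc) matters here: cudaMemcpyAsync only overlaps with other copies when the host buffer is page-locked.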
