How to copy memory between different GPUs in CUDA

Currently I'm working with two GTX 650s. My program resembles a simple client/server structure: I distribute the work threads across the two GPUs, and the server thread needs to gather the result vectors from the client threads, so I need to copy memory between the two GPUs. Unfortunately, the simple P2P program in the CUDA samples just doesn't work because my cards don't have TCC drivers. After spending two hours searching on Google and SO, I can't find the answer. Some sources say I should use cudaMemcpyPeer, and other sources say I should use cudaMemcpy with cudaMemcpyDefault. Is there some simple way to get my work done other than copying to the host and then copying to the device? I know it must be documented somewhere, but I can't find it. Thank you for your help.

Solution

Transferring data from one GPU to another will often require a "staging" through host memory. The exception to this is when the GPUs and the system topology support peer-to-peer (P2P) access and P2P has been explicitly enabled. In that case, data transfers can flow directly over the PCIE bus from one GPU to another.

In either case (with or without P2P being available/enabled), the typical CUDA runtime API call is cudaMemcpyPeer/cudaMemcpyPeerAsync, as demonstrated in the CUDA p2pBandwidthLatencyTest sample code.
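For reference, the call itself is the same whether or not P2P has been enabled; here is a minimal sketch (the pointer, device, size, and stream names are illustrative, not taken from the sample below):

// Copy 'bytes' bytes from d_src (allocated on device srcdev) to d_dst
// (allocated on device dstdev). If P2P is enabled the copy goes directly
// over PCIE; otherwise the runtime stages it through host memory.
cudaMemcpyPeer(d_dst, dstdev, d_src, srcdev, bytes);

// asynchronous variant, issued into a stream
cudaMemcpyPeerAsync(d_dst, dstdev, d_src, srcdev, bytes, stream);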

On Windows, one of the requirements for P2P is that both devices be supported by a driver in TCC mode. TCC mode is, for the most part, not an available option for GeForce GPUs (a recent exception is the GeForce Titan family, using the drivers and runtime available in the CUDA 7.5RC toolkit).

Therefore, on Windows, these GPUs will not be able to take advantage of direct P2P transfers. Nevertheless, a nearly identical sequence can be used to transfer data. The CUDA runtime will detect the nature of the transfer and allocate a staging buffer "under the hood". The transfer is then completed in two parts: a transfer from the originating device to the staging buffer, followed by a transfer from the staging buffer to the destination device.
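If you were to do that staging yourself, the equivalent manual sequence would look roughly like this sketch (the buffer and size names are mine; the worked example further down lets the runtime handle this for you):

// Manual staging through host memory, i.e. what the runtime does
// "under the hood" when P2P is not available (names are illustrative).
int *h_stage = (int *)malloc(bytes);
cudaSetDevice(srcdev);
cudaMemcpy(h_stage, d_src, bytes, cudaMemcpyDeviceToHost); // GPU 0 -> host
cudaSetDevice(dstdev);
cudaMemcpy(d_dst, h_stage, bytes, cudaMemcpyHostToDevice); // host -> GPU 1
free(h_stage);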

The following is a fully worked example showing how to transfer data from one GPU to another, while taking advantage of P2P access if it is available:

$ cat t850.cu
#include <stdio.h>
#include <stdlib.h>

#define SRC_DEV 0
#define DST_DEV 1
#define DSIZE (8*1048576)

#define cudaCheckErrors(msg) \
    do { \
      cudaError_t __err = cudaGetLastError(); \
      if (__err != cudaSuccess) { \
        fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
            msg, cudaGetErrorString(__err), \
            __FILE__, __LINE__); \
        fprintf(stderr, "*** FAILED - ABORTING\n"); \
        exit(1); \
      } \
    } while (0)

int main(int argc, char *argv[]){

  // passing any command-line argument disables P2P even when it is available
  int disablePeer = 0;
  if (argc > 1) disablePeer = 1;
  int devcount;
  cudaGetDeviceCount(&devcount);
  cudaCheckErrors("cuda failure");
  int srcdev = SRC_DEV;
  int dstdev = DST_DEV;
  if (devcount <= max(srcdev,dstdev)) {printf("not enough cuda devices for the requested operation\n"); return 1;}
  int *d_s, *d_d, *h;
  int dsize = DSIZE*sizeof(int);
  // host reference data
  h = (int *)malloc(dsize);
  if (h == NULL) {printf("malloc fail\n"); return 1;}
  for (int i = 0; i < DSIZE; i++) h[i] = i;
  // check whether P2P access between the two devices is possible
  int canAccessPeer = 0;
  if (!disablePeer) cudaDeviceCanAccessPeer(&canAccessPeer, srcdev, dstdev);
  // set up the source buffer on the source device
  cudaSetDevice(srcdev);
  cudaMalloc(&d_s, dsize);
  cudaMemcpy(d_s, h, dsize, cudaMemcpyHostToDevice);
  if (canAccessPeer) cudaDeviceEnablePeerAccess(dstdev,0);
  // set up the destination buffer on the destination device
  cudaSetDevice(dstdev);
  cudaMalloc(&d_d, dsize);
  cudaMemset(d_d, 0, dsize);
  if (canAccessPeer) cudaDeviceEnablePeerAccess(srcdev,0);
  cudaCheckErrors("cudaMalloc/cudaMemset fail");
  if (canAccessPeer) printf("Timing P2P transfer");
  else printf("Timing ordinary transfer");
  printf(" of %d bytes\n", dsize);
  // time the device-to-device transfer
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  cudaMemcpyPeer(d_d, dstdev, d_s, srcdev, dsize);
  cudaCheckErrors("cudaMemcpyPeer fail");
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  // copy the result back to the host and verify it
  cudaSetDevice(dstdev);
  cudaMemcpy(h, d_d, dsize, cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy fail");
  for (int i = 0; i < DSIZE; i++) if (h[i] != i) {printf("transfer failure\n"); return 1;}
  printf("transfer took %fms\n", et);
  return 0;
}
$ nvcc -arch=sm_20 -o t850 t850.cu
$ ./t850
Timing P2P transfer of 33554432 bytes
transfer took 5.135680ms
$ ./t850 disable
Timing ordinary transfer of 33554432 bytes
transfer took 7.274336ms
$

Notes:

Passing any command line parameter will disable the use of P2P even if it is available.

The above results are for a system where P2P access is possible, and both GPUs are connected via a PCIE Gen2 link, capable of roughly 6GB/s of transfer bandwidth in a single direction. The P2P transfer time is consistent with this (32MB / 5ms ~= 6GB/s). The non-P2P transfer takes longer, but not twice as long, because the two legs of the staged copy can overlap: once some data has landed in the staging buffer, the outgoing transfer to the destination device can begin while the rest is still arriving. The driver/runtime takes advantage of this to partially overlap the two transfers.
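To get a feel for that overlap, here is a rough sketch of how it could be done by hand: split the transfer into chunks, stage them through a pinned host buffer, and use one stream per device so the host-to-destination copy of one chunk runs while the next chunk is still coming off the source device. The chunk count, stream names, and staging buffer are my own choices (the d_s, d_d, and dsize names are reused from the sample above); the driver/runtime does something similar internally:

// Hand-rolled overlapped staging (illustrative sketch only).
// Assumes dsize is evenly divisible by NCHUNKS.
const int NCHUNKS = 4;
size_t chunk = dsize / NCHUNKS;
char *h_stage;
cudaHostAlloc((void **)&h_stage, dsize, cudaHostAllocDefault); // pinned staging buffer

cudaStream_t sIn, sOut;
cudaSetDevice(srcdev);  cudaStreamCreate(&sIn);
cudaSetDevice(dstdev);  cudaStreamCreate(&sOut);

for (int c = 0; c < NCHUNKS; c++){
  size_t off = c*chunk;
  cudaSetDevice(srcdev);
  cudaMemcpyAsync(h_stage + off, (char *)d_s + off, chunk,
                  cudaMemcpyDeviceToHost, sIn);        // GPU 0 -> staging
  cudaStreamSynchronize(sIn);                          // wait for this chunk
  cudaSetDevice(dstdev);
  cudaMemcpyAsync((char *)d_d + off, h_stage + off, chunk,
                  cudaMemcpyHostToDevice, sOut);       // staging -> GPU 1,
}                                                      // overlaps the next D2H copy
cudaSetDevice(dstdev);
cudaStreamSynchronize(sOut);                           // drain the last chunk
cudaFreeHost(h_stage);

The pinned allocation (cudaHostAlloc) matters here: cudaMemcpyAsync only overlaps with other copies when the host buffer is page-locked.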
